Is your data lake open enough? What to watch out for

Like yesterday’s data warehouses, today’s data lakes threaten to lock us into proprietary formats and systems that restrict innovation and raise costs

A data lake is a system or repository that stores data in its raw format along with transformed, trusted data sets, and provides both programmatic and SQL-based access to this data for diverse analytics tasks such as data exploration, interactive analytics, and machine learning. The data stored in a data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and binary data (images, audio, video).

A central challenge with data lakes is avoiding lock-in to proprietary formats or systems. Lock-in restricts the ability to move data in and out for other uses or to process data using other tools, and can also tie a data lake to a single cloud environment. That’s why businesses should strive to build open data lakes, where data is stored in an open format and accessed through open, standards-based interfaces. Adherence to an open philosophy should permeate every aspect of the system, including data storage, data management, data processing, operations, data access, governance, and security.

An open format is one based on an underlying open standard, developed and shared through a public, community-driven process without vendor-specific proprietary extensions. For example, an open data format is a platform-independent, machine-readable data format, such as ORC or Parquet, whose specification is published to the community, such that any organization can create tools and applications to read data in the format.

A typical data lake has the following capabilities:

  • Data ingestion and storage
  • Data processing and support for continuous data engineering
  • Data access and consumption
  • Data governance including discoverability, security, and compliance
  • Infrastructure and operations

In the following sections, we will describe openness requirements for each capability.

Data ingestion and storage

An open data lake ingests data from sources such as applications, databases, data warehouses, and real-time streams. It formats and stores the data in an open data format, such as ORC or Parquet, that is platform-independent, machine-readable, optimized for fast access and analytics, and made available to consumers without restrictions that would impede the re-use of that information.

An open data lake supports both pull-based and push-based ingestion of data. It supports pull-based ingestion through batch data pipelines and push-based ingestion through stream processing. For both types of ingestion, an open data lake supports open standards such as SQL and Apache Spark for authoring data transformations. For batch data pipelines, it supports row-level inserts and updates (UPSERT) to data sets in the lake. Upsert capability with snapshot isolation, and more generally ACID semantics, greatly simplifies keeping data sets current, because the alternative is to rewrite data partitions or entire data sets.
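As a rough illustration of a row-level upsert expressed in SQL, the sketch below uses Python's built-in sqlite3 module as a stand-in for a lake table. The `customers` table, its columns, and the `upsert_customers` helper are hypothetical, and SQLite's `ON CONFLICT` clause is only one dialect's spelling of the operation; other engines express it with a `MERGE` statement.

```python
import sqlite3

def upsert_customers(conn, rows):
    """Insert new rows and update existing ones in a single transaction."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.executemany(
            """
            INSERT INTO customers (id, email) VALUES (?, ?)
            ON CONFLICT(id) DO UPDATE SET email = excluded.email
            """,
            rows,
        )

# In-memory database used purely as a stand-in for a lake table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

upsert_customers(conn, [(1, "a@example.com"), (2, "b@example.com")])
# Second batch updates id 2 and inserts id 3; id 1 is left untouched.
upsert_customers(conn, [(2, "b.new@example.com"), (3, "c@example.com")])

result = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
```

Only the rows named in the second batch are touched, which is the point of the capability: nothing else in the table has to be rewritten.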

The ingest capability of an open data lake ensures zero data loss, writes with exactly-once or at-least-once semantics, handles schema variability, writes data in an optimized format into the right partitions, and provides the ability to re-ingest data when needed.
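One common way to make at-least-once delivery safe is to make writes idempotent, so that a redelivered batch has no further effect. The sketch below is a simplified, in-memory illustration of that idea. The record fields `event_id` and `event_date`, the `ingest` function, and the `event_date=...` partition naming are all assumptions made for the example, not part of any particular product.

```python
from collections import defaultdict

def ingest(batch, partitions, seen_ids):
    """Apply a possibly redelivered batch; return how many records were written."""
    written = 0
    for record in batch:
        if record["event_id"] in seen_ids:
            continue  # duplicate from a retry: skip so the write stays idempotent
        # Route the record to a partition derived from its event date.
        partition_key = f"event_date={record['event_date']}"
        partitions[partition_key].append(record)
        seen_ids.add(record["event_id"])
        written += 1
    return written

partitions = defaultdict(list)  # partition key -> records (stand-in for files)
seen_ids = set()                # IDs already written (stand-in for a durable log)

batch = [
    {"event_id": "e1", "event_date": "2020-05-01", "value": 10},
    {"event_id": "e2", "event_date": "2020-05-02", "value": 20},
]
first = ingest(batch, partitions, seen_ids)
second = ingest(batch, partitions, seen_ids)  # simulated redelivery of the same batch
```

A production system would keep the set of written IDs (or an equivalent commit log) in durable storage; it is held in memory here only to keep the example short.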

Data processing and support for continuous data engineering

An open data lake stores the raw data from various data sources in a standardized open format. However, use cases such as data exploration, interactive analytics, and machine learning require that the raw data be processed to create use-case-driven trusted data sets. For data exploration and machine learning use cases, users continually refine data sets for their analysis needs. As a result, every data lake implementation should enable users to iterate between data engineering and use cases such as interactive analytics and machine learning. This can be thought of as continuous data engineering, which involves the interactive ability to author, monitor, and debug data pipelines. In an open data lake, these pipelines are authored using standard interfaces and open source tools such as SQL, Python, Apache Spark, and Apache Hive.

Data access and consumption

The most visible outcome of the data lake is the types of use cases it enables. Whether the use case is data exploration, interactive analytics, or machine learning, access to data is vital. Data can be accessed through SQL or through programming languages such as Python, R, and Scala. While SQL is the norm for interactive analysis, programming languages are used for more advanced applications like machine learning and deep learning.

An open data lake supports data access through a standards-based implementation of SQL with no proprietary extensions. It enables external tools to access that data through standards such as ODBC and JDBC. Also, an open data lake supports programmatic access to data via standard programming languages such as R, Python, and Scala, and standard libraries for numerical computation and machine learning, such as TensorFlow, Keras, PyTorch, Apache Spark MLlib, MXNet, and Scikit-learn.
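In Python, the DB-API 2.0 interface (PEP 249) plays a role comparable to ODBC and JDBC: code written against its connect, cursor, execute, and fetch calls carries over between drivers with few changes. The sketch below uses the built-in sqlite3 driver as a stand-in for a lake's SQL endpoint. The `page_views` table and the `top_pages` function are hypothetical, and the query sticks to plain SELECT, GROUP BY, and ORDER BY.

```python
import sqlite3

def top_pages(conn, limit):
    """Return up to `limit` (page, views) pairs, most viewed first."""
    cur = conn.cursor()
    cur.execute(
        "SELECT page, COUNT(*) AS views FROM page_views "
        "GROUP BY page ORDER BY views DESC, page ASC"
    )
    # fetchmany is part of the DB-API, so no dialect-specific LIMIT clause is needed.
    return cur.fetchmany(limit)

# In-memory database used purely as a stand-in for a lake's SQL endpoint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO page_views (page, user_id) VALUES (?, ?)",
    [("/home", 1), ("/home", 2), ("/docs", 1), ("/pricing", 3), ("/home", 3), ("/docs", 2)],
)

result = top_pages(conn, 2)
```

Swapping in another DB-API driver would mainly mean changing the `connect` call; parameter placeholder style is one detail that does vary between drivers.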

Data governance – discoverability, security, and compliance

When data ingestion and data access are implemented well, data can be made widely available to users in a democratized fashion. When multiple teams start accessing data, data architects need to exercise oversight for governance, security, and compliance purposes. 

Data discovery

Data itself is hard to find and comprehend and not always trustworthy. Users need the ability to discover and profile data sets for integrity before they can trust them for their own use case. A data catalog enriches metadata through different mechanisms, uses it to document data sets, and supports a search interface to aid discovery.

Since the first step is to discover the required data sets, it’s essential to surface metadata to end-users for exploration purposes, to see where the data resides and what it contains, and to determine if it is useful for answering a particular question. Discovery includes data profiling capabilities that support interactive previews of data sets to shine a light on formatting, standardization, labels, data shape, and so on.
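To make the idea of profiling concrete, here is a small sketch that summarizes each column of a data set by null count, distinct values, and observed types. The `profile` function and the sample rows are invented for illustration; the sketch assumes every row is a dictionary with hashable values, and a real catalog would compute far richer statistics.

```python
def profile(rows):
    """Return per-column null counts, distinct counts, and observed value types."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            stats = columns.setdefault(
                name, {"nulls": 0, "values": set(), "types": set()}
            )
            if value is None:
                stats["nulls"] += 1
            else:
                stats["values"].add(value)
                stats["types"].add(type(value).__name__)
    return {
        name: {
            "nulls": s["nulls"],
            "distinct": len(s["values"]),
            "types": sorted(s["types"]),
        }
        for name, s in columns.items()
    }

# Sample rows with deliberate quality problems: inconsistent casing in
# `country`, and a missing value plus a string-typed number in `age`.
rows = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "us", "age": None},
    {"id": 3, "country": "DE", "age": "41"},
]
report = profile(rows)
```

Here the report for `age` shows one null and two value types, and `country` shows three distinct values where a user might expect two, which are the kinds of formatting and standardization problems a preview is meant to surface.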

An open data lake should have an open metadata repository. As an example, the Apache Hive metadata repository (the Hive Metastore) is an open repository that prevents vendor lock-in for metadata.

Security

Increasing accessibility to the data requires data lakes to support strong access control and security features. To be open, a data lake should do this through non-proprietary security and access control APIs. As an example, deep integration with open source frameworks such as Apache Ranger and Apache Sentry can facilitate table-level, row-level, and column-level granular security. This enables administrators to grant permissions against already-defined user roles in enterprise directories such as Active Directory. By basing access control on open source frameworks, open data lakes avoid vendor lock-in that results from a proprietary security implementation.
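The effect of row-level and column-level policies can be pictured with a short conceptual sketch. The `Policy` class, the `apply_policy` function, and the role name below are hypothetical and are not the Apache Ranger or Apache Sentry API; in a real data lake the query engine enforces such policies rather than application code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet

@dataclass(frozen=True)
class Policy:
    allowed_columns: FrozenSet[str]     # column-level rule
    row_filter: Callable[[Dict], bool]  # row-level rule

def apply_policy(rows, policy):
    """Keep only the rows the policy admits, then drop columns it does not allow."""
    return [
        {k: v for k, v in row.items() if k in policy.allowed_columns}
        for row in rows
        if policy.row_filter(row)
    ]

# Hypothetical role-to-policy mapping; roles would normally come from a directory.
POLICIES = {
    "emea_analyst": Policy(
        allowed_columns=frozenset({"order_id", "region", "amount"}),
        row_filter=lambda row: row["region"] == "EMEA",
    ),
}

orders = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0, "card_number": "4111-xxxx"},
    {"order_id": 2, "region": "APAC", "amount": 80.0, "card_number": "5500-xxxx"},
]
visible = apply_policy(orders, POLICIES["emea_analyst"])
```

The analyst role sees only EMEA orders and never sees the `card_number` column, regardless of which tool issued the query.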

Compliance

New or expanded data privacy regulations, such as GDPR and CCPA, have created new requirements around the “Right to Erasure” and the “Right to Be Forgotten.” These govern consumers’ rights over their data and carry stiff financial penalties for non-compliance (as much as four percent of global annual turnover under GDPR), so they must not be overlooked. The ability to delete specific subsets of data without disrupting a data management process is therefore essential. An open data lake supports this ability through open formats and open metadata repositories. In this way, it enables a vendor-agnostic solution to compliance needs.
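A minimal sketch of such a targeted deletion follows, again using Python's built-in sqlite3 module as a stand-in for lake tables. The `events` and `profiles` tables, the `user_id` column, and the `erase_subject` helper are hypothetical. The deletes run in one transaction so that readers never observe a half-erased subject.

```python
import sqlite3

def erase_subject(conn, user_id, tables):
    """Delete every row for user_id from the given tables; return rows removed per table."""
    removed = {}
    with conn:  # one transaction: either every table is cleaned or none is
        for table in tables:
            # Table names must come from a trusted catalog, never from user input.
            cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
            removed[table] = cur.rowcount
    return removed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.execute("CREATE TABLE profiles (user_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "login"), (2, "login"), (1, "purchase")],
)
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?)",
    [(1, "a@example.com"), (2, "b@example.com")],
)

removed = erase_subject(conn, 1, ["events", "profiles"])
remaining_events = conn.execute("SELECT user_id, action FROM events").fetchall()
remaining_profiles = conn.execute("SELECT user_id, email FROM profiles").fetchall()
```

Data belonging to other users is untouched, and the per-table counts returned by the helper could serve as an audit record of the erasure.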

Infrastructure and operations

Each cloud provider has its own implementation for provisioning, configuring, monitoring, and managing a data lake and the resources it needs, and on-premises deployments differ again. An open data lake is cloud-agnostic and portable across any cloud-native environment, including public and private clouds. This allows administrators to draw on the benefits of both public and private clouds in terms of economics, security, governance, and agility.

Open for innovation

The increase in the volume, velocity, and variety of data, combined with new types of analytics and machine learning, makes data lakes a necessary complement to more traditional data warehouses. Data warehouses exist largely in a world of proprietary formats, proprietary SQL extensions, and proprietary metadata repositories, and lack programmatic access to data. Data lakes don’t need to follow this proprietary path, which leads to restricted innovation and higher costs. A well-designed, open data lake provides a robust, future-proof data management system that supports a wide range of data processing needs including data exploration, interactive analytics, and machine learning.

Ashish Thusoo is co-founder and chief executive officer of Qubole. Before co-founding Qubole, Ashish ran Facebook’s data infrastructure team. Under his leadership, the Facebook data infrastructure team built one of the largest data processing and analytics platforms in the world and created a host of tools, technologies, and templates that are used throughout the industry today.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Copyright © 2020 IDG Communications, Inc.