Categories: Coding

by Ivan Vankov

Data engineering is an ever-evolving field in which new concepts and ideas constantly emerge. Although the terms data lake, data warehouse, and data lakehouse are closely related, they denote different architectural approaches. As organizations depend on data more than ever before, understanding the differences between these architectural approaches is essential. In this post, we’ll break down the key architectural differences between these concepts. Let’s dive in!

Data Warehouse

The concept of the data warehouse emerged more than 50 years ago. Its main idea is to provide access to high-quality, trustworthy data that facilitates organization-level decision-making. Thus, the functions of a data warehouse include:

  • Data consolidation: data is collected from various sources, including operational databases, files, and external source systems.
  • Data storage: a relational database stores the data in a format suitable for efficient querying.
  • Data consumption: reports are built directly or through business intelligence tools for flexible reporting and analysis.

The high-level architecture:

A typical modern data warehouse may consist of the following components:

  • A tool for data extraction (Fivetran, Airbyte, Matillion, Stitch, and many more).
  • A relational database for the central storage layer (BigQuery, Snowflake, Redshift, and many more).
  • A tool for data transformation and delivery to the consumers (Dataform, dbt, Talend, and many more).
  • A tool for workflow orchestration (Airflow, Dagster, Digdag, and many more).
  • BI and reporting tools (Looker, Power BI, and many more).

The primary users of the data warehouse include top managers, marketing and sales specialists, and other decision-makers who determine the direction of the company’s business.
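The extract-load-transform flow through such a stack can be sketched in a few lines. This is a hypothetical, minimal example that uses sqlite3 as a stand-in for a warehouse database such as BigQuery or Snowflake; the table and column names are illustrative, not a real schema.

```python
import sqlite3

# Stand-in warehouse database (a real setup would connect to
# BigQuery, Snowflake, Redshift, etc.).
warehouse = sqlite3.connect(":memory:")

# Extract: rows pulled from an operational source (hard-coded here;
# a tool like Fivetran or Airbyte would do this step in practice).
source_rows = [
    ("2024-01-01", "EU", 120.0),
    ("2024-01-01", "US", 340.0),
    ("2024-01-02", "EU", 95.5),
]

# Load: land the raw data in a staging table, untransformed.
warehouse.execute(
    "CREATE TABLE stg_orders (order_date TEXT, region TEXT, amount REAL)"
)
warehouse.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

# Transform: materialize a reporting table with SQL, the way a tool
# like dbt or Dataform would build a model.
warehouse.execute(
    """CREATE TABLE rpt_daily_revenue AS
       SELECT order_date, SUM(amount) AS revenue
       FROM stg_orders
       GROUP BY order_date"""
)

for row in warehouse.execute(
    "SELECT * FROM rpt_daily_revenue ORDER BY order_date"
):
    print(row)  # ('2024-01-01', 460.0) then ('2024-01-02', 95.5)
```

An orchestrator such as Airflow would schedule these extract, load, and transform steps as separate tasks with dependencies between them.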

Data Lake

In the 2010s, new approaches to data analysis emerged and gained significant popularity, along with new roles: the data scientist and the data analyst. This kind of analysis did not require expensive data preparation and could be performed on raw, unstructured, or semi-structured data without a strict format. Around the same time, scalable data processing solutions such as Hadoop, Spark, and Kafka, as well as scalable analytical database management systems, were developed and popularized, unlocking the processing of significantly larger amounts of data than was possible with classical data warehouses.

Data lakes allow data to be stored in any format: CSV, JSON, or more efficient formats like Avro and Parquet. They impose no strict data quality requirements and offer no data consistency guarantees. As a result, to satisfy all data consumers, a data lake has often been introduced alongside a classical data warehouse:

In such an architecture, in addition to classical data sources, large amounts of data from devices or user-facing services are loaded into the data lake.
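Landing data in the lake typically means writing files into a partitioned layout on object storage. The following sketch illustrates the idea with local directories mimicking object-store paths like `raw/events/dt=2024-01-01/`; the directory names and event fields are illustrative assumptions, and production pipelines would write Avro or Parquet to cloud storage instead of JSON lines to disk.

```python
import json
import tempfile
from pathlib import Path

# Local stand-in for an object-store bucket.
lake_root = Path(tempfile.mkdtemp()) / "raw" / "events"

events = [
    {"dt": "2024-01-01", "user": "a", "action": "click"},
    {"dt": "2024-01-01", "user": "b", "action": "view"},
    {"dt": "2024-01-02", "user": "a", "action": "view"},
]

# Route each event to its date partition and append it as one
# JSON-lines record; readers can later prune whole partitions by date.
for event in events:
    partition = lake_root / f"dt={event['dt']}"
    partition.mkdir(parents=True, exist_ok=True)
    with open(partition / "part-0000.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

print(sorted(p.name for p in lake_root.iterdir()))
# ['dt=2024-01-01', 'dt=2024-01-02']
```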

Maintaining both a relational database and unstructured storage, which can contain a lot of duplicate data, is expensive, even though the use cases for the data remain different. It would be better to optimize this architecture.

Data Lakehouse

The further evolution of data lakes revealed problems with data consistency and data quality: a data lake could become a “data swamp”. Unlike a data warehouse built on a relational database, a data lake does not fully provide ACID (atomicity, consistency, isolation, durability) guarantees. For example, with a badly designed data pipeline, a failure during data loading or transformation can leave the old data unavailable while the new data has not yet been delivered successfully. Or, if the new data is delivered, an analyst or another data consumer may discover that the data schema has changed and the file structure is now inconsistent. In the absence of metadata, it may become unclear what the data structure even is.

To address these problems, new abstractions called table formats have been introduced. Apache Iceberg, Databricks' Delta Lake, and Apache Hudi are three popular table formats. These table formats are built on top of the most popular data lake file formats (Parquet, ORC, Avro) and provide ACID guarantees, transparent data schema evolution, metadata management, security, and many other features. All these features bring data lakes much closer to data warehouses.
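The core trick behind these atomicity guarantees can be illustrated in a few lines: readers only see data files listed in the current snapshot manifest, and a commit swaps the snapshot pointer atomically, so readers see either the old snapshot or the new one, never a half-written mix. This is a deliberately simplified sketch of the idea; the file names and manifest layout are assumptions, not the actual Iceberg or Delta Lake specifications.

```python
import json
import os
import tempfile
from pathlib import Path

# Local stand-in for a table's directory on object storage.
table_dir = Path(tempfile.mkdtemp())

def commit(data_files: list[str]) -> None:
    """Write a new manifest, then atomically repoint 'current' at it."""
    manifest = table_dir / f"manifest-{len(data_files)}.json"
    manifest.write_text(json.dumps({"data_files": data_files}))
    # Write the pointer to a temp file first, then rename it into
    # place; os.replace is atomic on the same filesystem, so readers
    # never observe a partially written pointer.
    tmp_pointer = table_dir / "current.tmp"
    tmp_pointer.write_text(manifest.name)
    os.replace(tmp_pointer, table_dir / "current")

def current_files() -> list[str]:
    """A reader resolves the pointer, then the manifest it names."""
    manifest_name = (table_dir / "current").read_text()
    return json.loads((table_dir / manifest_name).read_text())["data_files"]

commit(["part-0000.parquet"])                       # snapshot 1
commit(["part-0000.parquet", "part-0001.parquet"])  # snapshot 2
print(current_files())
# ['part-0000.parquet', 'part-0001.parquet']
```

Real table formats layer schema tracking, partition statistics, and time travel (reading an older snapshot) on top of this same manifest-plus-pointer mechanism.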

At the same time, many relational database management systems, including popular analytical databases used to build data warehouses, now support unstructured data and machine learning workloads and allow decoupling storage and processing (i.e., querying the mentioned open table formats stored externally).

Evidently, the line between the data warehouse and the data lake is becoming increasingly blurred. Since we can now decouple query engines from storage and have high-quality data management in the data lake, we can introduce a new architecture, the data lakehouse:

With the data lakehouse architecture, we avoid data duplication and keep only the components we need for our data analytics demands.

In this way, the data lakehouse provides sufficient flexibility to meet various demands efficiently. A sample data lakehouse can be built on platforms such as Databricks or Google Cloud Platform, or from open-source components. For example:

  • Data storage: object store in the cloud using Apache Iceberg.
  • Data pipelines: Python/SQL notebooks or scripts orchestrated by Apache Airflow.
  • Query engines: Trino, BigQuery, Spark.
  • BI and data delivery tools are the same as those used in a data warehouse.
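Putting the pieces together, a lakehouse query engine reads the partitioned table directly from storage and skips partitions that the filter rules out, much as Trino or BigQuery prune Iceberg partitions. The sketch below is a hypothetical, stdlib-only illustration of that partition-pruning idea; the paths and row counts are made up for the example.

```python
import json
import tempfile
from pathlib import Path

# Build a small partitioned table on a local stand-in for object storage.
lake_root = Path(tempfile.mkdtemp()) / "events"
for dt, rows in {
    "2024-01-01": [{"user": "a"}, {"user": "b"}],
    "2024-01-02": [{"user": "c"}],
}.items():
    partition = lake_root / f"dt={dt}"
    partition.mkdir(parents=True)
    (partition / "part-0000.jsonl").write_text(
        "".join(json.dumps(r) + "\n" for r in rows)
    )

def count_rows(date_filter: str) -> int:
    """Count rows for one date, scanning only the matching partition
    (partition pruning) instead of the whole table."""
    total = 0
    for part in (lake_root / f"dt={date_filter}").glob("*.jsonl"):
        total += sum(1 for _ in part.open())
    return total

print(count_rows("2024-01-01"))  # 2 — only one partition is read
```

Because the files sit in an open format on shared storage, any engine that understands the layout (Trino, Spark, BigQuery) can run this kind of query against the same data without copying it.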

Conclusion

The data warehouse, data lake, and data lakehouse represent similar and interdependent architectures that emerged through the evolution of data platforms. The key differences are dictated by how the data is stored in each, the target audience of the data, the volume, variety, and velocity of the data, and the approach to data management.

The evolution of data platforms continues. The area of data analytics is highly competitive and full of diverse solutions and tools, making it difficult to navigate. This overview of high-level data architectures sheds light on the direction and landscape of modern data platforms.
