The Data Lakehouse
Every year, a new buzzword emerges in Data and Analytics; this time, get to know the ‘Data Lakehouse’. Will this be the future of all analytical projects, or is it just another hype that will only be used in rare cases? This blog guides you through the architecture of a Data Lakehouse, its differences from classical architectures and the opportunities it brings.
Why change the classical ‘Data Warehouse’?
Classical data warehouses have helped businesses automate reporting and have demonstrated the added value of analytics. These warehouses combined countless operational sources and made it possible to keep track of history. Querying was made easy by presenting business domains in (Kimball-)modelled star schemas. Yet, data that reached the Data Warehouse was often aggregated, and constraints applied because of the ‘sizing’ of the server.
In a second stage, Data Scientists leveraged statistics and machine learning to create added-value predictions and segmentations. They were not quite satisfied with the data presented by a Data Warehouse: Data Scientists need to analyze data at a non-aggregated level, or require detailed unstructured log files that are absent from a DWH. To satisfy this need, Data Lakes made it possible to store these rarely accessed dataflows and make them accessible for analytical purposes.
The age of the Data Lakehouse
A Data Lakehouse is a single point of access for all data within your organization, presented both as a Star Schema for Business Analysts and as raw files for Data Scientists. This architecture has only become viable through the separation of storage and compute. Thanks to this decoupling, compute engines can access virtually unlimited amounts of data stored in your data lake.
Providing a single point of access for all analytical use cases will also function as a bridge between departments: experienced business analysts can explore statistical results, and data scientists can turn a statistical analysis into an accessible report more quickly.
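To make the "single point of access" idea concrete, here is a minimal sketch in Python. It assumes a tiny, made-up sales file standing in for raw data in the lake; the column names and the helper functions (`read_raw`, `build_star_schema`, `sales_per_country`) are hypothetical and not tied to any specific product. The same stored records are served once as raw rows and once as a small star schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw sales events, standing in for a file in the data lake.
RAW_FILE = """order_id,customer,country,amount
1,alice,BE,120.0
2,bob,NL,80.0
3,alice,BE,40.0
"""


def read_raw(file_text):
    """Data Scientist view: every raw record, without aggregation."""
    return list(csv.DictReader(io.StringIO(file_text)))


def build_star_schema(rows):
    """Business Analyst view: a customer dimension plus a sales fact table."""
    dim_customer = {}
    fact_sales = []
    for row in rows:
        name = row["customer"]
        if name not in dim_customer:
            # Assign a surrogate key the first time a customer is seen.
            dim_customer[name] = {
                "customer_key": len(dim_customer) + 1,
                "country": row["country"],
            }
        fact_sales.append({
            "order_id": int(row["order_id"]),
            "customer_key": dim_customer[name]["customer_key"],
            "amount": float(row["amount"]),
        })
    return dim_customer, fact_sales


def sales_per_country(dim_customer, fact_sales):
    """A typical reporting query: join fact to dimension and aggregate."""
    country_by_key = {d["customer_key"]: d["country"] for d in dim_customer.values()}
    totals = defaultdict(float)
    for fact in fact_sales:
        totals[country_by_key[fact["customer_key"]]] += fact["amount"]
    return dict(totals)


raw_rows = read_raw(RAW_FILE)
dim_customer, fact_sales = build_star_schema(raw_rows)
report = sales_per_country(dim_customer, fact_sales)
```

Both views are derived from the same stored file, so nothing is copied into a separate system for one audience or the other.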
Advantages of decoupling Storage and Compute
Traditional data warehouse architectures relied on relational databases. These databases import the data and provide a layer of information that can be accessed very quickly. You would expect that adding more data to a database, without adding query workload, would only require more disk space, but unfortunately this is not the case. These databases keep certain data in clever, quickly accessible structures (think of keys, indexes and caches) that consume non-disk resources (CPU and memory). Hence, feeding in large volumes of data requires you to scale up not only your disk storage but also the more expensive resources such as CPU and memory.
Recent analytical architectures consist of two separate layers: a storage layer (typically a Data Lake using HDFS) and a compute layer. The storage layer can be filled with practically limitless structured and unstructured data without the need to scale up cores and memory. The compute layer is a service that answers your queries by finding the data in the data lake, importing and transforming it, and providing you with the result. As the name implies, all compute-intensive tasks are executed here. Hence, large volumes of data can be put into a Data Lakehouse without the need to increase your computing power. The largest advantages can be found in cloud architectures, where the more expensive compute layer can be scaled up only when it is needed.
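The two layers can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a temporary local directory plays the role of the data lake, CSV files play the role of stored data, and the function names (`write_to_lake`, `query_total`) are hypothetical. Writing to storage does no indexing or processing; all the work happens in the compute function, and only when a query is asked.

```python
import csv
import tempfile
from pathlib import Path


def write_to_lake(lake_dir, file_name, rows):
    """Storage layer: drop a file into the lake. No indexing, no compute."""
    path = Path(lake_dir) / file_name
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sensor", "value"])
        writer.writerows(rows)
    return path


def query_total(lake_dir, sensor):
    """Compute layer: scan the lake on demand and sum one sensor's values."""
    total = 0.0
    for path in sorted(Path(lake_dir).glob("*.csv")):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                if row["sensor"] == sensor:
                    total += float(row["value"])
    return total


with tempfile.TemporaryDirectory() as lake:
    # Storage grows independently of any compute resources.
    write_to_lake(lake, "day1.csv", [("a", 1.5), ("b", 2.0)])
    write_to_lake(lake, "day2.csv", [("a", 3.0)])
    # Compute is only spent when a question is actually asked.
    total_a = query_total(lake, "a")
    print(total_a)
```

In a real platform the compute function would be a scalable query service, but the division of responsibilities is the same: files can keep accumulating without anyone paying for CPU or memory until a query runs.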
Why a ‘Data Lakehouse’ only now?
You may be wondering why the Data Lakehouse is only emerging now.
Firstly, the technologies to efficiently virtualize files in a Data Lake have improved significantly over the past few years.
Secondly, there is less segregation between Data Scientists and Business Users (Citizen Data Scientists). Business Users have tools at their disposal to create data science experiments. They can also turn data residing in Lakes into added value without needing statistics. Data Scientists, on the other hand, are using the same tools as Business Users to visualize their results and make them available to the business.
Thirdly, all stakeholders (business users, data scientists, other applications, …) are requesting faster access to the data.
Fourthly, new technological offerings (such as Synapse Serverless, SQL Server PolyBase, Databricks Serverless SQL and Snowflake) support such architectures in a cost-effective way.
When not to use a Data Lakehouse?
One could ask: should we now implement a Data Lakehouse in every new situation? Every architecture has its pros and cons; while a Data Lakehouse is well suited to analytical workloads and platforms, it is not suited to transactional processes. Due to the decoupling of storage and compute, queries suffer from small but noticeable delays, as the data needs to be fetched from storage and loaded into compute before the result can be returned. Hence, using a Data Lakehouse architecture for transactional processing (OLTP) would not be beneficial for business users.
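A back-of-the-envelope calculation shows why that small delay matters for OLTP but hardly for analytics. All numbers in the sketch below are hypothetical and chosen only to illustrate the argument: a fixed fetch overhead per query, one long-running report query, and many tiny transactional lookups.

```python
# All figures are hypothetical assumptions, not measurements of any product.
FETCH_OVERHEAD_MS = 50        # assumed fixed delay to fetch data from storage
ANALYTICAL_WORK_MS = 30_000   # assumed useful work in one large report query
TRANSACTION_WORK_MS = 2       # assumed useful work in one small lookup/update


def overhead_share(work_ms, overhead_ms=FETCH_OVERHEAD_MS):
    """Fraction of the total query time spent waiting on the storage fetch."""
    return overhead_ms / (overhead_ms + work_ms)


def total_overhead_ms(num_queries, overhead_ms=FETCH_OVERHEAD_MS):
    """Accumulated fetch delay over a number of queries."""
    return num_queries * overhead_ms


analytical_share = overhead_share(ANALYTICAL_WORK_MS)      # well below 1%
transactional_share = overhead_share(TRANSACTION_WORK_MS)  # above 90%

one_report_ms = total_overhead_ms(1)              # a single report query
ten_thousand_tx_ms = total_overhead_ms(10_000)    # a burst of transactions

print(f"analytical overhead share:    {analytical_share:.2%}")
print(f"transactional overhead share: {transactional_share:.2%}")
```

Under these assumptions the fetch delay is noise for a report that runs for half a minute, but it dominates every single transaction, and it adds up quickly when thousands of them are executed.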
A Data Lakehouse is a cost-effective way of providing a Data Platform for analytical workloads. The growing data needs of Data Scientists and Business Users are fuelling a revolution that requires decoupling storage and compute.