Redefining Data Lakes: Is AWS Lake Formation the Answer to Past Troubles?

Over the past few years, the term ‘data lake’ has dominated many conversations about data, and rightfully so. Whether it is used as a buzzword or not, and whether you are a data engineer or a business handling loads of data, the way you ‘store’ that data is probably what is called a data lake. But what is a “data lake” exactly?

So, my first instinct is to Google “What is a data lake?” This search will likely explain that a data lake is a central repository that stores structured and unstructured data in its raw format, while also taking care of processing and securing that data. However, what is often left unsaid is the ‘how’ of it all: storing, processing, and securing your data is something you have to figure out on your own. So yes, building a data lake is not as straightforward as it may seem, due to the numerous manual steps and configurations involved. Is there a way to set up your data lake so that everything is managed for you? Yes, there is: enter AWS Lake Formation!

Automatic data lakes, or maybe semi-automatic?

When building our data lake using Lake Formation, we can trust that many complex tasks like governance, data cataloging and permissions are managed for us. However, this does require some initial manual setup. It is important to be cautious with promises of ‘automatic’ solutions, but Lake Formation does come pretty close. The key to its effectiveness lies in two main building blocks: its own permission model (built on top of the IAM permission model) and Amazon S3 storage. There is actually a third one, but it’s too early to reveal that just yet…

Three stages of creating a data lake using Lake Formation – S3 and the permission model

At the top of the solution stands the data lake administrator. Why are they so important? They are the user who can register where the data lake is stored, query the Data Catalog, create databases, and more. This makes them key in populating the data lake, managing access permissions, and ensuring its smooth operation.

Let’s create an IAM user with administrator privileges, which we will then select as our data lake administrator. We can create this super-user very quickly in the IAM section of the AWS web interface, but for the sake of demonstration, let’s use the AWS CloudShell:
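A rough sketch of what that could look like in CloudShell, assuming a hypothetical user name datalake-admin (the broad AdministratorAccess policy is used purely for demonstration):

# Create the IAM user that will become our data lake administrator
aws iam create-user --user-name datalake-admin

# Attach the AWS-managed AdministratorAccess policy (demo only; scope this down in production)
aws iam attach-user-policy \
    --user-name datalake-admin \
    --policy-arn arn:aws:iam::aws:policy/AdministratorAccess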

With this user created, the next question is: where exactly is the data lake going to reside? This is where S3, the second building block, comes into play. And this is where it gets a bit confusing, because Lake Formation doesn’t actually copy or move data into S3. Your data might live in S3 buckets or on-premises; Lake Formation only identifies the data you want in your data lake and records it in the Data Catalog, which is itself sometimes referred to as the data lake.

First, we need some test data to populate our data lake. For the sake of my own nostalgia, let’s use drug development annotation data. We run a couple of commands in AWS CloudShell to create an S3 bucket and upload the data.
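A minimal sketch of those commands, assuming a hypothetical bucket name drug-annotations-demo and a local file drug_annotations.csv:

# Create an S3 bucket that will serve as the storage location for our data lake
aws s3 mb s3://drug-annotations-demo

# Upload the drug annotations test data into a dedicated prefix
aws s3 cp drug_annotations.csv s3://drug-annotations-demo/drug-annotations/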

All the building blocks for Lake Formation are now in place, so it’s time to hop onto the AWS GUI and create our data lake in four steps (a CLI sketch of the same steps follows the list):

Step 1: Adding our super-user as data lake administrator
Step 2: Registering our S3 bucket with drug annotations data as a data lake location
Step 3: Granting our super-user access to our S3 bucket holding our drug annotations data
Step 4: Creating the database that will hold the Data Catalog (created by AWS Glue Crawlers)
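For reference, a rough CLI equivalent of these four steps, using the hypothetical names from earlier and the placeholder account ID 123456789012, could look like this:

# Step 1: make our super-user the data lake administrator
# (note: put-data-lake-settings overwrites the existing data lake settings)
aws lakeformation put-data-lake-settings \
    --data-lake-settings '{"DataLakeAdmins":[{"DataLakePrincipalIdentifier":"arn:aws:iam::123456789012:user/datalake-admin"}]}'

# Step 2: register the S3 bucket as a data lake location
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::drug-annotations-demo \
    --use-service-linked-role

# Step 3: grant our super-user access to that data location
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:user/datalake-admin \
    --permissions DATA_LOCATION_ACCESS \
    --resource '{"DataLocation":{"ResourceArn":"arn:aws:s3:::drug-annotations-demo"}}'

# Step 4: create the database that will hold the Data Catalog tables
aws glue create-database --database-input '{"Name":"drug_annotations_db"}'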

The “Lake Formation – Glue” Tandem

No matter where your data is located (in S3 buckets, Amazon RDS databases, or on-premises), it can be integrated into the data lake. This is why AWS Glue is an integral part of Lake Formation: the two services work together throughout the process of data lake management.

An important feature of AWS Glue within Lake Formation is AWS Glue Crawlers. Crawlers automatically detect and catalog metadata from various data sources into what is called the Data Catalog. By default, these crawlers examine all the data in the data lake, including data that has already been cataloged. With incremental crawling, however, only new or changed data is analyzed and added to the Data Catalog.

Example of how AWS Glue creates a Data Catalog for a Lake Formation data lake.

To ensure your Data Catalog accurately represents the data in the data lake, you can set up scheduled crawls, which run at regular intervals to keep the catalog up to date. AWS Glue also supports schema evolution: when the structure of your data changes, the crawlers update the metadata to reflect those changes.
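Once a crawler exists (we will create ours below), incremental crawling and a schedule could be configured from the CLI roughly as follows; the crawler name drug-annotations-crawler is hypothetical, the cron expression is just an example, and incremental crawls also require the schema change policy to be set to log:

# Only crawl new folders on subsequent runs (incremental crawling)
# and re-run the crawler every night at 02:00 UTC
aws glue update-crawler \
    --name drug-annotations-crawler \
    --recrawl-policy RecrawlBehavior=CRAWL_NEW_FOLDERS_ONLY \
    --schema-change-policy UpdateBehavior=LOG,DeleteBehavior=LOG \
    --schedule "cron(0 2 * * ? *)"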

Let’s crawl our drug annotations data! This process is greatly simplified by just using the AWS GUI. Navigate to the AWS Glue page:

How to set up a Glue Crawler to crawl our uploaded drug annotations data

As you can see, it is inside the Glue Crawler creation flow that we identify the data sources we want to include in the Data Catalog. Other important things to note are how we select a location to store our data lake output, create a Glue Crawler-specific IAM role, and choose our scheduling frequency.
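For completeness, a rough CLI equivalent of that creation flow might look like the following; the crawler name, IAM role, and S3 path are hypothetical, and the role is assumed to already have the necessary Glue and S3 permissions:

# Create a crawler that catalogs the drug annotations data into drug_annotations_db
# and runs every night at 02:00 UTC
aws glue create-crawler \
    --name drug-annotations-crawler \
    --role arn:aws:iam::123456789012:role/GlueCrawlerDemoRole \
    --database-name drug_annotations_db \
    --targets '{"S3Targets":[{"Path":"s3://drug-annotations-demo/drug-annotations/"}]}' \
    --schedule "cron(0 2 * * ? *)"

# Kick off a first run manually instead of waiting for the schedule
aws glue start-crawler --name drug-annotations-crawler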

Lake Formation meets Athena

Integrating our data lake with Amazon Athena is a very powerful feature for all data professionals: it allows us to quickly run ad hoc queries without complex configuration steps. Navigate over to Amazon Athena, where the AWSDataCatalog is available by default in the query window. Define a location to save our query results, select the correct database, and our crawled table will be available too.
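The same kind of ad hoc query can also be fired off from the CLI; a small sketch, assuming the crawler created a table called drug_annotations in our hypothetical database and bucket:

# Run an ad hoc query against the crawled table and store the results in S3
aws athena start-query-execution \
    --query-string "SELECT * FROM drug_annotations LIMIT 10" \
    --query-execution-context Database=drug_annotations_db \
    --result-configuration OutputLocation=s3://drug-annotations-demo/athena-results/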

Amazon Athena automatically loads the AWSDataCatalog.
Data from our Lake Formation data lake being queried in Amazon Athena

Conclusion

In summary, Lake Formation revolutionizes data lake management by automating the traditionally complex tasks of governance, data cataloging, and permissions. Its integration with AWS Glue and S3 streamlines the creation and upkeep of data lakes, allowing data professionals to focus on deriving insights instead of manual configurations. Together, AWS Glue and Lake Formation ensure that the data lake solution is versatile and not limited to AWS-hosted data. Features like scheduled crawls and schema evolution support automate metadata discovery and updates, improving efficiency and freeing up time for more data-centric work.

My personal experience with Lake Formation was positive. I spent less time organizing my data and managing its flow, thanks to its integration with tools like Amazon Athena, AWS Glue Studio, Amazon QuickSight, and Amazon SageMaker Studio. This integration provides a robust environment for data professionals, facilitating tasks like ad hoc queries, ETL job development, visualization creation, and machine learning workflow execution.

Does Lake Formation redefine data lakes? Not entirely. There were still moments when I puzzled over certain IAM roles and security policies, or over where exactly our data was residing and being loaded to… But it does provide a powerful platform for data analysts, data engineers, business intelligence analysts, and data scientists. It allows them to delve deeper into their data, focusing on meaningful narratives rather than getting bogged down in technical difficulties.

Ilias Bukraa

I am Ilias Bukraa and I love data engineering (sometimes…). With a background in bioinformatics and a passion for code, I started working at Aivix with the goal and ambition of applying this passion and my previous data engineering experience to cloud projects. At Aivix I received opportunities to obtain certifications in both Databricks and AWS, so let’s see where we end up!