Streamlining data warehousing: Harnessing the power of Amazon Redshift with dbt integration

Nowadays, data engineers and scientists can hardly keep up with all the ETL tools popping up like mushrooms from the ground. Now imagine a tool that is not strictly an ETL tool, yet takes center stage in data transformation. With your data already extracted and loaded (the E and L in ETL) into your data warehouse, dbt excels at transforming it. By expressing your transformations in plain SQL and supporting many cloud data platforms such as BigQuery, Databricks, Snowflake, Azure Synapse and many more, dbt is here to stay. In this blog post, we will dive deeper into integrating dbt with one of those supported platforms, Amazon Redshift.

Amazon Redshift makes large-scale analytics on your data warehouse easier than ever. Designed for petabytes of data, it keeps queries fast by using column-based instead of row-based storage and by MPP (massively parallel processing: parallelising queries over multiple nodes).

With this brief introduction to dbt and Amazon Redshift, let’s experiment with both and integrate dbt into our own Aivix Redshift cluster!

Python to the rescue: installing and creating a dbt project

Python, favored by many data professionals, will serve as our foundation for the dbt installation process covered in this blog. For our environment we have an EC2 instance that connects to our Amazon Redshift cluster. We SSH into the EC2 instance using our IDE of choice, AWS Cloud9, and run the following commands to install dbt on the instance:
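(Assuming pip is available on the instance; the dbt-redshift adapter pulls in dbt-core as a dependency.)

    # Install dbt together with the Redshift adapter
    pip install dbt-core dbt-redshift

    # Check that the installation succeeded
    dbt --version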

With these commands done, you can now launch a new dbt project and configure it to be used on your Redshift cluster using the following command:
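(The project name redshift_dbt_demo below is just a placeholder; dbt prompts for the adapter and the Redshift connection details interactively.)

    # Scaffold a new dbt project and walk through the connection setup
    dbt init redshift_dbt_demo

The answers you give during this setup end up in a profiles.yml file under ~/.dbt/.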

Important note:

When setting up the dbt project and connecting to the Redshift cluster, you must supply either the cluster identifier (when working with a provisioned cluster) or the endpoint (when working with Redshift Serverless).

Setting up dbt using a provisioned or a serverless Redshift cluster.
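To give an idea of what that configuration ends up looking like, a ~/.dbt/profiles.yml along these lines covers both cases; the profile name, hosts, credentials and schema below are placeholders for your own environment:

    redshift_dbt_demo:
      target: provisioned
      outputs:
        provisioned:
          type: redshift
          method: iam                         # IAM auth against a provisioned cluster
          cluster_id: my-redshift-cluster     # the cluster identifier (placeholder)
          host: my-redshift-cluster.abc123.eu-west-1.redshift.amazonaws.com
          user: awsuser
          dbname: dev
          schema: analytics
          port: 5439
          threads: 4
        serverless:
          type: redshift
          # for Redshift Serverless, the workgroup endpoint is supplied as the host
          host: my-workgroup.123456789012.eu-west-1.redshift-serverless.amazonaws.com
          user: awsuser
          password: "{{ env_var('REDSHIFT_PASSWORD') }}"
          dbname: dev
          schema: analytics
          port: 5439
          threads: 4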

Automating your SQL queries (and how it is possible)

One thing that really stands out when using dbt is the possibility of automating the creation of the base models. To be able to do this, two packages are needed: dbt-codegen and dbt-utils by dbt-labs. These packages can generate schema information, telling you which tables are available in your schema, and they can also build simple data models using the many functions they provide.
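Both packages are declared in the project's packages.yml; the version ranges below are only indicative:

    packages:
      - package: dbt-labs/codegen
        version: [">=0.9.0", "<1.0.0"]
      - package: dbt-labs/dbt_utils
        version: [">=1.0.0", "<2.0.0"]

Running dbt deps afterwards downloads both packages into the project.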

Let us take the generate_source function as an example. When we run this function on our sample database and supply it with a schema name, we get a full overview of all the tables in that schema. Saving the automatically generated schema information is a crucial step if you want to (semi-)automatically create your data models.

Getting the schema information which can later be used to create base models.
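As a sketch, the call looks roughly like this, with public standing in for your own schema name:

    # Print source YAML (tables and, optionally, columns) for a given schema
    dbt run-operation generate_source --args '{"schema_name": "public", "generate_columns": true}'

The generated YAML is printed to the terminal; saving it into a sources file in your models folder (for example models/staging/sources.yml) is the crucial step mentioned above.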

Coupling the generate_source function with the generate_base_model function abstracts away the need to write the SQL for the base models by hand. Keep in mind that this approach is suited for basic SELECT queries to construct your database models and may not cater to more complex scenarios. Additionally, it is important to remember that the process isn't entirely automated; it still requires input parameters when invoking the dbt-codegen functions.

Example of how to build a base model.
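Invoking it could look roughly like this, with the source and table names being placeholders:

    # Print the SELECT statement for a base model on top of a declared source table
    dbt run-operation generate_base_model --args '{"source_name": "my_source", "table_name": "customers"}'

The printed SQL can then be saved as, say, models/staging/stg_customers.sql.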

Now one might ask: what makes it possible for these base models to be automatically created using the schema, which was itself also automatically created? Let me tell you about the wonderful capabilities of dbt macros. Think of your standard Python functions or Java methods; dbt macros are ‘exactly’ the same: reusable pieces of (Jinja-templated) SQL that make complex or frequently used SQL logic easy to maintain and reuse. Macros in dbt can be shared across projects and, on top of that, they can be parameterised, making their behaviour dynamic. Why don’t you hop on over to the dbt-codegen package and check out some of the dbt macros they offer (TIP: generate_source and generate_base_model are dbt macros too!).

Definition of a simple dbt macro
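As an illustration (not necessarily the macro from the original example), a simple parameterised macro stored in the macros/ folder could look like this:

    -- macros/cents_to_dollars.sql: convert an integer cents column to a dollar amount
    {% macro cents_to_dollars(column_name, decimal_places=2) %}
        round(1.0 * {{ column_name }} / 100, {{ decimal_places }})
    {% endmacro %}

Inside a model it is then called like any other piece of SQL: {{ cents_to_dollars('amount_cents') }}.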

Models and materialisation

When we talk about models in a dbt context, we refer to a defined SQL query that transforms our data into a structured format. Take our base models: these were built with simple SELECT statements from our schema information. Now we want to build more refined models instead of simple SELECTs. The power of dbt allows us to reference these models inside other models; this is possible because the SQL logic for each model is saved as a .sql file that can be referenced.

Building custom models by referencing the base model in the SQL logic
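A rough sketch of such a custom model; the stg_customers and stg_orders base models and their columns are made up for this example:

    -- models/marts/customer_orders.sql: a custom model built on top of two base models
    with customers as (
        select * from {{ ref('stg_customers') }}
    ),

    orders as (
        select * from {{ ref('stg_orders') }}
    )

    select
        customers.customer_id,
        count(orders.order_id) as number_of_orders
    from customers
    left join orders
        on customers.customer_id = orders.customer_id
    group by customers.customer_id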

The file system’s versatility enables the creation and storage of custom models with complex transformations in .sql files. Note that dbt has custom syntax that lets you refer to other models not by absolute or relative file paths, but with the {{ ref('<model name>') }} syntax.

Let’s reflect on this: when we execute a custom model, what determines its persistence in our database? Will it create a new table, or will it create a view on top of our source data (base model)? In dbt, this phenomenon is called materialisation, and we can choose for ourselves what the best way of persisting our data is:

  • View: the default materialisation in dbt, built with create view as syntax; data is retrieved dynamically from the source tables and the results are not persisted.
  • Table: a fully materialised table that stores the results of the SQL query in the data warehouse using create table as syntax.
  • Incremental: dbt identifies changes in the source data and updates your existing model to reflect these changes.
  • Seed: a materialisation type used for static data that is usually loaded from CSV files.
  • Materialised view: a hybrid between the Incremental and Table materialisations.
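To give an idea of how this is configured: a model can declare its materialisation in a config block at the top of its .sql file (or project-wide in dbt_project.yml). The incremental sketch below, including the updated_at column, is purely illustrative:

    -- The materialisation is declared at the top of the model file
    {{ config(materialized='incremental', unique_key='customer_id') }}

    select * from {{ ref('stg_customers') }}

    {% if is_incremental() %}
    -- on incremental runs, only pick up rows that changed since the last run
    where updated_at > (select max(updated_at) from {{ this }})
    {% endif %}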

Documentation, documentation and documentation

Comprehensive documentation is indispensable throughout the development lifecycle. Especially in dbt, where models refer to other models, it may become difficult to keep track of all the relations. That is why dbt lends a helping hand with the following two commands:
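(Both are part of the dbt CLI itself, so no extra packages are needed.)

    # Compile the project and generate the documentation artefacts
    dbt docs generate

    # Launch a local webserver to browse the docs and lineage graphs
    dbt docs serve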

These two commands simplify documentation by scanning your models and their SQL content to track inter-model references. By launching a localhost webserver, dbt gives you a visual overview of all the models defined in your dbt solution, with graphs showing the references made to and from your custom models.

Conclusion

Using dbt with Redshift offers numerous benefits and opens up even greater opportunities: connecting dbt to your Amazon Redshift solution can immensely improve your transformation workloads. So far we have talked about connecting a dbt solution to our Redshift data warehouse, automating SQL queries, building complex models and more. But dbt has much more to offer: think of Git integration, CI/CD, customising your pipelines, creating hooks, …

Ilias Bukraa

I am Ilias Bukraa and I love data engineering (sometimes…). With a background in bioinformatics and a passion for code, I started working at Aivix with the goal and ambition of using this passion and my previous data engineering experience on cloud projects. At Aivix I received opportunities to obtain certifications in both Databricks and AWS, so let us see where we end up!