Top insights from the Databricks Summit 2024

The annual Databricks Summit took place in San Francisco this year, running from June 10 to June 13. This event is the ideal place to catch all the latest releases, announcements, and exciting features around topics like Databricks, Delta Lake, Spark, and others. There were 10 different tracks, including “Data Engineering and Streaming”, “Data Governance”, and many more, all packed with sessions given by expert speakers. Business use cases, demos of innovative technologies, new announcements, you name it. Of course, we attended some of these sessions (online) and would like to elaborate on some of the main topics that caught our attention.

Figure 1: Official Databricks Summit welcome page

Keynote By Ali Ghodsi (CEO)

The first and most important keynote was given by Ali Ghodsi (co-founder & CEO of Databricks). His session revolved around three main problems in the data world and the Databricks answer to each of them.

1. The problem of a scattered data estate

There are many ways and formats to store your data, often with little difference between two formats. Very recently, Databricks acquired Tabular, a company known for the widely used open-source table format Apache Iceberg. This format bears a strong resemblance to Databricks’ “in-house” format, Delta Lake. Now, with the combined knowledge of Delta and Iceberg, Ali wants to bring the two formats closer than ever, eventually resulting in a single format that can be used for unified data storage. It really is combining the best of both worlds.
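In the meantime, Delta UniForm already bridges part of the gap by letting a Delta table expose Iceberg-compatible metadata. Below is a minimal sketch in PySpark, run from a Databricks notebook where spark is the built-in SparkSession; the table and column names are made up, and the table properties are as documented for UniForm at the time of writing.

# Minimal sketch of Delta UniForm: the table is written as Delta, and
# Iceberg metadata is generated alongside it so Iceberg clients can read
# the same files. Names are illustrative.
spark.sql("""
    CREATE TABLE main.sales.orders_unified (
        order_id BIGINT,
        amount   DOUBLE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2'          = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# Databricks keeps writing plain Delta; Iceberg readers consume the same
# data through the generated Iceberg metadata.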

2. Data Security and Governance: open-sourcing Unity Catalog

The problem of data security and governance arises in many organizations. Databricks stresses the importance of its Unity Catalog project in this respect. Unity Catalog does not just govern your tabular data; it also governs AI and machine learning models and supports lineage and data quality. Ali Ghodsi announced that Databricks will open-source Unity Catalog, which is exciting: people can now view and modify the software and build a community to spot issues and bring improvements. There are plenty of examples (Airflow, Docker, Kubernetes) that show how the synergies of open-sourcing a codebase can improve and solidify a product as a go-to tool. Especially with something as precious as data security and governance, there is little doubt that Unity Catalog will also experience these benefits.
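To give a feel for what Unity Catalog governs, here is a minimal sketch of its three-level namespace (catalog.schema.table) and SQL-based permissions, again run from a Databricks notebook with the built-in spark session; the catalog, schema, table, and group names are made up.

# Minimal sketch of Unity Catalog objects and a grant; all names illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS finance")
spark.sql("CREATE SCHEMA IF NOT EXISTS finance.reporting")
spark.sql("""
    CREATE TABLE IF NOT EXISTS finance.reporting.invoices (
        invoice_id BIGINT,
        amount     DOUBLE
    )
""")
# Governance is expressed as grants on the same three-level objects:
spark.sql("GRANT SELECT ON TABLE finance.reporting.invoices TO `analysts`")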

3. Democratizing data and AI with Databricks Mosaic AI

Companies struggle with the start of their AI journey. To tackle this, Databricks wants to democratize data and AI. Democratizing data means that everybody in the organization can ask questions of the data directly instead of having to go through the data team. Generative AI can, for instance, generate an explanation for each data column. Capabilities like this make it possible to query your data with natural-language prompts instead of asking a data engineer to write a complex SQL query. Democratizing AI means helping data practitioners develop AI models quickly and efficiently by leveraging Databricks Mosaic AI. This tool helps developers in every AI stage, from data preparation and model training and tuning to evaluation and governance of the model. Additionally, the AI/BI product will also help in this field; more on that later.
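To make the column-explanation idea concrete, here is a hypothetical sketch using the ai_query() function available in Databricks SQL; the endpoint name is an assumption, and any model-serving endpoint in your workspace could take its place.

# Hypothetical sketch: drafting a business description for a column with the
# ai_query() SQL function. The endpoint name is an assumption; substitute any
# model-serving endpoint available in your workspace.
spark.sql("""
    SELECT ai_query(
        'databricks-meta-llama-3-70b-instruct',
        'Write a one-sentence business description of a column named customer_churn_flag in a telecom customers table.'
    ) AS column_description
""").show(truncate=False)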

Figure 2: Ali Ghodsi (left) shaking Jen-Hsun Huang’s (right) hand at the summit

NVIDIA Partnership

Another key moment at the summit was the visit of Jen-Hsun Huang, the CEO of NVIDIA. Databricks announced an expanded collaboration with NVIDIA to optimize data and AI computing by bringing NVIDIA CUDA-accelerated computing to the core of the Databricks platform. That core is Photon, Databricks’ next-generation engine, which provides extremely fast query performance at low cost. Photon is the driving force behind Databricks SQL, a serverless data warehouse that boasts top-tier price-performance and lowers total cost of ownership. Both Databricks and NVIDIA anticipate that this collaboration will mark a new milestone in achieving exceptional price-performance ratios. With this strategic partnership, Databricks and NVIDIA are positioned to transform the landscape of data processing and enterprise AI, making advanced data analytics more accessible and cost-effective than ever before.

LakeFlow

At the summit Databricks was also thrilled to unveil Databricks LakeFlow, a comprehensive solution for building and operating production data pipelines. Before data can be transformed into actionable insights, data teams often need to ingest data from all kinds of sources in all kinds of formats. Maintaining all these connectors, ingestion patterns, and pipelines can be a full-time job, and monitoring these flows often requires a custom-built framework. To take this cumbersome work off your hands, Databricks has developed LakeFlow. It can be split into three key components.

LakeFlow Connect

LakeFlow Connect provides point-and-click data ingestion from databases (MySQL, Postgres…) and enterprise applications (Salesforce, Microsoft Dynamics…), and extends the native connectors for cloud storage (S3, Azure Data Lake) that are already in Databricks. Change Data Capture (CDC) is used as the ingestion pattern, guaranteeing simple, reliable, and efficient loading of your data.
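For intuition, this is roughly the CDC upsert pattern that LakeFlow Connect automates for you, written by hand as a Delta Lake MERGE; the table names are illustrative, and updates is assumed to be a view holding the latest change records pulled from the source database.

# Sketch of the CDC pattern LakeFlow Connect automates: apply the latest
# change records ("updates", assumed to exist as a view) to a Delta table.
spark.sql("""
    MERGE INTO bronze.customers AS t
    USING updates AS s
    ON t.customer_id = s.customer_id
    WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op != 'DELETE' THEN INSERT *
""")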

LakeFlow Pipelines

LakeFlow Pipelines lowers the complexity of building and maintaining pipelines. It is built on top of the Delta Live Tables framework but abstracts away some of the technicalities, allowing you to focus on the business logic and transformations that really provide value.
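Since it builds on Delta Live Tables, a minimal DLT pipeline gives a feel for the programming model. The sketch below assumes it runs inside a DLT pipeline (where import dlt and the spark session are available); the source path and table names are made up.

# Minimal Delta Live Tables sketch: one raw ingestion table and one cleaned
# table derived from it. Runs only inside a DLT pipeline.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders ingested from cloud storage")
def orders_raw():
    return spark.read.format("json").load("/mnt/raw/orders")  # illustrative path

@dlt.table(comment="Orders with basic cleanup applied")
def orders_clean():
    return (
        dlt.read("orders_raw")
        .where(F.col("amount") > 0)
        .withColumn("ingested_at", F.current_timestamp())
    )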

LakeFlow Jobs

To bring everything together, LakeFlow Jobs can be used to orchestrate your pipelines. On top of that, the extensive monitoring options make sure nothing goes unnoticed.
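LakeFlow Jobs evolves the existing Databricks Workflows. As a sketch of what such orchestration looks like in code today, here is a two-task job defined with the Databricks Python SDK; the job name and notebook paths are made up.

# Sketch: defining a two-task job with the Databricks Python SDK
# (pip install databricks-sdk); authentication comes from your environment.
# Tasks here assume serverless job compute; otherwise add a cluster spec.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()
job = w.jobs.create(
    name="orders-pipeline",  # illustrative name
    tasks=[
        jobs.Task(
            task_key="ingest",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/ingest"),
        ),
        jobs.Task(
            task_key="transform",
            depends_on=[jobs.TaskDependency(task_key="ingest")],
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/etl/transform"),
        ),
    ],
)
print(f"Created job {job.job_id}")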

Extra features include integration with Unity Catalog, integration of the AI-powered Databricks Assistant, and serverless compute. Databricks really focuses on this serverless compute across all its offerings (notebooks, jobs, workflows, …) to remove the hassle of selecting the perfect cluster for your specific use case.

AI/BI: Intelligent Analytics for Real-World Data

Databricks AI/BI is a new type of business intelligence product that deeply understands your data’s semantics and enables anyone to analyze data for themselves. In many companies there is a constant cycle of business users wanting a new dashboard and data professionals providing that dashboard. But in the modern world this cycle can never be executed fast enough: new questions arise before the previous ones can even be answered. Generative AI products have already been incorporated into BI tools, but the results have not been satisfactory yet. When confronted with messy data and ambiguous language, most Gen AI products fail to deliver a clear and concise answer.

Most of the knowledge on how to handle these real-world problems exists in the heads of the people who use the data on a day-to-day basis. Semantics, schemas, combinations: all those things are hard to put on paper, and even harder to feed into a Gen AI prompt. The knowledge comes pouring out in day-to-day interactions with the data, the queries that are written, the dashboards that are checked… Databricks AI/BI uses a compound AI system that captures and learns from all these interactions in order to fully understand the semantics of the data in your company. A compound AI system consists of multiple small but very specialized agents that each gather information. For example, you can have an agent for Unity Catalog, one for SQL queries, and one for planning. Combined, all these little components make sure that every detail is captured.
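This compound pattern is easier to picture in code. The sketch below is purely conceptual, not the Databricks implementation: a question is routed through a few stubbed specialist agents, and a planner combines their findings.

# Conceptual sketch of a compound AI system (not the Databricks internals):
# small specialist agents each contribute context, and a planner combines it.
def catalog_agent(question: str) -> str:
    return "relevant tables: sales.orders, sales.customers"  # stubbed result

def query_history_agent(question: str) -> str:
    return "similar past query: SELECT region, SUM(amount) ..."  # stubbed result

def planner(question: str, context: list[str]) -> str:
    return f"plan for {question!r} using: " + "; ".join(context)

def answer(question: str) -> str:
    context = [catalog_agent(question), query_history_agent(question)]
    return planner(question, context)

print(answer("Which region grew fastest last quarter?"))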

Users can then interact with the AI/BI model in two ways: AI/BI dashboards and Genie.

AI/BI dashboards provide a low-code experience to easily configure the data and charts you want. Genie is a more classical chat interface in which you can ask questions in natural language. Genie greatly benefits from human feedback to refine its understanding of your company’s data.

Conclusion

AI is clearly an important focus for Databricks, as emphasized by the partnership with NVIDIA, the additions to Mosaic AI, and the introduction of the AI/BI product. Furthermore, Databricks focuses on making the platform as convenient as possible, making sure that every part of a data team can work as efficiently as possible without losing too much time in the details. Of course, we could not cover everything mentioned at the summit, so be sure to check out the Data + AI Summit website to watch some of the more than 500 sessions. For all Databricks questions, feel free to contact us so we can help you in your Databricks journey.

Sources:

LakeFlow blog: https://www.databricks.com/blog/introducing-databricks-lakeflow
AI/BI blog: https://www.databricks.com/blog/introducing-aibi-intelligent-analytics-real-world-data
YouTube recording of the first summit day: https://www.youtube.com/watch?v=SAsoWmMhX3Q
Official Databricks documentation

Jarne Demunter

Consultant @ Aivix

Joachim Depovere

Consultant @ Aivix