Exploring the potential of Databricks Assistant: the future of Data Lakehouse interaction?
At the most recent Databricks Data & AI Summit (2023), Databricks announced LakehouseIQ, an AI-powered engine that helps you get more insights from your data lakehouse and makes it easier to interact with. While no official release date has been published for LakehouseIQ, Databricks just released the Databricks Assistant, an AI-based companion that helps you write code and interact with your lakehouse’s data. In this blog we’ll cover how you can enable it, what exactly it does, how it differs from other tools like ChatGPT and Copilot, and some key things you should know about it.
How can I use the Databricks Assistant?
The Assistant is nicely integrated in your Databricks workspace. Because it is still in Public Preview, an account admin may need to enable it before you can use it. This can be done by going to the Account Portal (Manage Account at the top right corner of your workspace), then to Settings, and turning on “Enable third party services for AI assistive features”.
Once this is done, you can find the Assistant in the left ribbon when you open a notebook or .py file.
What can I do with the Databricks Assistant?
Autocomplete your code
The first thing the Assistant can do is suggest code. You can write a comment in a cell that describes what you want, and the Assistant will complete it with code. This is triggered by pressing Ctrl+Shift+Space. Note that it sometimes takes a few seconds before you get the suggestion.
Explain and transform code
Assume your go-to programming language is Python, and you wrote a specific query that yields crucial business insights. If you want to share this code with a business analyst who only works with SQL, you can ask the Assistant to convert the PySpark code to SQL.
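To make this concrete, here is a minimal sketch of what such a PySpark/SQL pair could look like. The table `orders` and its columns `country`, `status` and `amount` are hypothetical, and the SQL is a hand-written equivalent for illustration, not actual Assistant output. The function expects the `spark` session that Databricks notebooks provide.

```python
# Hypothetical example: the table "orders" and its columns are assumptions.


def revenue_per_country_pyspark(spark):
    """PySpark version: total amount of completed orders per country."""
    # Imported lazily so this module also loads where PySpark is not installed.
    from pyspark.sql import functions as F

    return (
        spark.table("orders")
        .where(F.col("status") == "completed")
        .groupBy("country")
        .agg(F.sum("amount").alias("total_amount"))
        .orderBy(F.col("total_amount").desc())
    )


# Hand-written SQL equivalent of the DataFrame code above.
REVENUE_PER_COUNTRY_SQL = """
SELECT country, SUM(amount) AS total_amount
FROM orders
WHERE status = 'completed'
GROUP BY country
ORDER BY total_amount DESC
"""
```

Both versions filter, group, aggregate and sort in the same order, which makes it easy to check a converted query against the original step by step.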
Now, let’s think about a scenario where a colleague is on holiday, but a crucial pipeline they created is failing. You need to fix the code, but you don’t understand what it is doing or what it is trying to achieve. The Assistant can help here as well, as you can ask it to explain a piece of code.
Ask Databricks-specific questions
The Assistant can also help you with Databricks-specific questions. Every LLM only knows information up to a certain cutoff date (e.g. ChatGPT with GPT-4 knows information until September 2021). The same holds for the Assistant, but it is frequently fine-tuned on Databricks-specific documentation. For example, if we ask “What are Databricks Asset Bundles?”, a new tool that was announced at the most recent Data & AI Summit, we get a good, up-to-date answer. However, if we ask about ‘Liquid clustering’, a Delta Lake feature that was announced around the same time, the Assistant can’t provide an answer.
How does Databricks Assistant differ from other tools like ChatGPT?
While the above is pretty cool, these are all things we have seen a lot in the past half year with other tools like ChatGPT, GitHub Copilot, Google Bard, … So what exactly is the added value of the Assistant?
A very powerful feature of the Databricks Assistant is that it automatically integrates additional context from your workspace. In order to provide more accurate responses, it considers code and comments from within your notebooks, but also Unity Catalog metadata like table schema, comments, tags, …
Tables that are used frequently are also favored in the answers.
Let’s use the following case as an example. There is a basic table ‘sales’ that contains the sales of different stores. The table looks as follows.
If we ask the Databricks Assistant “Give me a SQL query that returns the total sales per store”, we get the expected result.
Now assume that the company has some corporate jargon, ‘SCBIM’, which stands for ‘Store Controlled By Independent Manager’. If we ask the Assistant for the sales per SCBIM, it doesn’t know what to return, because it doesn’t know the meaning of the abbreviation.
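As an illustration, a query of the expected kind could look like the sketch below. It assumes that ‘sales’ has a `store` and an `amount` column (the column names and sample rows are assumptions, not taken from the actual table), and it uses Python’s built-in sqlite3 as a stand-in so that it runs anywhere. In Databricks you would run the same SQL in a notebook cell.

```python
import sqlite3

# Assumed schema: the column names of "sales" are illustrative.
TOTAL_SALES_PER_STORE = """
SELECT store, SUM(amount) AS total_sales
FROM sales
GROUP BY store
ORDER BY store
"""


def total_sales_per_store(rows):
    """Load (store, amount) rows into an in-memory table and aggregate them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (store TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        return conn.execute(TOTAL_SALES_PER_STORE).fetchall()
    finally:
        conn.close()


# Made-up sample data, purely for illustration.
sample = [("Antwerp", 120.0), ("Ghent", 80.0), ("Antwerp", 30.5)]
print(total_sales_per_store(sample))
```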
Within Unity Catalog, there is the option to add comments to tables. If we now add the comment “Sales per SCBIM” to the table, and ask the question above again, we see that we get a different answer.
The Assistant now returns the desired query.
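One way to add such a table comment is with the `COMMENT ON TABLE` statement in Databricks SQL. The sketch below builds that statement from Python. The helper name `build_table_comment_sql` is hypothetical, and the unqualified table name `sales` assumes that the table lives in the current catalog and schema.

```python
def build_table_comment_sql(table_name: str, comment: str) -> str:
    """Build a COMMENT ON TABLE statement (hypothetical helper)."""
    # Keep the sketch simple: refuse characters that would need escaping.
    if "'" in comment or "\\" in comment:
        raise ValueError("comment must not contain quotes or backslashes")
    return f"COMMENT ON TABLE {table_name} IS '{comment}'"


statement = build_table_comment_sql("sales", "Sales per SCBIM")
print(statement)

# In a Databricks notebook, where `spark` is predefined, you would then run:
# spark.sql(statement)
```

Because the comment is stored as Unity Catalog metadata, it is available to everyone who queries the table, not only to the notebook in which it was set.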
While the case above is a relatively basic one, the Assistant can also provide help in more complex scenarios. The more the Assistant is used, the more it will remember typical query patterns, and understand the specific business logic.
Things to keep in mind
The first thing to know is that the Assistant uses table and column metadata, but it doesn’t know what’s actually stored inside the rows of the table. If the values in a column contain specific corporate jargon, the Assistant is not able to use this information at the moment. This is something we’ll probably see more of once LakehouseIQ is entirely rolled out.
Secondly, the Assistant will never execute code on your behalf. It always requires the user to validate the code or query, add it to a cell and execute it. This is important. While probably all of us have used ChatGPT or other LLMs to get good results, we have all faced cases where the model returned answers that didn’t make sense (called hallucination). This is also the case for the Databricks Assistant, as it can return answers or queries that are not what you are looking for.
In the example below, the Assistant returns a query with non-existent tables. At the end, it mentions that the tables in the query should be replaced by the ones being used in the notebook, but at that time there were no similar tables used in the notebook.
Finally, at the time of writing it is not entirely clear which LLM is used by the Assistant. The documentation mentions that the Assistant uses Azure OpenAI Service. Based on its answers to some specific questions, the cutoff date appears to be somewhere in the second half of 2021 (it thought the most recent Data & AI Summit was in May 2021). This leads us to assume that it uses GPT-3.5. Keep this in mind when asking the Assistant questions. As mentioned above, it is fine-tuned on more recent Databricks documentation, but it is not clear how far this goes. For example, will it also be aware of the most recent Spark, Delta Lake, … releases? As always, it remains important to be careful with the answers.
With the Databricks Assistant we have a new tool available that will help users access their data assets. As the Assistant integrates seamlessly into the Databricks workspace, it can be used by data engineers, analysts and scientists to speed up their day-to-day tasks. While there are currently many AI tools available for programming, this one stands out because it automatically includes almost all the business knowledge that is kept inside your lakehouse.
data & analytics consultant @ Aivix