Unlocking the power of AutoML in Databricks: simplifying machine learning workflows

In today’s data-driven world, organizations are constantly seeking ways to extract insights and value from their data. Machine learning (ML) has emerged as a powerful tool to unlock these insights and drive business decisions. However, building and deploying ML models can be a complex and time-consuming process, often requiring expertise in data science and programming. This is where Automated Machine Learning (AutoML) comes into play, offering a streamlined approach to model development and deployment.

What is AutoML?

AutoML automates the end-to-end process of applying machine learning to real-world problems, including data preprocessing, feature engineering, model selection, hyperparameter tuning, and model evaluation. By automating these tasks, AutoML platforms aim to democratize machine learning, enabling users with varying levels of expertise to build high-quality models quickly and efficiently.

Benefits of AutoML in Databricks

  • Increased Efficiency: AutoML tools in Databricks automate repetitive tasks, allowing data scientists to focus on higher-level aspects of model development.
  • Reduced Time to Deployment: With AutoML, models can be developed in a fraction of the time compared to traditional methods, accelerating time-to-insight and time-to-market.
  • Improved Model Performance: AutoML platforms leverage advanced algorithms and techniques to explore a wide range of models and hyperparameters, leading to optimized model performance.
  • Democratization of ML: AutoML empowers users across the organization, regardless of their technical background, to leverage the power of machine learning for decision-making.

Getting Started with AutoML in Databricks

Setting Up Databricks Environment:

If you haven’t already, set up a Databricks environment with access to AutoML capabilities; this requires the Databricks Runtime for Machine Learning (Databricks Runtime ML).

The Databricks Runtime for Machine Learning is a pre-configured environment designed for seamless machine learning and data science tasks. It comes equipped with a range of external libraries such as TensorFlow, PyTorch, Horovod, scikit-learn, and XGBoost. Additionally, it offers specialized enhancements aimed at optimizing performance. These enhancements include features like GPU acceleration for XGBoost, distributed deep learning via HorovodRunner, and the capability for model checkpointing using a Databricks File System (DBFS) FUSE mount.

Databricks Runtime ML can be selected while setting up the cluster. A guide to setting up the cluster and the runtime can be found in the Databricks documentation.

Data used

In this blog post, we use data extracted from the Elia open data API: https://opendata.elia.be/pages/home/.

Data Preparation and feature store

The first step in any machine learning project is to ensure that your data is clean, properly formatted, and accessible within Databricks. This step is crucial for the success of the project. Once the data is clean and formatted, save the dataset in a table in the Databricks catalog. In this blog we enable the Feature Store, in which our clean data can be saved. Databricks Feature Store integrates seamlessly with AutoML, providing a centralized repository for managing and serving features for ML models.

The Feature Store ensures consistency and reproducibility in feature engineering, enabling users to reuse and share feature definitions across projects and teams.

By incorporating the Feature Store into the data preparation pipeline, users can accelerate model development and deployment while maintaining data integrity and governance.

To create a table in the Feature Store, we first import the Feature Store Client (FSC) and then create an instance of it. The create table function has four main arguments: the name of the table, the primary key, the schema, and a description.

Figure 1: Creation of feature store table
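The steps above can be sketched as follows. This is a minimal sketch assuming Databricks Runtime ML; the table name, schema, and column names are illustrative placeholders, not the blog’s actual dataset:

```python
# Available on Databricks Runtime ML, where databricks.feature_store is installed.
from databricks.feature_store import FeatureStoreClient
from pyspark.sql.types import StructType, StructField, TimestampType, DoubleType

fs = FeatureStoreClient()  # instance of the Feature Store Client

# Hypothetical schema for the cleaned Elia data.
schema = StructType([
    StructField("timestamp", TimestampType(), nullable=False),
    StructField("total_load", DoubleType(), nullable=True),
])

# The four main arguments: table name, primary key, schema, and description.
fs.create_table(
    name="elia.forecasting.load_features",  # hypothetical table name
    primary_keys=["timestamp"],
    schema=schema,
    description="Cleaned Elia grid load data prepared for forecasting",
)
```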

After creating the table, we write the data to the table:

Figure 2: Write the data to the table in the feature store
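In code, this step might look like the sketch below, assuming Databricks Runtime ML and a cleaned Spark DataFrame named `elia_df`; the table name is a placeholder:

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# `elia_df` is assumed to be the cleaned Spark DataFrame from the preparation step.
fs.write_table(
    name="elia.forecasting.load_features",  # hypothetical table name
    df=elia_df,
    mode="merge",  # upsert rows on the primary key; "overwrite" replaces the table
)
```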

Now that the data is saved in a Feature Store table, we can navigate to the table via the UI by clicking on the Features tab and selecting the corresponding table:

Figure 3: Finding features tab via UI

By clicking on the corresponding table, we can see which features exist in the table, the data type of each feature, which models use these features, and which notebooks consume them.

Now, we retrieve the table from the Feature Store. This is simple and can be done with a single command. Here is an example of retrieving the data in Python:

Figure 4: Read the table from the feature store
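A sketch of the read step (assuming Databricks Runtime ML; the table name is a placeholder):

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# read_table returns a Spark DataFrame containing all features in the table.
transformed_df = fs.read_table(name="elia.forecasting.load_features")
```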

In order to utilize Databricks AutoML, begin by using the read_table function to fetch the data, which requires the name of the table as an argument. This name is provided as a string in the format: ‘namespace.database_name.table_name’. Once the data is read into the notebook using this function, it can then be seamlessly passed to AutoML. Databricks AutoML simplifies the machine learning process for your dataset. By supplying the dataset and specifying the prediction target, AutoML takes over the subsequent steps. It manages data preparation, conducts trials to create and fine-tune multiple models, and records the results for further analysis.

Model Training

Now we come to the stage of utilising Databricks’ AutoML.
Databricks’ AutoML contains a variety of machine learning algorithms and architectures. An overview of these models can be found in the official Databricks documentation. You can experiment with different models to find the best fit for your data and problem domain. In this blog we try to forecast values. A well-known forecasting model is Prophet, which is available in the Databricks AutoML library. To run an experiment and train a Prophet model, only one line of code is needed:

Figure 5: Train a forecasting model using AutoML

The forecast function takes multiple arguments:

  1. transformed_df: DataFrame containing transformed data for forecasting.
  2. target_col: Name of the target column to be forecasted.
  3. time_col: Name of the column representing timestamps.
  4. horizon: Number of future time steps to forecast.
  5. frequency: Frequency of time series data (e.g., “min” for minute-level data).
  6. primary_metric: Evaluation metric for model selection and evaluation.
  7. output_database: Name of the database to store forecast results.

Databricks also makes it possible to run an experiment using the Experiments tab in the UI. This allows people without coding experience to train models in Databricks.

Go to the Experiments page and, at the top right, click “Create AutoML experiment”:

Figure 6: Create AutoML experiment button

Clicking “Create AutoML experiment” takes us to a page where we can fill in the fields needed to train the model, such as the cluster on which to run the experiment, the ML problem type, the source dataset, the target column we are trying to predict, and the name of the experiment. There are also advanced options such as the evaluation metric, the stopping condition, and an intermediate storage location. Once these fields are filled in, we can start the experiment.

There is also an optional button to join features from another table with the features used to train the model. This can be useful when you want to train a model with features from different tables.

Now we click “Start AutoML” and let AutoML experiment with different models and tune the hyperparameters. You will automatically be redirected to the experiment overview page.

Figure 7: Set up an AutoML experiment via UI

Once AutoML is done, you will see an overview of the models trained during the experiment and their metrics.

Model Evaluation

In Databricks, evaluating models is a guided experience. Built-in tools visualize metrics across cross-validation or holdout datasets. With AutoML, Databricks orchestrates the experiments, creates a data exploration notebook, and identifies the best models. The Experiments tab lists all experiments together with the best models and their metrics, empowering us to turn data into actionable decisions.

When you run the AutoML experiment using the UI, you will see the different models and their metrics listed on the experiment page.

If you are running the AutoML experiment via code in a notebook, you can navigate to the experiment by clicking on the experiment link in the output of the cell.

Figure 8: Go to the experiment details

As mentioned before, once you are on the experiment page, you can see the different models trained during the experiment, along with their metrics, notebooks, and duration. The models are sorted by the value of the evaluation metric, with the best model listed on top. One of the powerful features of Databricks AutoML is the presence of two buttons: “View notebook for best model” and “View data exploration notebook”. In summary, the former contains the code and steps used to train the top-performing model, while the latter covers the exploratory analysis performed on the dataset before model training. Both notebooks play crucial roles in the machine learning pipeline, providing insights into the data and facilitating the development of effective models.
These two notebooks serve as foundational resources for advancing the model, letting data scientists focus on the broader aspects of the domain rather than on coding details.

Figure 9: Experiment results

In Databricks, model evaluation is more than numbers – it’s a journey of discovery and innovation.

Deployment

You can register and deploy your model with the AutoML UI:

  1. Select the link in the Models column for the model you want to register. When a run completes, the best model (based on the primary metric) is in the top row.
  2. Select the Register model button to register the model in the Model Registry.
  3. Select Models in the sidebar to navigate to the Model Registry.
  4. Select the name of your model in the model table.
  5. From the registered model page, you can serve the model with Model Serving.
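Registration can also be done programmatically with MLflow. Below is a hedged sketch in which the run ID and registered model name are placeholders, not values from this blog:

```python
import mlflow

# e.g. the MLflow run ID of the best AutoML trial; a placeholder here.
run_id = "<mlflow-run-id-of-best-trial>"
model_uri = f"runs:/{run_id}/model"  # AutoML logs the trained model under "model"

# Creates the registered model (or a new version of it) in the Model Registry.
registered = mlflow.register_model(model_uri=model_uri, name="elia_load_forecaster")
print(registered.name, registered.version)
```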

Best Practices for AutoML in Databricks

  • Data Quality: Invest time in data quality and preprocessing to ensure the accuracy and reliability of your models.
  • Iterative Development: Treat model development as an iterative process, continuously refining and improving your models based on feedback and new data.
  • Collaboration: Foster collaboration between data scientists, data engineers, and domain experts to leverage diverse perspectives and expertise.
  • Monitoring and Maintenance: Establish processes for monitoring model performance in production and for retraining models as needed to adapt to changing data distributions and business requirements.

Conclusion

By leveraging AutoML capabilities in Databricks, data teams can accelerate their ML initiatives. Whether you’re a seasoned data scientist or a business analyst, AutoML in Databricks lets you quickly test several models and gain a first insight into how well your features predict your target.

Tom Thevelein

Technical Lead @ Aivix