Uploading JSON data to an on-premises REST API using Azure Data Factory and Azure Databricks

The goal of this exercise is to read data from CSV files, structure it as JSON, and upload it to a REST API. The catch is that we are uploading this data to an on-premises API from the Azure cloud. The API requires authentication via an access token, and the endpoint that issues the access token is not accessible from outside the private network.

Architecture

Our data destination endpoint is available from the public web. It does, however, require an access token acquired from the authentication endpoint, and that endpoint is only reachable through a self-hosted integration runtime (SHIR).

What is a self-hosted integration runtime?

As described by Microsoft: The integration runtime (IR) is the compute infrastructure that Azure Data Factory and Synapse pipelines use to provide data-integration capabilities across different network environments. A self-hosted integration runtime (SHIR) can run copy activities between a cloud data store and a data store in a private network. It also can dispatch transform activities against compute resources in an on-premises network or an Azure virtual network.

For a SHIR to work, you need an on-premises virtual machine. This machine acts as a gateway for an ExpressRoute connection to your Azure Data Factory and cloud storage.

(Image by Microsoft)

Step 1: Authentication and acquiring an access token

To access the authentication endpoint, we build a web activity in our ADF pipeline. There we enter the URL of the endpoint and the HTTP method we want to use (POST in this case).

We must also enter credentials. Note that it is not safe to store credentials in plain text. The best practice is to use an Azure Key Vault: we call another web activity to retrieve the secret and use it in our current web activity. This process is explained in the following link.

https://learn.microsoft.com/en-us/azure/data-factory/how-to-use-azure-key-vault-secrets-pipeline-activities
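Under the hood, the web activity that fetches the secret is simply an HTTP GET against the Key Vault REST API, authenticated with the data factory's managed identity. Schematically (the values in braces are placeholders for your own vault and secret names):

```
GET https://{vaultName}.vault.azure.net/secrets/{secretName}?api-version=7.0
Authentication: system-assigned managed identity
Resource:       https://vault.azure.net
```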

The following headers are also required by the authentication endpoint. They let the endpoint know in what format we're sending the request and in what format we want the response.

In this web activity, under the advanced settings, we select our self-hosted integration runtime, so the authentication endpoint is resolvable, and our call isn’t blocked by the firewall.

If our call is successful, we’ll get a response that includes our access token.
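As a rough sketch, the call the web activity makes is equivalent to the plain Python below. The URL, the credential field names, and the `access_token` field in the response are assumptions here; adapt them to whatever your authentication endpoint actually expects.

```python
import json
import urllib.request

def get_access_token(auth_url, client_id, client_secret,
                     opener=urllib.request.urlopen):
    """POST credentials to the authentication endpoint and return the token.

    The request/response field names are assumptions; check your API's docs.
    """
    body = json.dumps({"clientId": client_id,
                       "clientSecret": client_secret}).encode("utf-8")
    req = urllib.request.Request(
        auth_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",  # format of our request body
            "Accept": "application/json",        # format we want back
        },
    )
    with opener(req) as resp:
        return json.load(resp)["access_token"]
```

The `opener` parameter only exists so the function can be exercised without a live endpoint; in the pipeline, ADF performs this call for us.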


Step 2: Creating PUT requests in Python

Using dynamic content, we can pass this token from the output of our web activity to the input of our notebook activity. The notebook has to use widgets (dbutils.widgets) to accept this input.
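In the notebook itself, reading the token looks like the sketch below. `dbutils` is provided automatically by the Databricks runtime, and the widget name `access_token` is our own choice; the fallback just makes the sketch runnable outside Databricks as well.

```python
def get_widget(name, default=""):
    """Read an ADF-supplied parameter via a Databricks widget.

    dbutils exists only inside a Databricks notebook; outside it
    (e.g. local testing) we fall back to a default value.
    """
    try:
        return dbutils.widgets.get(name)  # provided by the Databricks runtime
    except NameError:
        return default

access_token = get_widget("access_token")
```

On the ADF side, the matching base parameter of the notebook activity is filled with a dynamic content expression along the lines of `@activity('Get token').output.access_token` (activity and field names are placeholders for your own pipeline).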

After reading the CSV data and structuring it in the correct JSON format, we are left with the following data frame:
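To make the shape of that result concrete, here is a hypothetical version of the structuring step in plain Python. The column names (`stopareaId`, `name`, `lat`, `lon`) and the sample rows are invented for illustration; the real notebook uses Spark DataFrames, but the outcome is the same: one row per stop area, holding its ID and its JSON payload.

```python
import csv
import io
import json

# Invented sample data standing in for the real CSV files.
raw_csv = io.StringIO(
    "stopareaId,name,lat,lon\n"
    "101,Central Station,51.21,4.42\n"
    "102,Market Square,51.22,4.40\n"
)

# Build (stopareaId, json_payload) pairs, one per stop area.
rows = []
for record in csv.DictReader(raw_csv):
    payload = {
        "name": record["name"],
        "location": {"lat": float(record["lat"]), "lon": float(record["lon"])},
    }
    rows.append((record["stopareaId"], json.dumps(payload)))
```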

Each row contains a stop area ID, which corresponds to a distinct node on the data destination REST API, and the JSON data we want to upload to the API. For example, we can create a PUT request to the following URL:

{Data destination endpoint url}/stoparea/{stopareaId}

The body of the PUT request is the JSON data in the second column.

We turn this data frame into an RDD so we can iterate over the rows and make a separate API request for each row.

We create a headers variable that carries the access token in the Authorization header. Then we loop over the rows, using the stopareaId in the URL and the JSON data in the body. Finally, you can add a sleep call between requests to make sure you don't overload the API.
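A minimal sketch of that loop, using only the standard library rather than any particular HTTP client: the `/stoparea/{stopareaId}` path comes from the URL template above, while the bearer-token scheme, the column names, and the sleep interval are assumptions to adapt to your own API.

```python
import json
import time
import urllib.request

def put_stop_area(base_url, stoparea_id, payload, token,
                  opener=urllib.request.urlopen):
    """PUT one stop area's JSON payload to the data destination endpoint."""
    req = urllib.request.Request(
        f"{base_url}/stoparea/{stoparea_id}",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",  # token from the web activity
            "Content-Type": "application/json",
        },
    )
    return opener(req)

# Inside the notebook (illustrative; df has columns stopareaId and json):
# for row in df.rdd.collect():
#     put_stop_area(base_url, row["stopareaId"], json.loads(row["json"]), token)
#     time.sleep(0.2)  # throttle so we don't overload the API
```

As in the earlier sketch, the `opener` parameter is only there so the request building can be verified without a live endpoint.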

Bart Bryon

Data & analytics consultant