In this article, we will cover the steps for creating an Azure Databricks workspace and configuring a Spark cluster. Later, we will create a simple table from an existing CSV file and query the data using a SQL notebook.
Azure Databricks is an Apache Spark-based analytics platform optimized for Azure. With a high-performance processing engine, you can scale your analytics globally, saving time and money while driving new insights for your organization. It offers an end-to-end solution for data and analytics workloads on Azure.
Step 1 – Create the Azure Databricks workspace
Choose Create a resource and search for Azure Databricks in the filter box. Choose a workspace name, resource group, and location. If you don't have a resource group, you can create a new one. Please note that Azure Databricks currently offers a 14-day trial of the Premium tier.
It will take some time to create the workspace. After creation, you can go to the resource and launch the workspace; it signs you in using single sign-on.
Now, we are going to create a new Spark cluster. You can choose the driver type, worker type, and the number of worker nodes. Please note that every Spark cluster must have a driver node, and you can add as many worker nodes as your subscription and free available cores allow. Here, I chose the Standard_DS3_v2 worker type and the same driver type. Each node has 4 cores, and I chose only 1 worker node for testing purposes.
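If you prefer scripting to the portal UI, the same cluster can be described as a JSON payload for the Databricks Clusters REST API. The sketch below uses the public Clusters API field names; the cluster name, runtime version string, and auto-termination value are example choices, not values from this walkthrough (check the runtime versions list in your workspace for current spark_version strings):

```json
{
  "cluster_name": "demo-cluster",
  "spark_version": "5.3.x-scala2.11",
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "num_workers": 1,
  "autotermination_minutes": 60
}
```

Setting an auto-termination timeout is a common cost-saving choice for test clusters, since idle VMs otherwise keep accruing charges.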
It will take some time to set up the cluster. While provisioning is in progress, the cluster status shows as Pending.
After some time, the cluster will be ready and the status will change to Running.
If you check the resource list, you can see that 17 new resources were created for this cluster.
There is 1 virtual network, 2 virtual machines (1 each for the worker and the driver), 1 storage account (for metadata; this is a locked resource), 2 public IPs, 1 network security group, 4 network interfaces, and 6 disks (3 each for the worker and the driver).
Step 2 – Create a table using an existing CSV file
We can go to the Data tab and click the Add Data button.
It will open a screen where we can browse to an existing CSV file on the local system. This is a very simple CSV file containing 3 columns (id, name, and age) and 5 rows.
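For reference, a CSV with that shape might look like the following; the values here are made up for illustration, and only the three-column header matters:

```csv
id,name,age
1,John,28
2,Mary,34
3,Arun,41
4,Priya,25
5,David,37
```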
Our CSV will be uploaded, and we can choose the Create Table with UI button. You can also choose the Create Table in Notebook option. A notebook is an editor where you can run SQL, Python, R, and Scala code.
Here, you can change the table name, column names, and even the column types. I chose the first-row-is-header and infer-schema options; since we are creating this table from an existing CSV file, these options are convenient. Please note that the UI also shows some sample data.
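The same table can also be created from a notebook instead of the UI. A minimal SQL sketch follows, assuming the uploaded file landed under /FileStore/tables/ and the table is named people (both the path and the table name are assumptions for illustration):

```sql
-- Create a table over the uploaded CSV file.
-- The header and inferSchema options mirror the choices made in the UI.
CREATE TABLE people
USING csv
OPTIONS (
  path "/FileStore/tables/people.csv",
  header "true",
  inferSchema "true"
);
```

With inferSchema enabled, Spark scans the file to detect column types (for example, reading id and age as integers) instead of treating every column as a string.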
You can click the Create Table button.
Our table is created successfully, and it shows the sample data too.
We can query our data using a SQL notebook. Currently, there are 4 types of notebooks available in Azure Databricks: Python, Scala, R, and SQL. In this article, we use a SQL notebook.
Please select Workspace -> Create -> Notebook and choose SQL as the type.
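Inside the SQL notebook, each cell can query the table directly. For example, assuming the table created earlier is named people (the name used for illustration above):

```sql
-- Preview all rows of the table
SELECT * FROM people;

-- Filter on the inferred age column
SELECT name, age
FROM people
WHERE age > 30
ORDER BY age;
```

Attach the notebook to the running cluster before executing the cells; a notebook detached from a cluster cannot run queries.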
Please feel free to give your valuable feedback and queries in the comments.