An overview of Azure Databricks Spark Cluster creation

In this article, we will cover the steps for creating Azure Databricks workspace and configuring a Spark cluster. Later, we will create a simple data table from an existing CSV file and query the data using SQL notebook.

Azure Databricks is the most advanced Apache Spark platform. With a high-performance processing engine that’s optimized for Azure, you’re able to improve and scale your analytics on a global scale—saving valuable time and money, while driving new insights and innovation for your organization. Microsoft Azure Databricks offers an intelligent, end-to-end solution for all your data and analytics challenges.

For those who are new to Apache Spark, it is a general framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs for Java, Python, and Scala and consists of Sparkcore and several related projects: SparkSQL – Module for working with structured data.
Other than Databricks, Hortonworks is also providing the Spark platform in Azure, that is included in HDInsight. Cloudera is another vendor who provides the Spark platform independently.

Step 1 – Create the Azure data bricks workspace.

Login to the Azure portal.

Choose to Create a resource and select for Azure Databricks in the filter box. Please choose the workspace name, resource group, and location. If you don’t have a resource group, you can create a new one. Please note that currently, Azure Databricks provides  a 14 day trial premium subscription.

It will take some time to create the workspace and after creation, you can go to the resource and can launch the workspace. It will log in using Single Sign-on.

Now, we are going to create a new Spark Cluster. You can choose the number of workernodes, driver type, worker type, and a number of workers. Please note that every Spark Cluster must have a driver node and we can choose as many worker nodes as per your subscription and free available cores. Here, I chose Standard_DS3_v2 worker type and the same driver type. One node has 4 cores and I chose only 1 worker node for testing purposes.

It will take some time to set up the cluster. If you check the status, you can see the cluster creation status is Pending.

After some time, our cluster will be ready, and status will show as Running.

If you check the resource list, you can see there are 17 new resources created for this cluster.

There is 1 virtual network, 2 virtual machines (1 each for worker and driver), 1 storage account (for meta information, this is a locked resource), 2 public IPs, 1 network security group, 4 network interfaces and 6 disks (3 each for worker and driver).

2 – Create a table using existing CSV file

We can go to the Data tab and click add data button.

It will open a screen and we can browse our existing CSV from the local system.  This is a very simple CSV file containing 3 columns (id, name and age). I gave 5 rows inside this file.

The uploaded file will be stored to FileStore/tables folder. Databricks is following a custom DBFS (Data Bricks File System) file system developed by themselves.

Our csv will be uploaded, and we can choose the create table with UI button. You can choose Create table in Notebook option also. Notebook is a kind of editor where you can run your SQL, Python, R and Scala codes.

After clicking the Create Table with UI button, a table schema editor screen will appear.

Here, you can change the table name, column names and even column schemas also. Here, I chose the first row as header and infer schema options. Since we are creating this table from our existing CSV file I prefer these options. Please note the UI will show some sample data also.

You can click the Create Table button.

Our table is created successfully, and it shows the sample data too.


Step 3 – Querying the data using SQL notebook.

We can query our data using SQL notebook. Currently, there are 4 types of notebooks available in Azure Databricks. Python, Scala, R and SQL. In this article, we use SQL notebook.

Please select Workspace -> Create
-> Notebook -> Choose type as SQL
You can give a valid name to your notebook and click Create button.
We can execute all valid SQL statement in this notebook to query the data.
By default, data is shown as raw table format and if needed, you can view the data visually too. There are various plot options available in the notebook. The below report is shown in Bar chart format for our same data.
Other plot options are shown below.
Hope you got a basic overview on Azure Databricks workspace creation, cluster configuration, table creation and querying the data using SQL notebook. Please note that spark is not used for simple queries. The aim of multiple clusters is to process heavy data with high performance. We can cover more features of Azure Databricks on coming tutorials.

Please feel free to give your valuable
 feedback and queries in the comments.
Please follow and like me:

1 thought on “An overview of Azure Databricks Spark Cluster creation

Leave a Reply

Your email address will not be published. Required fields are marked *