Work with Spark and Scala in IntelliJ IDEA – Part 2

In this article, we will see how to create an Azure HDInsight Spark cluster on the Azure portal, build a simple postal code application in IntelliJ IDEA with Scala, and then execute it on the Spark cluster.

In my previous article, we already saw how to set up Scala in IntelliJ and get data from a local CSV file. We created a Spark Scala project and learned the steps for executing it in the local environment. If you have any doubts, please refer to that article.

Step 1 – Create an HDInsight Spark cluster on the Azure portal

Log in to the Azure portal, choose “Create a resource”, and select “Analytics” >> “HDInsight”.

Give a name to your Spark cluster and then configure the Cluster type. Please select “Spark” as a cluster type and choose the default version. Here, the version is Spark 2.2.0.

Now, we can choose the default storage account. Every HDInsight cluster requires a storage account. If you do not have a storage account, please choose the “Create New” option; it will automatically create a storage account along with the cluster. I already have a storage account, so I selected it. You can add multiple storage accounts to this cluster, but Spark can process data from associated storage accounts only. Please note that HDInsight will automatically create a default blob container in the default storage account.

Next, we can choose the cluster size. By default, it has 4 worker nodes. A Spark cluster must have driver (head) nodes and can have multiple worker nodes. Since we are using this cluster only for testing purposes, I chose just 1 worker node. The cost will vary depending on the number of worker nodes we use.

We can choose the node size now. I am opting for the D12 v2 node size. It has 4 cores and 28 GB of RAM per node, with a 200 GB local SSD. This is enough for our testing purposes.

We now get the cluster summary: 2 driver (head) nodes and 1 worker node.

Please click the “Create” button. It will take around 15 to 20 minutes to set up the cluster, depending on network traffic. Once the cluster is created successfully, please go to the cluster dashboard to see all the details about it.

Step 2 – Upload CSV file to default container in default storage account

In this article, we will process the data from a postal code CSV file, which we have already downloaded. We can upload this CSV file to the storage account that is already associated with our Spark cluster. Please use the Storage Explorer feature (currently in preview) to upload the CSV file.

Please choose the default container associated with our Spark cluster and create a new virtual directory.

We will upload the CSV file to this directory. Please use the SAS (Shared Access Signature) authentication type to upload the file.

Step 3 – Run Spark application in HDInsight Spark cluster using IntelliJ IDEA

We can open the IntelliJ project which we created earlier. Please refer to my previous article for creating an IntelliJ project with Spark and Scala; we already saw there how to get data from a CSV file in a local environment. Click the “Tools” menu and choose “Azure” to sign in. This will open a popup screen; please sign in with your Azure credentials.

We can make a small change to our Scala object file “indianpincode.scala”. Please replace the code with the code below.

package sample  

import org.apache.spark.sql.SparkSession  

object indianpincode {  
    def main(arg: Array[String]): Unit = {  
        val sparkSession: SparkSession = SparkSession.builder.appName("Scala Spark Example").getOrCreate()  
        // Read the CSV file from the Blob container associated with the cluster
        val csvPO = sparkSession.read.option("inferSchema", true).option("header", true).csv("wasb:///sparksample/all_india_PO.csv")  
        // Register the DataFrame as a temporary view so it can be queried with SQL
        csvPO.createOrReplaceTempView("tabPO")  
        //val count = sparkSession.sql("select * from tabPO").count()  
        sparkSession.sql("select statename as StateName, count(*) as TotalPOs from tabPO group by statename order by count(*) desc").show(50)  
    }  
}
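The group-by-and-count logic in the query can be illustrated on a plain Scala collection, independent of Spark. The state names below are made-up sample values, purely for illustration:

```scala
object GroupByExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical sample of "statename" column values
    val states = Seq("Kerala", "Tamil Nadu", "Kerala", "Karnataka", "Kerala", "Tamil Nadu")

    // Same idea as: select statename, count(*) ... group by statename order by count(*) desc
    val counts = states
      .groupBy(identity)
      .map { case (state, rows) => (state, rows.size) }
      .toSeq
      .sortBy { case (_, n) => -n }

    counts.foreach { case (state, n) => println(s"$state: $n") }
  }
}
```

Spark performs the same aggregation, but distributed across the worker nodes of the cluster.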

We just changed the file location to our Blob storage location: “wasb:///sparksample/all_india_PO.csv”.

“sparksample” is the virtual directory we already created in our default storage account.

We can access the CSV file from this directory using the Spark cluster.
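For reference, the shorthand “wasb:///” path used above resolves against the default container of the cluster’s default storage account. A fully qualified path can also be used; the container and account names below are hypothetical placeholders, not values from this article:

```scala
object WasbPathExample {
  def main(args: Array[String]): Unit = {
    // Shorthand form: resolves to the default container of the default storage account
    val shortPath = "wasb:///sparksample/all_india_PO.csv"

    // Fully qualified form; "mycontainer" and "mystorageacct" are hypothetical names
    val container = "mycontainer"
    val account = "mystorageacct"
    val fullPath = s"wasb://$container@$account.blob.core.windows.net/sparksample/all_india_PO.csv"

    println(shortPath)
    println(fullPath)
  }
}
```

The fully qualified form is useful when reading from an associated storage account other than the default one.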

Now we can submit the application to the Spark cluster. Right-click the Scala object (indianpincode.scala) and choose the “Submit Spark Application to HDInsight” option.

It will open a new window showing our Spark cluster name. We can choose the Scala object name as the “Main class name” and leave the Artifact name as it is. If needed, we can specify the number of executors, driver memory, executor memory, driver cores, and executor cores for each Spark job submission. Here, we do not supply any parameters.
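As a sketch, the same kind of resource settings can also be supplied in code when building the Spark session, via standard Spark configuration keys. The values below are illustrative only, not recommendations; this fragment assumes it runs inside a Spark application:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource settings via Spark configuration keys
val spark = SparkSession.builder
  .appName("Scala Spark Example")
  .config("spark.executor.instances", "2")  // number of executors (illustrative)
  .config("spark.executor.memory", "4g")    // memory per executor (illustrative)
  .config("spark.executor.cores", "2")      // cores per executor (illustrative)
  .config("spark.driver.memory", "4g")      // driver memory (illustrative)
  .getOrCreate()
```

Settings passed at submission time (as in the IntelliJ dialog) generally take precedence over values hard-coded this way, so leaving them out of the code keeps the job flexible.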
After some time, we will get the execution result. If the job is submitted successfully, we can see the result link. If there are any execution errors, we can also see the error logs by clicking the link. In our case, the Spark job completed successfully, so we can click the link and see the result.

If you click the link, it will open the Ambari window (Ambari is a console provided by HDInsight), where we can again open the logs link.

Please click the “stdout” link and it will show our query result. We have got all the state-wise post office information.

If you open the Spark cluster dashboard on the Azure portal, you can see an Ambari Home link. In Ambari Home, click the “YARN” menu, then “Quick Links”, and click the “ResourceManager UI” link.

This Resource Manager shows the status of all submitted Spark jobs. You can click the link provided with each job to see more details.

Step 4 – Delete the Spark cluster after usage

We have successfully executed our Scala code on the Spark cluster and processed data from a CSV file located in Blob storage. Now we can delete the cluster. It is very important to delete the cluster after usage; otherwise, we will continue to be charged for it. Please click the “Delete” button.

You must confirm the deletion by typing the Spark cluster name in the box provided.

In this article, we saw how to create an HDInsight Spark cluster on the Azure portal, and we uploaded a postal code CSV file to the blob storage account already associated with the Spark cluster. Later, we connected Azure with IntelliJ IDEA and executed a Spark job from IntelliJ. We also saw the job result on the Ambari portal.

We will discuss more details of HDInsight in upcoming articles.
