Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster

Zeppelin notebooks provide a web-based editor for data developers, analysts, and scientists to develop code (Scala, Python, SQL, and more) in an interactive fashion and to visualize data.

I will demonstrate basic notebook functionality: querying data in Hive tables, aggregating the data, and saving the results to a new Hive table.

My HDInsight Spark 2.0 cluster is configured with Azure Data Lake Store as the primary data storage. The Zeppelin notebook has to be installed through the Add Service wizard.

To start off, create a new note.

A new note starts with one empty paragraph where you can write code; each paragraph has a results section that displays output and error messages. A note can have many paragraphs, and all paragraphs share the same context.

Show the tables in the usdata database. Hit Shift+Enter to run the paragraph, or hit the play button at the top right.

I create another paragraph with code that queries a Hive table, displays the first 2 rows, and counts the number of records.

%livy.spark declares the interpreter. The programming language is Scala.
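A paragraph combining the directive and code might look like the following minimal sketch; the directive on the first line routes the paragraph to the Livy Spark interpreter, and everything after it is ordinary Scala:

```
%livy.spark
// the first line selects the interpreter; the rest is Scala
spark.sql("SHOW DATABASES").show()
```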

crimesDF is a data frame object, which holds the data in named columns according to the schema of the Hive table it is queried from.
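The named-column idea can be sketched in plain Scala (a hypothetical analogy, not the Spark API): each record carries its field names, so values are read by name rather than by position.

```scala
// Hypothetical analogy for named columns: a case class plays the role
// of the Hive schema, so fields are read by name, like DataFrame columns.
case class Crime(id: Long, year: Int, primarytype: String)

val rows = Seq(Crime(1L, 2015, "THEFT"), Crime(2L, 2016, "BATTERY"))

// comparable to crimesDF.select($"primarytype")
val types = rows.map(_.primarytype)
println(types.mkString(", "))   // THEFT, BATTERY
```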

The count is 6,312,976 records. The execution took about 21 seconds, which I consider quite fast.

In this paragraph, I:

  • run a query on the crimesDF data frame created in the previous paragraph; crimebytypeDF will contain an aggregation of the number of crimes by primary type and year,
  • save the result into a new Hive table using the saveAsTable function,
  • show the first two rows for testing,
  • convert the data frame to an RDD, for future scenarios that apply more sophisticated transformations (an RDD loses the named-column support of data frames),
  • print the first 5 rows of the RDD for testing.
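The aggregation in these steps can be approximated with plain Scala collections (a sketch of the logic only, with made-up sample data, not the Spark API): filter to years before 2017, group by (primarytype, year), and count distinct ids per group.

```scala
// Hypothetical sketch of the aggregation logic using plain collections.
case class Crime(id: Long, year: Int, primarytype: String)

val crimes = Seq(
  Crime(1L, 2015, "THEFT"),
  Crime(2L, 2015, "THEFT"),
  Crime(2L, 2015, "THEFT"),   // duplicate id: counted once
  Crime(3L, 2016, "BATTERY"),
  Crime(4L, 2017, "THEFT")    // dropped by the year filter
)

// where($"year" < 2017) … groupBy($"primarytype", $"year") … countDistinct($"id")
val crimeByType = crimes
  .filter(_.year < 2017)
  .groupBy(c => (c.primarytype, c.year))
  .map { case (key, rows) => key -> rows.map(_.id).distinct.size }

println(crimeByType)
```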


This execution took 1 minute and 52 seconds, which is again quite fast.
The full code:

%livy.spark

import org.apache.spark.sql.functions._   // for countDistinct

spark.sql("Use usdata")
spark.sql("Show tables").show()

// query the hive table, display the first 2 rows and count the records
val crimesDF = spark.sql("SELECT * FROM crimes")
crimesDF.show(2, false)
crimesDF.count()

// aggregate the number of crimes by primary type and year
val crimebytypeDF = crimesDF.select($"id", $"year", $"primarytype").
    where($"year" < 2017).groupBy($"primarytype", $"year").
    agg(countDistinct($"id") as "Count")

// save the data frame of results into a new hive table
// (the table name here is illustrative)
crimebytypeDF.write.saveAsTable("crimebytype")

// display the first 2 rows
crimebytypeDF.show(2, false)

// convert the data frame to an RDD and print the first 5 rows;
// an RDD allows more control over the data for transformations.
val crimebytypeRDD = crimebytypeDF.rdd
crimebytypeRDD.take(5).foreach(println)
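After the conversion to an RDD, records are addressed by position rather than by column name. A plain-tuple sketch of the difference (with hypothetical values standing in for a Row of the results):

```scala
// A tuple stands in for one record of the RDD: (primarytype, year, Count).
// With the DataFrame you could refer to $"primarytype"; here only position works.
val row = ("THEFT", 2015, 42L)
val primaryType = row._1   // positional access, no column name
println(s"$primaryType: ${row._3}")   // THEFT: 42
```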

I have shown a brief, basic example of using the Zeppelin notebook to manage data in Hive tables for ad hoc data analysis scenarios. I like the fact that queries and operations are much faster than in Hive View in Ambari or in Visual Studio with the Azure Data Lake Tools plugin. I also like the notebook ‘paragraph’ concept, as I can run small chunks of code rather than working through separate files or commenting and uncommenting lines of code. I would start by writing the core logic of my big data applications in notebooks, then bring it into IntelliJ or Visual Studio, which are more suitable for larger-scale application development.

