HiveQL Group By and Views with Visual Studio and HDInsight

This article is for beginners looking to understand the developer experience in Visual Studio and working with hive tables in HDInsight.

I developed the following HiveQL statements. My cluster is HDInsight Spark 2.0 cluster.

USE USData;

CREATE TABLE IF NOT EXISTS CrimesGroupByType
ROW FORMAT DELIMITED
FIELDS TERMINATED by ',' stored as textfile
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryTYpe, Year;

CREATE VIEW IF NOT EXISTS CrimesGroupByType_View
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryTYpe, Year;

Before executing these statements, I have the database and tables:
Hive Group By and Views with Visual Studio and HDInsight-1

The crimes table data looks like:
Hive Group By and Views with Visual Studio and HDInsight-2

Let’s query the table with a count aggregation with Group By clause and save the data into a new hive table.

CREATE TABLE IF NOT EXISTS CrimesGroupByType
ROW FORMAT DELIMITED
FIELDS TERMINATED by ',' stored as textfile
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryTYpe, Year;

As a result a new hive table is created with the resulting data
Hive Group By and Views with Visual Studio and HDInsight-3Hive Group By and Views with Visual Studio and HDInsight-4

For scenarios where a view to encapsulate the same query is more appropriate:

CREATE VIEW IF NOT EXISTS CrimesGroupByType_View
AS
SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes
WHERE year <= 2017
GROUP BY PrimaryTYpe, Year;

As a result:
Hive Group By and Views with Visual Studio and HDInsight-5

If we look into the file system in Azure Data Lake Store under \hive\warehouse:
Hive Group By and Views with Visual Studio and HDInsight-6

We see crimesgroupbytype folder for its hive table. And there is no folder for the hive view.
Clicking into the folder we see the tables own data files
Hive Group By and Views with Visual Studio and HDInsight-7

Clicking into the first file to see its data.
Hive Group By and Views with Visual Studio and HDInsight-8

Creating a view is useful where you want to query for an aggregation where its underlying hive table is changing and dynamic. Compared to creating a new hive table populated with the same query, it will only get the resulting data for that point in time. Both will serve different purposes. Also, note that Power BI will have Hive views selectable.

These are just simple examples to build a data warehouse in Hive with HDInsight. As such, data analysts, data scientists and report builders can work off for conducting the analysis and building solutions.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s