This article is for beginners looking to understand the developer experience in Visual Studio and working with hive tables in HDInsight.
I developed the following HiveQL statements. My cluster is HDInsight Spark 2.0 cluster.
USE USData; CREATE TABLE IF NOT EXISTS CrimesGroupByType ROW FORMAT DELIMITED FIELDS TERMINATED by ',' stored as textfile AS SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes WHERE year <= 2017 GROUP BY PrimaryTYpe, Year; CREATE VIEW IF NOT EXISTS CrimesGroupByType_View AS SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes WHERE year <= 2017 GROUP BY PrimaryTYpe, Year;
Before executing these statements, I have the database and tables:
The crimes table data looks like:
Let’s query the table with a count aggregation with Group By clause and save the data into a new hive table.
CREATE TABLE IF NOT EXISTS CrimesGroupByType ROW FORMAT DELIMITED FIELDS TERMINATED by ',' stored as textfile AS SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes WHERE year <= 2017 GROUP BY PrimaryTYpe, Year;
As a result a new hive table is created with the resulting data
For scenarios where a view to encapsulate the same query is more appropriate:
CREATE VIEW IF NOT EXISTS CrimesGroupByType_View AS SELECT count(id) as CrimeCount, PrimaryType AS PrimaryCrime, year YEAR FROM crimes WHERE year <= 2017 GROUP BY PrimaryTYpe, Year;
As a result:
If we look into the file system in Azure Data Lake Store under \hive\warehouse:
We see crimesgroupbytype folder for its hive table. And there is no folder for the hive view.
Clicking into the folder we see the tables own data files
Clicking into the first file to see its data.
Creating a view is useful where you want to query for an aggregation where its underlying hive table is changing and dynamic. Compared to creating a new hive table populated with the same query, it will only get the resulting data for that point in time. Both will serve different purposes. Also, note that Power BI will have Hive views selectable.
These are just simple examples to build a data warehouse in Hive with HDInsight. As such, data analysts, data scientists and report builders can work off for conducting the analysis and building solutions.