Populating Data into Hive Tables in HDInsight


Objective:
Populate a csv file to an internal and external Hive table in HDInsight.

See my blog post on create hive tables Creating Internal and External Hive Tables in HDInsight

I have obtained a 1.4GB csv file on US city crimes data from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4

My HDInsight cluster is configured to use Azure Data Lake store as its primary data storage.

  1. Go to Azure Data Lake Store > Data Explorer
  2. Upload csv file twice in 2 separate locations.
    Upload to adl://rkdatalake.azuredatalakestore.net/datasets for internal hive table and adl://rkdatalake.azuredatalakestore.net/datasets/crimes for external hive table. I will explain why further.
    Populating Data into Hive Tables in HDInsight-1
    Populating Data into Hive Tables in HDInsight-2
  3. In Visual Studio with Azure Data Lake Tools plugin, open an existing or new Hive project.
  4. Execute script:
    LOAD DATA INPATH 'adl://rkdatalake.azuredatalakestore.net/datasets/Crimes_-_2001_to_present.csv' OVERWRITE INTO TABLE crimes;
    
  5. Query crimes table to test data load
    Populating Data into Hive Tables in HDInsight-3
  6. Go back to adl://rkdatalake.azuredatalakestore.net/datasets and you will notice that the csv file is no longer present. It has been moved to the hive/warehouse.
    Populating Data into Hive Tables in HDInsight-4
  7. Execute Script to set the external data location.
    ALTER TABLE Crimes_ext SET LOCATION 'adl://rkdatalake.azuredatalakestore.net/datasets/Crimes/'
    

    I found that the location must be a folder containing the csv file(s) and not a specific csv file path.

  8. Execute Script to query external table:
    Populating Data into Hive Tables in HDInsight-5
  9. In contrast to the internal crimes table, the crimes_ext under hive\warehous does not contain any data files. This is to confirm this is by design of an external table
    Populating Data into Hive Tables in HDInsight-6

I have demonstrated the differences in populating data for an internal and an external hive table. For a Q&A discussion between the differences read http://stackoverflow.com/questions/17038414/difference-between-hive-internal-tables-and-external-tables.

In my following article The effects of Dropping Hive Tables, I will show the effects of dropping each type of hive table.


2 thoughts on “Populating Data into Hive Tables in HDInsight

  1. Pingback: Query Hive Tables with Ambari Hive Views in HDInsight – Roy Kim on SharePoint, Azure, BI, Office 365

  2. Pingback: The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS – Roy Kim on SharePoint, Azure, BI, Office 365

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s