Blog Series:
- Creating Azure Data Lake
- PowerShell and Options to upload data to Azure Data Lake Store
- Using Azure Data Lake Store .NET SDK to Upload Files
- Creating Azure Data Lake Analytics
- Azure Data Lake Analytics: Database and Tables
- Azure Data Lake Analytics: Populating & Querying Tables
- Azure Data Lake Analytics: How To Extract JSON Files
- Azure Data Lake Analytics: U-SQL C# Programmability
- Azure Data Lake Analytics: Job Execution Time and Cost
The ability to read files from Azure Data Lake depends on the U-SQL built-in extractors:
- Extractors.Text() : Provides extraction from delimited text files of different encodings.
- Extractors.Csv() : Provides extraction from comma-separated value (CSV) files of different encodings.
- Extractors.Tsv() : Provides extraction from tab-separated value (TSV) files of different encodings.
An example from my previous blog post using the CSV Extractor:
```
@csvData =
    EXTRACT policyID int,
            statecode string,
            county string,
            eq_site_limit float,
            hu_site_limit float,
            fl_site_limit float,
            fr_site_limit float,
            tiv_2011 float,
            tiv_2012 float,
            eq_site_deductible float,
            hu_site_deductible float,
            fl_site_deductible float,
            fr_site_deductible float,
            point_latitude float,
            point_longitude float,
            line string,
            construction string,
            point_granularity int?
    FROM "/MyData/FL_insurance_sample.csv"
    USING Extractors.Csv(skipFirstNRows : 1);
```
I need to work with JSON files; however, there is no built-in support for JSON, which I found surprising given how much internet data comes from APIs as JSON these days. There is no built-in XML extractor either.
To solve this challenge, I found a solution in the Azure/usql GitHub repo: https://github.com/Azure/usql. It has extractors for both JSON and XML: https://github.com/Azure/usql/tree/master/Examples/DataFormats/Microsoft.Analytics.Samples.Formats
The readme has instructions on how to register the assemblies, along with sample scripts.
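For reference, registration boils down to a few U-SQL statements. Below is a minimal sketch, assuming the two DLLs were built from the repo and uploaded to an /Assemblies folder in the store, and that the target database (here, my JobPostingsBigData database) already exists:

```
// Minimal registration sketch. Assumes the DLLs were built from the
// Azure/usql samples and uploaded to /Assemblies in the Data Lake Store,
// and that the JobPostingsBigData database already exists.
USE DATABASE JobPostingsBigData;

DROP ASSEMBLY IF EXISTS [Newtonsoft.Json];
CREATE ASSEMBLY [Newtonsoft.Json]
    FROM "/Assemblies/Newtonsoft.Json.dll";

DROP ASSEMBLY IF EXISTS [Microsoft.Analytics.Samples.Formats];
CREATE ASSEMBLY [Microsoft.Analytics.Samples.Formats]
    FROM "/Assemblies/Microsoft.Analytics.Samples.Formats.dll";
```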
This is how it looks after I registered the two assemblies:
The U-SQL script:
```
REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];

DECLARE @inputfile string = "/jobpostings/postings-{*}.json";

@jobPostingsSchema =
    EXTRACT jobtitle string,
            company string,
            city string,
            state string,
            country string,
            date string,
            snippet string,
            url string,
            latitude float,
            longitude float
    FROM @inputfile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");

OUTPUT @jobPostingsSchema
TO "/jobpostings/JSONoutput.csv"
USING Outputters.Csv();
```
The REFERENCE ASSEMBLY statements make use of the assemblies registered in the JobPostingsBigData database. Ensure that the script runs against this database.
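One way to make that explicit is to switch the database context at the top of the script; a small sketch (without it, U-SQL resolves the references against the master database):

```
// Switch to the database holding the registered assemblies before
// referencing them; otherwise U-SQL resolves against master.
USE DATABASE JobPostingsBigData;

REFERENCE ASSEMBLY [Newtonsoft.Json];
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
```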
The extractor reads all files that match the file set expression /jobpostings/postings-{*}.json; my file names end with a date.
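As a side note, a file set can also surface part of the file name as a virtual column. The sketch below assumes the files follow a postings-YYYY-MM-DD.json naming pattern; the filedate column name is my own:

```
// Sketch: expose the date embedded in each file name as a virtual column,
// assuming files are named postings-YYYY-MM-DD.json (filedate is hypothetical).
DECLARE @inputfile string = "/jobpostings/postings-{filedate:yyyy}-{filedate:MM}-{filedate:dd}.json";

@jobPostingsSchema =
    EXTRACT jobtitle string,
            company string,
            city string,
            state string,
            country string,
            date string,
            snippet string,
            url string,
            latitude float,
            longitude float,
            filedate DateTime   // populated from the file set pattern
    FROM @inputfile
    USING new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]");
```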
The JSON file structure:
```
{
  "TotalResults": 41,
  "PageNumber": 0,
  "Results": [
    {
      "jobtitle": "Junior Web Developer",
      "company": "Company X",
      "city": "Calgary",
      "state": "AB",
      "country": "CA"
    },
    {……}
  ]
}
```
I am only interested in extracting the Results collection. Therefore, I pass the JSONPath "Results[*]" to the extractor: new Microsoft.Analytics.Samples.Formats.Json.JsonExtractor("Results[*]"). This extracts each object in the collection as a row, matching JSON property names to the column names in the EXTRACT schema.
I can then take the rowset in the @jobPostingsSchema variable and write it either to a flat file or to a table. In this script, I chose to store it in a CSV file.
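Both options are only a few lines. The sketch below writes the CSV with a header row and, as an alternative, inserts into a table; the JobPostings table name is hypothetical and assumed to already exist with a matching schema:

```
// Option 1: CSV output with a header row.
OUTPUT @jobPostingsSchema
TO "/jobpostings/JSONoutput.csv"
USING Outputters.Csv(outputHeader : true);

// Option 2: insert into a table (JobPostingsBigData.dbo.JobPostings is
// hypothetical and assumed to exist with a matching schema).
INSERT INTO JobPostingsBigData.dbo.JobPostings
SELECT * FROM @jobPostingsSchema;
```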
After submitting the script, this is the file preview of the output flat file:
The JSON extractor is applicable whenever data is retrieved from REST APIs, where JSON has become the de facto standard. It would be good to see JSON support become built-in as part of a richer set of U-SQL capabilities.