Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 1 of 2)

On March 1, 2017 (updated February 1, 2018). By Roy Kim (MVP). In Azure Data Platform, Azure PaaS. 1 Comment.

Part 1 of 2: Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 1 of 2): take a CSV file and analyze it with a U-SQL script in Azure Data Lake. Part 2 of 2: Text Analytics of Movie Reviews using Azure Data Lake, Cognitive Services and Power BI (part 2 of 2) …

HiveQL Group By and Views with Visual Studio and HDInsight

On February 28, 2017 (updated April 30, 2017). By Roy Kim (MVP). In Azure Data Platform.

This article is for beginners looking to understand the developer experience in Visual Studio when working with Hive tables in HDInsight. I developed the following HiveQL statements. My cluster is an HDInsight Spark 2.0 cluster. Before executing these statements, I have the database and tables: The crimes table data looks like: Let’s query the table with …

Query Hive Tables with Ambari Hive Views in HDInsight

On February 9, 2017 (updated April 30, 2017). By Roy Kim (MVP). In Azure Data Platform.

This is an introductory walkthrough of querying Hive tables and visualizing the data in the Ambari Hive View. It is another option for building and debugging HiveQL, besides Visual Studio with the Azure Data Lake Tools plugin. In my blog article Populating data into hive tables, I demonstrated populating internal and external Hive …

Azure Search: Pushing Content to an Index with the .NET SDK.

On February 4, 2017 (updated February 1, 2018). By Roy Kim (MVP). In Azure Data Platform, Azure IaaS, Azure PaaS. 2 Comments.

Blog Series: Azure Search Overview; Pushing Content To An Index with the .NET SDK. I hold the opinion that for a robust indexing strategy, you would likely end up writing a custom batch application between your desired data sources and your defined Azure Search index. The pull method currently only supports data sources that reside …

Azure Search Overview

On February 2, 2017 (updated November 6, 2018). By Roy Kim (MVP). In Azure, Azure Data Platform, Azure PaaS. 2 Comments.

Blog Series: Azure Search Overview; Pushing Content To An Index with the .NET SDK. Azure Search is a platform-as-a-service offering. It requires code and configuration to set up and use. Applicable corporate scenarios: enterprise search across many repositories of data or files that are intended to be available to a wide audience. A lightweight one-stop …

Working with Hive Tables in Zeppelin Notebook and HDInsight Spark Cluster

On January 31, 2017 (updated April 30, 2017). By Roy Kim (MVP). In Azure Data Platform.

Zeppelin notebooks are a web-based editor for data developers, analysts and scientists to develop their code (Scala, Python, SQL, ...) interactively and also visualize the data. I will demonstrate simple notebook functionality: querying data in Hive tables, aggregating the data and saving it to a new Hive table. For more details, read …

The Effects of Dropping Internal and External Hive Tables in HDInsight and ADLS

On January 7, 2017 (updated February 1, 2018). By Roy Kim (MVP). In Azure Data Platform.

In my blog post Populating Data into Hive Tables in HDInsight, I demonstrated populating an internal and an external Hive table in HDInsight. The primary storage is configured with Azure Data Lake Store. To see the differences, I will demonstrate dropping both types of tables and observe the effects. This is for the beginner audience. To recap …

Populating Data into Hive Tables in HDInsight

On December 15, 2016 (updated April 30, 2017). By Roy Kim (MVP). In Azure Data Platform. 2 Comments.

Objective: Load a CSV file into an internal and an external Hive table in HDInsight. See my blog post on creating Hive tables, Creating Internal and External Hive Tables in HDInsight. I have obtained a 1.4GB CSV file on US city crimes data from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4. My HDInsight cluster is configured to use Azure Data Lake Store …

Creating Internal and External Hive Tables in HDInsight

On December 10, 2016 (updated April 30, 2017). By Roy Kim (MVP). In Azure Data Platform. 3 Comments.

Objective: Create an internal and an external Hive table in HDInsight, based on the schema of a CSV file on US city crime: https://catalog.data.gov/dataset/crimes-2001-to-present-398a4. Building Hive tables establishes a schema on the flat files that I have stored in Azure Data Lake Store. This will allow me to do SQL-like queries with HiveQL on that …

Create HDInsight Spark Cluster with Azure Data Lake Store

On December 3, 2016 (updated April 24, 2017). By Roy Kim (MVP). In Azure, Azure Data Platform, Azure PaaS.

The Spark cluster is one of several cluster types offered through the HDInsight platform-as-a-service. Its distinguishing capability is in-memory processing, which gives an overall performance benefit over the Hadoop cluster type. As a result, you can build big data analytics applications. For a further overview, read https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview. I will walk through and comment …

Roy Kim on Azure and AI

Category: Azure Data Platform