Re-Create HDInsight Cluster with Pre-Existing Azure Data Lake Store and Hive Metastore

When developing big data analytics on an HDInsight cluster, the cost can be quite high, around $4 an hour. This adds up quickly, especially when the cluster sits idle most of the time. To save on costs, I delete the cluster and re-create it when needed.

For simple development purposes, my HDInsight cluster depends on

  • Azure Data Lake Store as its primary data storage (it is HDFS-compatible)
  • A Hive metastore in Azure SQL Database, to preserve my Hive tables

Re-creating these manually in the Azure Portal is time-consuming and tedious, so I script the creation of the HDInsight cluster in PowerShell and re-point it to the existing ADLS account and Hive metastore. The cost of those two services is comparatively very small.

I used the following online references, but they don't exactly match my configuration.

Create HDInsight clusters with Data Lake Store as default storage by using PowerShell

New-AzureRmHDInsightCluster

After a lot of trial and error, the following script is a working start.

The high-level logic is

  1. Log in to Azure with an admin account and set the subscription
  2. Set up the ADLS service principal and certificate so that the HDInsight cluster has access to ADLS
  3. Grant permissions to the ADLS folders
  4. Set up the configuration for the Hive metastore DB and the HDInsight cluster identity
  5. Create the HDInsight cluster

PowerShell Script

[code language="powershell"]
# Existing ADLS
$dataLakeStoreName = "rkdatalake"
$subscriptionId = "<Sub Id>"
$resourceGroupName = "rkbigdata"
$myrootdir = "/"
$certificateFilePath = "C:\Users\Roy\Downloads\adls-cert-rkdatalake.pfx"
$certificatePassword = "<Password>"
# Existing Hive MetaStore
$hivemetastoreSqlAzureServerName = ""
$hivemetastoreDBName = "HiveMetaStore"
$hivemetastoreDBUsername = "rkim";
$hivemetastoreDBPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$hivemetastoreCredentials = New-Object System.Management.Automation.PSCredential ($hivemetastoreDBUsername, $hivemetastoreDBPassword)
# HDInsight Cluster (To be created)
$clusterRootPath = $myrootdir+"clusters/rkhdinsight"
$clusterType = "Spark"
$clusterName = "rkhdinsight"
$clusterVersion = "3.5"
$location = "Central US" # Region
$storageRootPath = $clusterRootPath # E.g. /clusters/hdiadlcluster
$clusterNodes = 3 # The number of nodes in the HDInsight cluster
$adminName = "admin"
$adminPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$httpCredentials = New-Object System.Management.Automation.PSCredential ($adminName, $adminPassword)
$sshuserName = "sshuser"
$sshuserPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$sshuserCredentials = New-Object System.Management.Automation.PSCredential ($sshuserName, $sshuserPassword)
$hivemetastoreAdmin = "kimr"
$metastoreadminPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$metastoreadminCredentials = New-Object System.Management.Automation.PSCredential ($hivemetastoreAdmin, $metastoreadminPassword)


# Sign in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated with your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId $subscriptionId

$resourceGroup = Get-AzureRmResourceGroup -Name $resourceGroupName -Location "Central US"
Test-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $clusterRootPath -Force

# ADLS Service Principal and Certificate
$certificatePFX = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $certificatePassword)

# Service Principal which is set to have access to ADLS
$servicePrincipal = Get-AzureRmADServicePrincipal -SearchString rkdatalakestoretemp
$servicePrincipalobjectId = $servicePrincipal.Id
write-host "service principal object Id: " $servicePrincipalobjectId

# Grant the Service Principal permissions to the following 3 folders in ADLS
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $servicePrincipalobjectId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path /clusters -AceType User -Id $servicePrincipalobjectId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path $clusterRootPath -AceType User -Id $servicePrincipalobjectId -Permissions All

$tenantID = (Get-AzureRmContext).Tenant.TenantId

# Setup configuration to existing Hive Metastore Azure SQL DB and HDInsight cluster Identity
# Note: -ErrorVariable takes the variable name without the $ prefix
$azureHDInsightConfig = New-AzureRmHDInsightClusterConfig -Debug -ErrorVariable errorVar `
-ClusterType Spark `
| Add-AzureRmHDInsightMetastore `
-SqlAzureServerName $hivemetastoreSqlAzureServerName `
-DatabaseName $hivemetastoreDBName `
-Credential $hivemetastoreCredentials `
-MetastoreType HiveMetastore `
| Add-AzureRmHDInsightClusterIdentity `
-AadTenantId $tenantID `
-ObjectId $servicePrincipalobjectId `
-CertificateFilePath $certificateFilePath `
-CertificatePassword $certificatePassword

# output any error
Write-Output $errorVar

# Create new HDInsight cluster with config object
New-AzureRmHDInsightCluster `
-ClusterType $clusterType `
-OSType Linux `
-ClusterSizeInNodes $clusterNodes `
-ResourceGroupName $resourceGroupName `
-ClusterName $clusterName `
-HttpCredential $httpCredentials `
-Location $location `
-DefaultStorageAccountType AzureDataLakeStore `
-DefaultStorageAccountName "$dataLakeStoreName.azuredatalakestore.net" `
-DefaultStorageRootPath $clusterRootPath `
-Version $clusterVersion `
-SshCredential $sshuserCredentials `
-AadTenantId $tenantID `
-ObjectId $servicePrincipalobjectId `
-Config $azureHDInsightConfig `
-Debug -ErrorVariable errorVar

# output any error
Write-Output $errorVar
[/code]

Upon executing the script, go to the Azure Portal to see whether the HDInsight cluster is being created. If you see "Applying changes", it is in progress. This takes about 15-20 minutes.
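Instead of watching the portal, you can also poll the provisioning state from PowerShell. A minimal sketch, assuming the same variables as the script above:

[code language="powershell"]
# Poll until the cluster reports Running (give up after ~30 minutes)
for ($i = 0; $i -lt 30; $i++) {
    $cluster = Get-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName `
        -ClusterName $clusterName -ErrorAction SilentlyContinue
    if ($cluster -and $cluster.ClusterState -eq "Running") {
        Write-Host "Cluster is running."
        break
    }
    Write-Host "Waiting for cluster..."
    Start-Sleep -Seconds 60
}
[/code]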

Pain Points

  • General typos, such as file paths with too many slashes and misspellings. Just be extra careful.
  • The following error was a false positive; the .pfx file did in fact contain a private key.

[code language="powershell"]
"errors": [
  {
    "code": "InvalidDocumentErrorCode",
    "message": "DeploymentDocument 'AmbariConfiguration_1_7' failed the validation. Error: 'Error while getting access to the datalake storage account rkdatalake: The private key is not present in the X.509 certificate..'"
  }
]
[/code]
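Since the error claims the private key is missing, it is worth confirming locally that the .pfx really does contain one. A quick check, assuming the same $certificateFilePath and $certificatePassword variables from the script:

[code language="powershell"]
# Load the .pfx and check for an embedded private key
$pfx = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $certificatePassword)
Write-Host "HasPrivateKey:" $pfx.HasPrivateKey  # should report True for a valid cluster identity cert
[/code]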
To resolve it, I configured the certificate with -CertificateFilePath:

[code language="powershell"]
Add-AzureRmHDInsightClusterIdentity `
-AadTenantId $tenantId `
-ObjectId $servicePrincipalobjectId `
-CertificateFilePath $certificateFilePath `
-CertificatePassword $certificatePassword
[/code]

rather than with -CertificateFileContents.

Another error was an opaque server-side 500 with no actionable detail:

[code language="powershell"]
New-AzureRmHDInsightCluster : Response status code indicates server error: 500 (InternalServerError).
At line:3 char:1
+ New-AzureRmHDInsightCluster `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : CloseError: (:) [New-AzureRmHDInsightCluster], CloudException
+ FullyQualifiedErrorId : Microsoft.Azure.Commands.HDInsight.NewAzureHDInsightClusterCommand
[/code]

Debugging Tips

-Debug switch

The -Debug switch lets you see various execution diagnostics and the specific error message. In some cases, you can see details of the HTTP request and HTTP response along with their JSON payloads.
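For example, a failing cmdlet can be re-run with -Debug to dump the underlying HTTP traffic. A sketch (setting $DebugPreference so the debug stream is shown without prompting at each message):

[code language="powershell"]
# Show debug output continuously instead of pausing for confirmation
$DebugPreference = "Continue"
# The HTTP request/response details and JSON payloads appear in the debug stream
Get-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName -Debug
[/code]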

Error Variable

The -ErrorVariable common parameter captures any errors a cmdlet raises into a named variable (specified without the $ prefix), so you can inspect them after the call instead of losing them in the console output.
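A small sketch of the -ErrorVariable pattern used in the script above, assuming the same $dataLakeStoreName variable:

[code language="powershell"]
# Capture errors into $adlsError instead of stopping the script
Test-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName `
    -ErrorVariable adlsError -ErrorAction SilentlyContinue
if ($adlsError) {
    Write-Host "ADLS check failed:" $adlsError
}
[/code]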

Other Tips

Don’t run the provisioning script immediately after deleting a cluster and seeing the deletion confirmed. Wait a couple of minutes so that all the Azure resources behind the scenes are cleaned up, even if the Azure Portal UI already shows the cluster as deleted.
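The wait can be scripted too. A sketch that deletes the cluster and pauses before any re-provisioning, assuming the same variable names as above:

[code language="powershell"]
# Delete the cluster, then give Azure a few minutes to release backing resources
Remove-AzureRmHDInsightCluster -ResourceGroupName $resourceGroupName -ClusterName $clusterName
Start-Sleep -Seconds 180
[/code]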

