Running an HDInsight cluster for big data analytics development is expensive, at around $4 an hour, and the cost adds up quickly when the cluster sits idle most of the time. To save on costs, I delete the cluster and re-create it when needed.
For simple development purposes, my HDInsight depends on
- Azure Data Lake Store as its primary data storage (i.e. HDFS compliant)
- Hive Metastore Azure SQL DB, to preserve my hive tables
Re-creating it manually in the Azure Portal is time-consuming and tedious, so I script the creation of the HDInsight cluster in PowerShell and re-point it to ADLS and the Hive metastore. The costs of these two services are comparatively very small.
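The tear-down half of this workflow is a single cmdlet. A minimal sketch, assuming the cluster and resource group names used in the provisioning script later in this post:

```powershell
# Tear down the idle cluster; ADLS and the Hive metastore SQL DB are separate
# services and survive the deletion, so the data and table definitions remain.
Remove-AzureRmHDInsightCluster -ClusterName "rkhdinsight" -ResourceGroupName "rkbigdata"
```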
I used the following online references, but neither exactly matches my configuration:
- Create HDInsight clusters with Data Lake Store as default storage by using PowerShell
- New-AzureRmHDInsightCluster
After a lot of trial and error, the following script is a working start.
The high-level logic is:
- Log in to Azure with an admin account and set the subscription
- Set up the ADLS service principal and certificate so that the HDInsight cluster has access to ADLS
- Grant the service principal permissions to the ADLS folders
- Set up the configuration for the Hive metastore DB and the HDInsight cluster identity
- Create the HDInsight cluster
PowerShell Script
# Existing ADLS
$dataLakeStoreName = "rkdatalake"
$subscriptionId = "<Sub Id>"
$resourceGroupName = "rkbigdata"
$myrootdir = "/"
$certificateFilePath = "C:\Users\Roy\Downloads\adls-cert-rkdatalake.pfx"
$certificatePassword = "<Password>"

# Existing Hive Metastore
$hivemetastoreSqlAzureServerName = "rkbigdata.database.windows.net"
$hivemetastoreDBName = "HiveMetaStore"
$hivemetastoreDBUsername = "rkim"
$hivemetastoreDBPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$hivemetastoreCredentials = New-Object System.Management.Automation.PSCredential ($hivemetastoreDBUsername, $hivemetastoreDBPassword)

# HDInsight cluster (to be created)
$clusterRootPath = $myrootdir + "clusters/rkhdinsight"
$clusterType = "Spark"
$clusterName = "rkhdinsight"
$clusterVersion = "3.5"
$location = "Central US"            # Region
$storageRootPath = $clusterRootPath # E.g. /clusters/hdiadlcluster
$clusterNodes = 3                   # The number of worker nodes in the HDInsight cluster
$adminName = "admin"
$adminPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$httpCredentials = New-Object System.Management.Automation.PSCredential ($adminName, $adminPassword)
$sshuserName = "sshuser"
$sshuserPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$sshuserCredentials = New-Object System.Management.Automation.PSCredential ($sshuserName, $sshuserPassword)
$hivemetastoreAdmin = "kimr"
$metastoreadminPassword = ConvertTo-SecureString "<Password>" -AsPlainText -Force
$metastoreadminCredentials = New-Object System.Management.Automation.PSCredential ($hivemetastoreAdmin, $metastoreadminPassword)
$errorVar = $null

# Sign in to your Azure account
Login-AzureRmAccount

# List all the subscriptions associated with your account
Get-AzureRmSubscription

# Select a subscription
Set-AzureRmContext -SubscriptionId $subscriptionId

$resourceGroup = Get-AzureRmResourceGroup -Name $resourceGroupName -Location $location
Test-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $clusterRootPath -Force

# ADLS service principal and certificate
$certificatePFX = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($certificateFilePath, $certificatePassword)

# Service principal which is set to have access to ADLS
$servicePrincipal = Get-AzureRmADServicePrincipal -SearchString rkdatalakestoretemp
$servicePrincipalobjectId = $servicePrincipal.Id
Write-Host "service principal object Id: " $servicePrincipalobjectId

# Grant the service principal permissions to the following 3 folders in ADLS
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path / -AceType User -Id $servicePrincipalobjectId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path /clusters -AceType User -Id $servicePrincipalobjectId -Permissions All
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path $clusterRootPath -AceType User -Id $servicePrincipalobjectId -Permissions All

$tenantId = (Get-AzureRmContext).Tenant.TenantId

# Set up the configuration for the existing Hive Metastore Azure SQL DB
# and the HDInsight cluster identity.
# Note: -ErrorVariable takes the variable name without the $ prefix.
$azureHDInsightConfig = New-AzureRmHDInsightClusterConfig -Debug -ErrorVariable errorVar `
        -ClusterType Spark |
    Add-AzureRmHDInsightMetastore `
        -SqlAzureServerName $hivemetastoreSqlAzureServerName `
        -DatabaseName $hivemetastoreDBName `
        -Credential $hivemetastoreCredentials `
        -MetastoreType HiveMetastore |
    Add-AzureRmHDInsightClusterIdentity `
        -AadTenantId $tenantId `
        -ObjectId $servicePrincipalobjectId `
        -CertificateFilePath $certificateFilePath `
        -CertificatePassword $certificatePassword

# Output any error
Write-Output $errorVar

# Create the new HDInsight cluster with the config object
New-AzureRmHDInsightCluster `
    -ClusterType $clusterType `
    -OSType Linux `
    -ClusterSizeInNodes $clusterNodes `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $clusterName `
    -HttpCredential $httpCredentials `
    -Location $location `
    -DefaultStorageAccountType AzureDataLakeStore `
    -DefaultStorageAccountName "$dataLakeStoreName.azuredatalakestore.net" `
    -DefaultStorageRootPath $clusterRootPath `
    -Version $clusterVersion `
    -SshCredential $sshuserCredentials `
    -AadTenantId $tenantId `
    -ObjectId $servicePrincipalobjectId `
    -Config $azureHDInsightConfig `
    -Debug -ErrorVariable errorVar

# Output any error
Write-Output $errorVar
After executing the script, go to the Azure Portal and check whether the HDInsight cluster is being created. If you see "Applying changes", provisioning is in progress. This takes about 15-20 minutes.
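Provisioning progress can also be checked from PowerShell instead of the portal. A sketch, using the same cluster name as above:

```powershell
# ClusterState moves through intermediate states (e.g. "Accepted") and
# reads "Running" once the cluster is ready to use.
(Get-AzureRmHDInsightCluster -ClusterName "rkhdinsight" -ResourceGroupName "rkbigdata").ClusterState
```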
Pain Points
- General typos, such as file paths with too many slashes and misspelled names. Just be extra careful.
- The following error was a false positive; the .pfx file did in fact contain a private key.
"errors": [
  {
    "code": "InvalidDocumentErrorCode",
    "message": "DeploymentDocument 'AmbariConfiguration_1_7' failed the validation. Error: 'Error while getting access to the datalake storage account rkdatalake: The private key is not present in the X.509 certificate..'"
  }
]
To resolve it, I passed the certificate by file path:

Add-AzureRmHDInsightClusterIdentity `
    -AadTenantId $tenantId `
    -ObjectId $servicePrincipalobjectId `
    -CertificateFilePath $certificateFilePath `
    -CertificatePassword $certificatePassword

rather than by -CertificateFileContents.
New-AzureRmHDInsightCluster : Response status code indicates server error: 500 (InternalServerError).
At line:3 char:1
+ New-AzureRmHDInsightCluster `
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : CloseError: (:) [New-AzureRmHDInsightCluster], CloudException
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.HDInsight.NewAzureHDInsightClusterCommand
Debugging Tips
-Debug switch
The -Debug switch surfaces execution diagnostics and the specific underlying error message. In some cases it shows details of the HTTP request and response, including the JSON payloads.
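-Debug is a common parameter available on every AzureRm cmdlet, so it can simply be appended to any call being diagnosed. For example:

```powershell
# Emits verbose diagnostics, including HTTP request/response details in some cases.
Get-AzureRmHDInsightCluster -ClusterName "rkhdinsight" -Debug
```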
-ErrorVariable
The -ErrorVariable common parameter captures a cmdlet's errors in a variable that you can inspect afterwards:
https://blogs.msdn.microsoft.com/powershell/2006/11/02/erroraction-and-errorvariable/
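A minimal sketch of the pattern (note that the variable name is given without the $ prefix):

```powershell
# Capture any errors into $errVar instead of stopping the script.
Test-AzureRmDataLakeStoreAccount -Name "rkdatalake" -ErrorVariable errVar -ErrorAction SilentlyContinue

# Inspect the captured errors afterwards.
if ($errVar) { Write-Output $errVar }
```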
Other Tips
Don't run the provisioning script immediately after deleting a cluster, even if the Azure Portal UI confirms the deletion. Wait a couple of minutes so that all the Azure resources behind the scenes are cleaned up.
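One way to script that wait is to poll until the old cluster no longer shows up, then pause a little longer before re-provisioning. A sketch, under the assumption that Get-AzureRmHDInsightCluster returns nothing once deletion completes:

```powershell
# Poll until the deleted cluster disappears from the subscription.
while (Get-AzureRmHDInsightCluster -ClusterName "rkhdinsight" -ResourceGroupName "rkbigdata" -ErrorAction SilentlyContinue) {
    Start-Sleep -Seconds 30
}
# Extra grace period for backend resources to clear.
Start-Sleep -Seconds 120
```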