Mahout and HDInsight (3) – Step-by-Step: Mahout with HDInsight PowerShell Style

In the blog series Mahout and HDInsight, options for using Mahout on HDInsight are explored and elaborated.

Contents

  1. What is Mahout?
  2. Step-by-Step: Mahout with HDInsight Interactive Style
  3. Step-by-Step: Mahout with HDInsight PowerShell Style

Step-by-Step: Mahout with HDInsight PowerShell Style

In this episode of the series Mahout for Dummies, we deal with Mahout on HDInsight in a PowerShell manner. Ultimately, we go through the Random Forest scenario detailed in the previous post.

  1. Upload Data
  2. Create HDInsight Cluster
  3. Mahout: General PowerShell command
  4. Scenario: Random Forest
    1. Build forest
    2. Classify test data
  5. Clean up
  6. Scenario: Recommender Job
  7. Wrapping up…

1. Upload Data

Here, we upload to Azure Blob storage all the data necessary to build a random forest model and then to test it. More specifically, both the training and the test data will be uploaded. Note that information on the storage account (e.g. container name and storage context) must already be known.

## 1. File Paths
# Data stored locally
$localTrain = "C:\<TrainingDataPath>\KDDTrain+.arff"
$localTest = "C:\<TestDataPath>\KDDTest+.arff"
# Data to be stored in Azure Blob Storage
$blobTrain = "testdata/KDDTrain+.arff"
$blobTest = "testdata/KDDTest+.arff"

## 2. Upload file from local to Azure Blob Storage
Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
    -Blob $blobTrain -Context $storageContext
Set-AzureStorageBlobContent -File $localTest -Container $containerName `
    -Blob $blobTest -Context $storageContext 

Since Mahout is not installed on any HDInsight cluster by default (and hence not supported by Microsoft), the Mahout JAR files also have to be uploaded to the blob storage.

## 1. File Paths
# Mahout jar files stored locally
$localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"
$localMahoutEx = "C:\<PathToMahoutDistribution>\mahout-examples-0.9-job.jar"
# Mahout jar files to be stored in Azure Blob Storage
$blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
$blobMahoutEx = "mahout/mahout-examples-0.9-job.jar"

## 2. Upload file from local to Azure Blob Storage
Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName `
    -Blob $blobMahoutJar -Context $storageContext
Set-AzureStorageBlobContent -File $localMahoutEx -Container $containerName `
    -Blob $blobMahoutEx -Context $storageContext 

2. Create HDInsight Cluster

We create a simple HDInsight cluster, just like in the Azure PowerShell Series: Simple HDInsight. Alternatively, you could create one with additional functionality; see Azure PowerShell Series: Custom Create HDInsight.

# Input
$clusterName = "<HDInsightClusterName>"
$clusterCreds = Get-Credential
$numNodes = 4

# Simple create
New-AzureHDInsightCluster -Name $clusterName -Subscription $subID `
    -Location $location -DefaultStorageAccountName $storageAccount `
    -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $containerName -Credential $clusterCreds `
    -ClusterSizeInNodes $numNodes -Version 2.1 

In the Azure Explorer, you can observe some libraries being uploaded, such as mapred, hive, etc.

Just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style, both the training and test data need to be located in the directory user/hdp/.

$blobHDPtrain = "user/hdp/testdata/KDDTrain+.arff"
$blobHDPtest = "user/hdp/testdata/KDDTest+.arff"
Set-AzureStorageBlobContent -File $localTrain -Container $containerName `
    -Blob $blobHDPtrain -Context $storageContext
Set-AzureStorageBlobContent -File $localTest -Container $containerName `
    -Blob $blobHDPtest -Context $storageContext 

3. Mahout: General PowerShell Command

The typical command for invoking Mahout from the Hadoop Command Line via RDP connection looks as follows:

hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar 
org.apache.mahout.classifier.df.tools.Describe 
-p wasb:///user/hdp/testdata/KDDTrain+.arff ... 

Thus, it is an ordinary command running the program contained in the specified JAR file. org.apache.mahout.classifier.df.tools.Describe is the name of the class being invoked, followed by mandatory and optional arguments. Translated into PowerShell:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile  "<PathToMahoutJAR>/mahout-core-0.9-job.jar" `
    -ClassName "<ClassName>" `
    -Arguments "-p wasb:///user/hdp/testdata/KDDTrain+.arff …" 

In the case above, this translates into the following PowerShell command:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
    -ClassName "org.apache.mahout.classifier.df.tools.Describe" `
    -Arguments "-p wasb:///user/hdp/$blobTrain -f testdata/KDDTrain+.info `
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L" 

or a little more elaborate:

$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile  "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutJar" `
    -ClassName "org.apache.mahout.classifier.df.tools.Describe"

# path to training data
$mahoutJob.Arguments.Add("-p")
$mahoutJob.Arguments.Add("wasb:///user/hdp/$blobTrain")

# path to generated descriptor file
$mahoutJob.Arguments.Add("-f")
$mahoutJob.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")

# attributes of given training data
$mahoutJob.Arguments.Add("-d")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("3")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("2")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("4")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("8")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("2")
$mahoutJob.Arguments.Add("C")
$mahoutJob.Arguments.Add("19")
$mahoutJob.Arguments.Add("N")
$mahoutJob.Arguments.Add("L") 

Note that the PowerShell cmdlets have so far only defined the job but not yet triggered it. The Hadoop job is started by the following command:

$mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
    -JobDefinition $mahoutJob -Credential $clusterCreds 

To automatically wait for the HDInsight job to complete, you can insert the following:

    Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600 

It gives the HDInsight job an hour (i.e. 3600 seconds) to complete. You can print out any error output as follows:

Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
    -JobId $mahoutJobProcessing.JobId -StandardError 

4. Scenario: Random Forest

In the previous section, we elaborated on how to construct a Mahout Job as a PowerShell command. Here, we go through an example using the Random Forest, just like in the previous post Step-by-Step: Mahout with HDInsight Interactive Style – Scenario Random Forest.

4.1. Build forest

As a reminder, the command we used to build a forest in Interactive Style is the following:

hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar 
org.apache.mahout.classifier.df.mapreduce.BuildForest 
-Dmapred.max.split.size=1874231 
-d wasb:///user/hdp/testdata/KDDTrain+.arff 
-ds wasb:///user/hdp/testdata/KDDTrain+.info 
-sl 5 -p -t 100 -o nsl-forest 

Thus, the “translated” PowerShell command is

## build forest
$mahoutForest = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.BuildForest"

# maximum data size per node
$mahoutForest.Arguments.Add("-Dmapred.max.split.size=1874231")
# data path
$mahoutForest.Arguments.Add("-d")
$mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.arff")
# dataset path
$mahoutForest.Arguments.Add("-ds")
$mahoutForest.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")
# number of variables being randomly selected at each node
$mahoutForest.Arguments.Add("-sl")
$mahoutForest.Arguments.Add("5")
# flag for partial implementation
$mahoutForest.Arguments.Add("-p")
# number of trees
$mahoutForest.Arguments.Add("-t")
$mahoutForest.Arguments.Add("100")
# output path for generated forest
$mahoutForest.Arguments.Add("-o")
$mahoutForest.Arguments.Add("nsl-forest")

# start job
$mahoutForestProcessing = Start-AzureHDInsightJob -Cluster $clusterName `
    -JobDefinition $mahoutForest -Credential $clusterCreds

# wait for job
Wait-AzureHDInsightJob -Subscription $subID -Job $mahoutForestProcessing `
    -WaitTimeoutInSeconds 3600

# print out error if any
Get-AzureHDInsightJobOutput -Cluster $clusterName -Subscription $subID `
    -JobId $mahoutForestProcessing.JobId -StandardError 

The output in PowerShell should look like this:

4.2. Classify test data

The “converted” PowerShell command of the classifying command proposed in Interactive Style is as follows:

$mahoutClassify = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb://$containerName@$storageAccount.blob.core.windows.net/$blobMahoutEx" `
    -ClassName "org.apache.mahout.classifier.df.mapreduce.TestForest"

$mahoutClassify.Arguments.Add("-i")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTest+.arff")
$mahoutClassify.Arguments.Add("-ds")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/testdata/KDDTrain+.info")
$mahoutClassify.Arguments.Add("-m")
$mahoutClassify.Arguments.Add("wasb:///user/hdp/nsl-forest")
$mahoutClassify.Arguments.Add("-a")
$mahoutClassify.Arguments.Add("-mr")
$mahoutClassify.Arguments.Add("-o")
$mahoutClassify.Arguments.Add("predictions")

$mahoutClassifyJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutClassify
Wait-AzureHDInsightJob -Job $mahoutClassifyJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutClassifyJob.JobId -StandardError 

Note that the output shown above has the same format and very similar results to those of the previous post, where the job was run in interactive style.

5. Clean up

Cleaning up involves removing the HDInsight cluster but also removing temporary directories. While the PowerShell command for deleting a single file is pretty straightforward, i.e.

Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $file 

deleting a folder structure comprises a loop in which every single file with specified file path prefix is removed.

## a. Remove temp directory
$blobPrefix = "user/hdp/temp"
$tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix $blobPrefix

Write-Host "Removing temp directory"
foreach ($item in $tempFiles){
    $tmpFile = $item.Name
    Write-Host "Deleting $tmpFile"
    Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
}


## b. Delete HDInsight cluster
Remove-AzureHDInsightCluster -Name $clusterName 

6. Scenario: Recommender Job

As we saw in the first part of our Mahout and HDInsight series, there are many algorithms included in the Mahout library other than the random forest.

In the blog by the Big Data Support team at Microsoft, there is a good post demonstrating the use of the RecommenderJob class on an HDInsight cluster using PowerShell. The source code of the RecommenderJob class can be found on GitHub.

In this scenario, we are given two data files: one containing user IDs and the other comprising the degrees of preference of users towards given items.

In ItemID.txt, the first column indicates the user ID, the second the item ID and the third the degree of preference. Thus, ItemID.txt can be expressed in the more intuitive format of a matrix, where the rows indicate the user IDs and the columns the item IDs. The values inside the matrix display the degree of preference, as given in the third column of ItemID.txt.
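As a purely hypothetical illustration (the file name is the one above, but the IDs and preference values are made up), ItemID.txt could contain triples like these, together with their matrix representation:

```text
# ItemID.txt: user-id, item-id, degree of preference (made-up values)
1,101,5.0
1,102,3.0
2,101,2.0
2,103,4.0

# The same data as a user-by-item matrix:
#        101    102    103
#  1     5.0    3.0     -
#  2     2.0     -     4.0
```

The recommender later fills in the missing cells of this matrix with predicted preference scores.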

Here is the complete PowerShell script for running RecommenderJob, as in the Big Data Support blog.

##########################################################################################
# Mahout with HDInsight: RecommenderJob (Collaborative Filtering)
#
# Check out Microsoft's Big Data Support blog
# http://blogs.msdn.com/b/bigdatasupport/archive/2014/02/19/mahout-with-hdinsight.aspx
#
# Source code in GitHub:
# https://github.com/apache/mahout/blob/trunk/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java



##########################################################################################
# 0. Azure Account Details
Add-AzureAccount
$subName = "<AzureSubscriptionName>"
Select-AzureSubscription $subName

# Azure account details automatically set
$subID = Get-AzureSubscription -Current | %{ $_.SubscriptionId } 



##########################################################################################
## 1. Input information


## a. storage account
$storageAccount = "<StorageAccountName>"
$containerName = "<StorageContainerName>"
$location = "<DatacenterLocation>" #e.g. North Europe

# if storage account not created yet
#New-AzureStorageAccount -StorageAccountName $storageAccount -Location $location
#Set-AzureStorageAccount -StorageAccountName $storageAccount -GeoReplicationEnabled $false

# Variables automatically set for you
$storageKey = Get-AzureStorageKey $storageAccount | %{ $_.Primary } 
$storageContext = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
$fullStorage = "${storageAccount}.blob.core.windows.net"

# if container not created yet
New-AzureStorageContainer -Name $containerName -Context $storageContext



## b. HDInsight Cluster
$clusterName = "<HDInsightClusterName>"
$clusterCreds = Get-Credential -Message "New admin account to be created for your HDInsight cluster"
# best: user name = admin
$numNodes = 4



## c. Data
# Data stored locally
$localFolder = "C:\<localFilesPath>"
$localItems = "$localFolder\ItemID.txt"
$localUsers = "$localFolder\users.txt"
$localMahoutJar = "C:\<PathToMahoutDistribution>\mahout-core-0.9-job.jar"

# Data to be stored in Azure Blob Storage
$blobMahoutJar = "mahout/mahout-core-0.9-job.jar"
$blobFolder = "testdata"
$blobItems = "$blobFolder/ItemID.txt"
$blobUsers = "$blobFolder/users.txt"



##########################################################################################
# 2. Upload file from local to Azure Blob Storage

# Mahout jar
Write-Host "Copying Mahout JAR into Blob Storage" -BackgroundColor Green
Set-AzureStorageBlobContent -File $localMahoutJar -Container $containerName -Blob $blobMahoutJar -Context $storageContext

# data for RecommenderJob
Write-Host "Copying necessary data into Blob Storage" -BackgroundColor Green
Set-AzureStorageBlobContent -File $localItems -Container $containerName -Blob $blobItems -Context $storageContext
Set-AzureStorageBlobContent -File $localUsers -Container $containerName -Blob $blobUsers -Context $storageContext




##########################################################################################
# 3. Create HDInsight Cluster

# Simple create
New-AzureHDInsightCluster -Name $clusterName -Subscription $subID -Location $location `
    -DefaultStorageAccountName $storageAccount -DefaultStorageAccountKey $storageKey `
    -DefaultStorageContainerName $containerName -Credential $clusterCreds -ClusterSizeInNodes $numNodes `
    -Version 2.1



##########################################################################################
# 4. Mahout


# Mahout Job defining the appropriate JAR file and the class name
$mahoutJob = New-AzureHDInsightMapReduceJobDefinition `
    -JarFile "wasb:///$blobMahoutJar" `
    -ClassName "org.apache.mahout.cf.taste.hadoop.item.RecommenderJob"

# Similarity class name.
# Alternative similarity classes: loglikelihood, tanimoto coeff, 
# city block, cosine, pearson correlation, euclidean distance
$mahoutJob.Arguments.Add("-s")
$mahoutJob.Arguments.Add("SIMILARITY_COOCCURRENCE")

# Input path to file with preference data
$mahoutJob.Arguments.Add("-i")
$mahoutJob.Arguments.Add("wasb:///$blobItems")

# path to file containing user IDs for which recommendations will be computed
$mahoutJob.Arguments.Add("--usersFile")
$mahoutJob.Arguments.Add("wasb:///$blobUsers")

# path for recommender output
$mahoutJob.Arguments.Add("--output")
$mahoutJob.Arguments.Add("wasb:///$blobFolder/output")

# Starting job
$mahoutJobProcessing = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $mahoutJob -Debug

# Waiting Job for completion
Wait-AzureHDInsightJob -Job $mahoutJobProcessing -WaitTimeoutInSeconds 3600 -Debug

# Getting error if any
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $mahoutJobProcessing.JobId -StandardError



##########################################################################################
# 5. Clean up, i.e. remove temp directory


## a. Remove temp directory
$blobPrefix = "user/hdp/temp"
$tempFiles = Get-AzureStorageBlob -Container $containerName -Context $storageContext -prefix $blobPrefix

Write-Host "Removing temp directory"
foreach ($item in $tempFiles){
    $tmpFile = $item.Name
    Write-Host "Deleting $tmpFile"
    Remove-AzureStorageBlob -Container $containerName -Context $storageContext -Blob $tmpFile
}


## b. Delete HDInsight cluster
Remove-AzureHDInsightCluster -Name $clusterName 

The output files can be seen in the Azure Blob Storage Explorer as usual.

The output file itself gives information on which items could be of interest to which users and how likely, in the format user-id [item-id:degree-of-preference,…].
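For instance, a hypothetical output line in this format (with made-up user ID, item IDs and scores) could look as follows:

```text
2   [102:4.7,104:3.9]
```

This would mean that for user 2, item 102 is recommended with an estimated preference of 4.7 and item 104 with 3.9.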

In this way, we can insert the recommendations with their scores into the matrix from above.

7. Wrapping up...

In this blog post, we went through two scenarios applying Mahout on HDInsight in PowerShell style: random forest and recommender. These scenarios are nicely wrapped around the usual suspects: uploading data, creating the HDInsight cluster and cleaning up afterwards.

Many thanks go to Alexei Khalyako and Bill Carroll for their support on Mahouting on HDInsight!
