In the blog series on Mahout and HDInsight, we now get our hands dirty. Let’s see Mahout in action on an HDInsight cluster.
- What is Mahout?
- Step-by-Step: Mahout with HDInsight Interactive Style
- Step-by-Step: Mahout with HDInsight PowerShell Style
Step-by-Step: Mahout with HDInsight Interactive Style
But before heading right into Mahout, we first need to create the HDInsight cluster.
Please note that as of now, Mahout is NOT supported by Microsoft. Mahout is not installed on any HDInsight cluster by default, but it can be installed in various ways, such as connecting to the head node via RDP or using PowerShell (see next article).
Update: Mahout is contained in the HDInsight Version 3.1 by default; more information can be found in the documentation What’s new in the Hadoop cluster Versions provided by HDInsight? Thus the second step Install Mahout on HDInsight can be skipped.
This article is a step-by-step guide on how to install and use Mahout on an HDInsight Cluster with a specific scenario involving Random Forests.
- Create HDInsight Cluster
- Install Mahout on HDInsight
- Scenario: Random Forest
- Get data
- Generate descriptor file
- Build forest
- Classify test data
- What is Happening?
- Wrapping up…
1. Create HDInsight Cluster
Prerequisite: You have already created a storage account.
Start by going to the Microsoft Azure portal and clicking New.
Pick a Name for your HDInsight cluster. Please note that we are using HDInsight Version 2.1. Set the datacenter region to the same region that your storage account is located in, in this case North Europe.
Next, configure the credentials for your HDInsight cluster:
As mentioned above, the prerequisite is that you have already created a storage account. To have a clean slate, I create a default container on which the HDInsight Cluster is based.
The process of creating an HDInsight cluster includes a few milestones, such as configuring the provisioned virtual machines. The beauty of HDInsight is that you do not need to provision so-and-so-many virtual machines and install Hadoop on them yourself: HDInsight provides this as a service and does it automatically.
Once created, let us enable a remote desktop connection:
And let’s connect!
2. Install Mahout on HDInsight
Mahout is provided by HDInsight Version 3.1 by default. As you have connected remotely to it, open the file explorer and browse to C:\apps\dist. There you can see a list of Hadoop components supported by HDInsight 3.1:
Hence, you can ignore the rest of this paragraph and skip right to 3. Scenario: Random Forest.
In case you do deal with an earlier HDInsight Version (earlier than 3.1), then follow the steps described in this paragraph.
You can find the latest release version of Mahout on http://mahout.apache.org/ that you can download locally on your computer.
In the head node of your HDInsight cluster (you connected to it via RDP at the end of 1. Create HDInsight Cluster), open the File Explorer and create a new folder on C:\, such as C:\MyFiles.
Copy the Mahout Distribution zip file into C:\MyFiles in the head node. In this case, the latest release is version 0.9.
Extract the zip-file into C:\apps\dist.
Rename the extracted folder to mahout-x.x, where x.x is the version number.
And that’s it – Mahout is installed on your HDInsight cluster!
3. Scenario: Random Forest
This step-by-step guide is based on the one documented in Mahout – Classifying with random forests, but tailored to HDInsight 2.1.
3.1. Get data
Before building a forest model, we need data to learn from as well as data to test our model on. The data sets used here can be downloaded from http://nsl.cs.unb.ca/NSL-KDD/. I am using KDDTrain+.ARFF and KDDTest+.ARFF.
Check in the downloaded files that unnecessary lines are removed. More specifically, in KDDTrain+.arff remove the first 44 lines (i.e. all lines starting with @attribute). Otherwise, we will not be able to generate a descriptor file later on in 3.2.
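If you prefer not to delete the header lines by hand, the clean-up can be scripted. The following is only a minimal sketch, assuming that every header line starts with @ (for example @relation, @attribute and @data) and that blank lines and % comments can be dropped as well. The function name strip_arff_header and the shortened sample row are made up for illustration and are not part of the data set.

```python
def strip_arff_header(lines):
    """Return only the data rows of an ARFF file.

    Assumption: header lines start with '@'; blank lines and
    '%' comment lines are dropped as well.
    """
    data_rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(("@", "%")):
            continue  # skip header, comment and empty lines
        data_rows.append(stripped)
    return data_rows


# Hypothetical, heavily shortened sample (not real NSL-KDD content).
sample = [
    "@relation 'KDDTrain+'",
    "@attribute 'duration' real",
    "@attribute 'class' {'normal', 'anomaly'}",
    "@data",
    "0,tcp,ftp_data,SF,491,normal",
]

print(strip_arff_header(sample))
```

To apply it to a real file, read the lines of the downloaded .arff file, pass them through the function and write the result back before copying the file to the head node.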
Copy these two data files into C:\MyFiles in the head node (just like with the Mahout Distribution earlier).
Ok, we have our training and test data in our HDInsight cluster, but for Mahout to do its magic on the precious data, the data needs to be in HDFS (Hadoop Distributed File System). Yes, you are right: in HDInsight, we do not use HDFS; all data is stored in the Azure Blob Storage. However, the HDFS API is still used in HDInsight. So, all we need to do is copy the local data into the Blob Storage.
There are many ways to transfer local data into the blob storage. Here, we use the Hadoop shell commands. First we create a directory called testdata. Then we copy both files into it.
hdfs dfs -mkdir testdata
hdfs dfs -copyFromLocal C:/MyFiles/KDDTrain+.arff testdata/KDDTrain+.arff
hdfs dfs -copyFromLocal C:/MyFiles/KDDTest+.arff testdata/KDDTest+.arff
Many use the shell command hadoop dfs, but this command is deprecated in favour of hdfs dfs.
Here’s a tip to avoid typing in the whole path: Copy path in the file explorer.
To see what is stored in all my storage accounts, I often use Cerebrata’s Azure Explorer, but there are many other storage explorers that we recommend, all listed here.
To double check, we see that the copied data files are now located in user/olivia/testdata/. Note that olivia is the user I configured as the remote user, who can hence connect remotely to the head node of my HDInsight cluster.
The way Mahout is compiled, the data needs to be in user/hdp/ though. In the Hadoop command line we’ll type in:
hdfs dfs -cp wasb://email@example.com/user/olivia/testdata/KDDTrain+.arff wasb://firstname.lastname@example.org/user/hdp/testdata/KDDTrain+.arff
hdfs dfs -cp wasb://email@example.com/user/olivia/testdata/KDDTest+.arff wasb://firstname.lastname@example.org/user/hdp/testdata/KDDTest+.arff
More generally, just replace the variables in <> with the names you have chosen accordingly (i.e. container, storage account and remote user):
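hdfs dfs -cp wasb://<container>@<storageaccount>.blob.core.windows.net/user/<remoteuser>/testdata/KDDTrain+.arff wasb://<container>@<storageaccount>.blob.core.windows.net/user/hdp/testdata/KDDTrain+.arff
hdfs dfs -cp wasb://<container>@<storageaccount>.blob.core.windows.net/user/<remoteuser>/testdata/KDDTest+.arff wasb://<container>@<storageaccount>.blob.core.windows.net/user/hdp/testdata/KDDTest+.arff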
Checking in Azure Explorer, the two data files can be found under user/hdp/testdata as desired.
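Since the copy commands follow the fixed pattern wasb://<container>@<storageaccount>.blob.core.windows.net/<path>, they can also be generated rather than typed. The sketch below only assembles the command strings; it does not run them. The helper names wasb_uri and copy_command, as well as the values mycontainer and mystorage, are hypothetical placeholders for your own names.

```python
def wasb_uri(container, storage_account, path):
    """Build a wasb:// URI for a blob path inside a container."""
    return "wasb://{}@{}.blob.core.windows.net/{}".format(
        container, storage_account, path.lstrip("/"))


def copy_command(container, storage_account, remote_user, relative_path):
    """Build the hdfs command that copies a file from user/<remoteuser>/ to user/hdp/."""
    source = wasb_uri(container, storage_account,
                      "user/{}/{}".format(remote_user, relative_path))
    target = wasb_uri(container, storage_account,
                      "user/hdp/{}".format(relative_path))
    return "hdfs dfs -cp {} {}".format(source, target)


# Placeholder names; replace them with your container, storage account and user.
for name in ("KDDTrain+.arff", "KDDTest+.arff"):
    print(copy_command("mycontainer", "mystorage", "olivia", "testdata/" + name))
```

The printed lines can then be pasted into the Hadoop command line on the head node.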
3.2. Generate descriptor file
Before building a random forest model based on the training data in KDDTrain+.arff, a descriptor file is essential. Why? When building the model, every attribute in the training data needs to be labelled so that the algorithm knows which one is numerical, which one is categorical and which one is the label.
The command is as follows:
hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p wasb:///user/hdp/testdata/KDDTrain+.arff -f testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Here, the main class of org.apache.mahout.classifier.df.tools.Describe is invoked; for more information on the source code, check out Mahout’s GitHub site on Describe.java. It takes three mandatory arguments, namely p (short for path), f (short for file) and d (short for descriptor). Optional arguments are h (help) and r (regression). The p argument specifies the path where the data to be described is located, f defines the location of the generated descriptor file and d provides information on all attributes of the given data, where N=numerical, C=categorical and L=label. A number in front of a letter repeats that attribute type. More specifically, "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" means that the given data set starts off with one numerical attribute (N), followed by 3 categorical attributes (3 C), then 2 numerical attributes (2 N), etc., and ends with the label (L).
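To check a descriptor before running the command, it can help to expand it into one entry per attribute. The following is a small sketch of the reading described above (a number repeats the attribute type that follows it); it is not Mahout's own parser, and the function name expand_descriptor is made up for illustration.

```python
from collections import Counter


def expand_descriptor(descriptor):
    """Expand a descriptor such as 'N 3 C L' into ['N', 'C', 'C', 'C', 'L']."""
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            repeat = int(token)       # the next type is repeated this often
        else:
            expanded.extend([token] * repeat)
            repeat = 1                # a type without a number counts once
    return expanded


attributes = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(len(attributes))      # 42 entries: 41 attributes plus the label
print(Counter(attributes))  # how many numerical, categorical and label entries
```

The total of 42 entries should match the number of columns in each row of KDDTrain+.arff; if it does not, the descriptor and the data are out of step.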
Update: Note that if you use an HDInsight cluster based on version 3.1, change the path specifying the mahout jar file to
Since the descriptor file also needs to be in the directory user/hdp/, you can either copy the generated descriptor file into user/hdp/ afterwards, or set the parameter f to wasb:///user/hdp/testdata/KDDTrain+.info and generate the descriptor file in user/hdp/ straight away:
hadoop jar C:\apps\dist\mahout-0.9\mahout-core-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p wasb:///user/hdp/testdata/KDDTrain+.arff -f wasb:///user/hdp/testdata/KDDTrain+.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
Update: If you use an HDInsight cluster of version 3.1, use the following path specifying the Mahout jar:
Checking in Azure Explorer, we see KDDTrain+.info in the directory user/hdp/testdata:
3.3. Build forest
Now we can finally build the random forest using the following command in the Hadoop command line:
hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.BuildForest -Dmapred.max.split.size=1874231 -d wasb:///user/hdp/testdata/KDDTrain+.arff -ds wasb:///user/hdp/testdata/KDDTrain+.info -sl 5 -p -t 100 -o nsl-forest
You can look into the source code of the class used here (BuildForest) on GitHub. The mandatory arguments of the main class in BuildForest are d (data), ds (dataset), t (nbtrees) and o (output). A more comprehensive list of arguments is the following:
- data (d): Data path (Mandatory)
- dataset (ds): Dataset path (Mandatory)
- selection (sl): #Variables to select randomly at each tree node (Optional)
- Classification: Default = square root of #explanatory vars.
- Regression: Default = 1/3 of #explanatory vars.
- no-complete (nc): Tree is not complete (Optional)
- minsplit (ms): Tree node is not divided, if branching data size < given value (Optional). Default: 2.
- minprop (mp): Tree node is not divided, if Proportion of the variance of branching data < given value (Optional)
- Used for Regression. Default: 0.001.
- seed (sd): Seed value used to initialise the random number generator (Optional)
- partial (p): Use partial data implementation (Optional)
- nbtrees (t): #Trees to grow (Mandatory)
- output (o): Output path that will contain the decision forest (Mandatory)
- help (h): Help (Optional)
In other words, a random forest model with 100 trees is computed on the basis of the data provided in KDDTrain+.arff, with additional description information in KDDTrain+.info, and saved in the directory nsl-forest/. The computation uses the partial implementation (-p) and splits the dataset at each tree node by randomly selecting 5 attributes (-sl), whilst limiting each input partition to at most 1,874,231 bytes (-Dmapred.max.split.size). This maximum partition size is 1/10 of the size of the dataset in this case; thus, 10 partitions are used.
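The value passed to -Dmapred.max.split.size is a size in bytes, so it can be derived from the file size and the number of partitions you want. The sketch below shows the arithmetic only; the file size of 18,742,306 bytes is an assumed value chosen for illustration, so check the actual size of your own copy of KDDTrain+.arff. The function names are hypothetical.

```python
def max_split_size(file_size_bytes, partitions):
    """Smallest split size (in bytes) that yields at most `partitions` splits."""
    return -(-file_size_bytes // partitions)  # integer ceiling division


def partition_count(file_size_bytes, split_size_bytes):
    """Number of splits needed to cover a file of the given size."""
    return -(-file_size_bytes // split_size_bytes)


assumed_file_size = 18_742_306  # assumption, not a measured value

split = max_split_size(assumed_file_size, 10)
print(split)                                       # 1874231
print(partition_count(assumed_file_size, split))   # 10
```

Fewer, larger partitions give each set of trees more data to learn from, while more partitions spread the work across more map tasks.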
Update: As mentioned in 3.2 Generate Descriptor File, if you use an HDInsight cluster of version 3.1, use the following path for the mahout example jar:
The result in the Hadoop command line will look like this:
At the end, we can see how long it took to build the forest and also obtain further information on the forest, such as the number of nodes or the mean maximum depth of the trees.
To use the generated forest for classifying unknown test data, we copy it via hdfs commands into user/hdp/ as follows:
hdfs dfs -cp wasb://email@example.com/user/olivia/nsl-forest wasb://firstname.lastname@example.org/user/hdp/nsl-forest
or more generally speaking:
hdfs dfs -cp wasb://<container>@<storageaccount>.blob.core.windows.net/user/<remoteuser>/nsl-forest wasb://<container>@<storageaccount>.blob.core.windows.net/user/hdp/nsl-forest
In Azure Explorer you can see that the generated forest model (forest.seq) is stored in both user/olivia/ and user/hdp/.
3.4. Classify test data
We have generated a forest model in the step before in order to automatically classify new incoming data, i.e. KDDTest+.arff. The command we use here is
hadoop jar C:\apps\dist\mahout-0.9\mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.mapreduce.TestForest -i wasb:///user/hdp/testdata/KDDTest+.arff -ds wasb:///user/hdp/testdata/KDDTrain+.info -m wasb:///user/hdp/nsl-forest -a -mr -o predictions
Update: If you use a cluster of HDInsight 3.1, use the following path name for the mahout examples jar file:
As usual more information can be found on Mahout’s GitHub site, more concretely in TestForest.java. What do the arguments mean? The mandatory arguments are input (-i) for the test data location, dataset (-ds) for the descriptor file location, model (-m) for the forest model location and output (-o) for the output location; optional boolean arguments are analyze (-a) for analysing the classification results, i.e. computing the confusion matrix, and mapreduce (-mr) to use Hadoop to distribute classification.
In this case, predictions are computed for the new test data located in wasb:///user/hdp/testdata/KDDTest+.arff with its associated descriptor file in wasb:///user/hdp/testdata/KDDTrain+.info, using the previously built random forest in wasb:///user/hdp/nsl-forest; the output predictions are then stored in a text file in the directory predictions/. Additionally, a confusion matrix is computed (as you can see below in the Hadoop command line) and classification is distributed using Hadoop.
The predictions are stored in user/olivia/predictions:
3.5. Woah, what is happening?
Ok, so what just happened? Let’s first have a look at the summary in the Hadoop command line and another closer look at the output file containing the predictions.
The test data in KDDTest+.arff contained 22,544 instances that were classified in 3.4. Classify test data, which you can also see in the first section, Summary, under Total Classified Instances. Of these, 17,783 instances (about 79%) were correctly classified, whereas 4,761 (about 21%) were incorrectly classified.
More details are provided in the confusion matrix, in nicer view:
In other words, 9,458 normal instances were correctly classified but 253 normal instances were incorrectly classified as anomaly, adding up to 9,711 actual normal instances. There are 17,783 correctly classified instances (= 9,458 + 8,325, i.e. normal-normal + anomaly-anomaly) compared to 4,761 (= 4,508 + 253) incorrectly classified instances.
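The figures in the summary can be recomputed from the four cells of the confusion matrix. The following sketch uses the counts quoted above; assigning 4,508 to "anomaly classified as normal" is inferred from the totals in the text, and the variable names are chosen for illustration only.

```python
# Keys are (actual category, predicted category).
confusion = {
    ("normal", "normal"): 9458,
    ("normal", "anomaly"): 253,
    ("anomaly", "normal"): 4508,
    ("anomaly", "anomaly"): 8325,
}

total = sum(confusion.values())
correct = sum(count for (actual, predicted), count in confusion.items()
              if actual == predicted)
incorrect = total - correct
accuracy = correct / total

print(total, correct, incorrect)   # 22544 17783 4761
print(round(100 * accuracy, 2))    # 78.88

# Share of each actual category that was classified correctly.
for category in ("normal", "anomaly"):
    actual_total = sum(count for (actual, _), count in confusion.items()
                       if actual == category)
    print(category, actual_total, confusion[(category, category)] / actual_total)
```

The per-category share makes visible what the overall accuracy hides: almost all normal instances are recognised, whereas a considerable part of the anomalies is classified as normal.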
The remaining statistical measures (Kappa and reliability) indicate the degree of overall consistency of the classification, more specifically the agreement between raters, here between the predicted and the actual categories.
And finally, what about the predictions that have been saved as an output of classifying the test data?
After some converting, you obtain a list of numbers of type double. What each double number indicates is the predicted category of each data instance, where 1.0 denotes the category anomaly.
4. Wrapping up…
We have created an HDInsight cluster such that the Mahout library (in this case version 0.9) could subsequently be installed. The Mahout library is very extensive and can be explored in its full glory on its GitHub site. Here, we went through a scenario using one of many Machine Learning religions, namely the Random Forest, based on the random forest tutorial on the Mahout site but tailored to HDInsight.
Update: There is an extensive guide on how to use Mahout on HDInsight to generate movie recommendations, found here on the Azure documentation. Highly recommended!
In the next Mahout article, we will explore the use of Mahout through the awesomeness of PowerShell.