Mahout and HDInsight (1) – What is Mahout?

Contents

  1. What is Mahout?
  2. Step-by-Step: Mahout with HDInsight Interactive Style
  3. Step-by-Step: Mahout with HDInsight PowerShell Style

What is Mahout?

Apache Mahout is one of many Hadoop-related projects at Apache. Its mission is to build a scalable machine learning and data mining library. In other words, Mahout provides data science tools for detecting meaningful patterns in data sets that are stored in HDFS (Hadoop Distributed File System). It is implemented on top of Hadoop and, as of version 0.9, is based on the well-known MapReduce paradigm.

Why the word Mahout? Traditionally, a mahout is an elephant rider; the word has its origins in the Hindi language. A mahout starts early on, as a boy, when he is assigned an elephant. Since Hadoop's logo is an elephant, the name suits a library that rides on top of Hadoop.

Well, back to the machine learning library, which contains numerous algorithms. Mahout is built on three “C-pillars” of machine learning:

  • Collaborative filtering (aka recommendation),
  • Clustering, and
  • Classification.

Collaborative Filtering (aka Recommendation)

You are looking at a product on Amazon, and there it is: a list of items recommended to you based on what other users also considered buying when looking at “your” product. Such recommender engines (also found in Netflix, Spotify, etc.) comprise all kinds of collaborative filtering algorithms. User behaviour is mined for patterns, which are then used to make recommendations to other users with similar likes and dislikes.
(Picture credits to Customers Who Bought This Item Also Bought, PaulsHealthBlog.com, 11.04.2014)
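
To make the "also bought" idea more concrete, here is a minimal sketch in plain Python. It simply counts how often two items appear in the same shopping basket and recommends the most frequent companions. The basket data and the function names `co_occurrence` and `also_bought` are made up for illustration; this is not Mahout's API, and real recommenders are considerably more refined.

```python
from collections import Counter
from itertools import permutations


def co_occurrence(baskets):
    """Count how often each ordered pair of distinct items shares a basket."""
    counts = {}
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts


def also_bought(item, baskets, top_n=3):
    """Return the items seen most often together with `item`."""
    counts = co_occurrence(baskets)
    # Sort by descending count, then alphabetically to break ties.
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical purchase histories, one list per customer.
baskets = [
    ["tent", "sleeping bag", "stove"],
    ["tent", "sleeping bag"],
    ["tent", "lantern"],
    ["stove", "lantern"],
]

print(also_bought("tent", baskets))  # ['sleeping bag', 'lantern', 'stove']
```

The sleeping bag comes first because it shares two baskets with the tent, while the lantern and the stove share only one each.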

Clustering

This family of machine learning algorithms groups data units into natural clusters based on the characteristics they share. For instance, you might cluster customers into groups according to demographic information, say, without labelling these groups yet. Or we naturally group most food into sweet or salty things.
(Picture credits to Microsoft Clustering Algorithm, 11.04.2014)
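
As a rough illustration of the customer example, the sketch below runs k-means, one common clustering algorithm, over a handful of customer ages. The ages, the starting centers and the function name `k_means` are assumptions chosen for this example; the code is a toy, not a scalable implementation.

```python
def k_means(points, centers, iterations=10):
    """Tiny one-dimensional k-means: assign points, then move the centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to the nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer ages and hand-picked starting centers.
ages = [19, 21, 23, 44, 46, 48, 71, 73]
centers, clusters = k_means(ages, centers=[20, 45, 70])

print(centers)   # [21.0, 46.0, 72.0]
print(clusters)  # [[19, 21, 23], [44, 46, 48], [71, 73]]
```

Note that the three groups come out without names. Whether to call them "students", "mid-career" and "retirees" is decided by a human afterwards.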

Classification

Given a dataset that we can learn from to build a data model, we can then classify new, unknown data items. For instance, eye colour is genetically influenced by more than one gene. By learning which genes result in blue eyes, we can predict the eye colour of other people based on their genetic information.
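
The eye colour example can be sketched with a tiny nearest-neighbour classifier. Everything here is invented for illustration: the three yes/no "gene markers", the training rows and the function names `hamming` and `classify` have nothing to do with real genetics or with Mahout's own classifiers.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions in which two marker tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(sample, training, k=3):
    """Predict a label by majority vote among the k most similar training rows."""
    nearest = sorted(training, key=lambda row: hamming(sample, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: (marker tuple, known eye colour).
training = [
    ((1, 1, 0), "blue"),
    ((1, 1, 1), "blue"),
    ((1, 0, 0), "blue"),
    ((0, 0, 1), "not blue"),
    ((0, 1, 1), "not blue"),
    ((0, 0, 0), "not blue"),
]

print(classify((1, 0, 1), training))  # blue
print(classify((0, 1, 0), training))  # not blue
```

The important part is that the labels "blue" and "not blue" are known before any learning happens.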

What is the difference between clustering and classification? In classification you are given the categories in advance, whereas clustering groups naturally similar items without any predefined categories. In other words, in the example of the blue eyes, we know from the beginning what we are looking for: blue eyes or no blue eyes. With clustering, the labels for the groups still have to be established after the groups have been found.

How does Mahout work?

Mahout provides implementations of various ML algorithms; the full list can be found in the list of algorithms on the project's site. Each of them can be invoked from the command line. How this is done with HDInsight and PowerShell will be shown in the upcoming blog entries: Step-by-Step: Mahout with HDInsight Interactive Style and Step-by-Step: Mahout with HDInsight PowerShell Style.
