Machine Learning (6) - Binary Classification: Flight Delays, Surviving the Titanic and Targeted Marketing


This is part of the Machine Learning series.

Binary classification is the simplest kind of machine learning problem. The goal of binary classification is to categorise data points into one of two buckets: 0 or 1, true or false, to survive or not to survive, blue or no blue eyes, etc.

In this blog post, I will go through three example scenarios of binary classification:

  1. Surviving the Titanic if you were a passenger: either you have survived or not survived the sinking of the Titanic (assuming you were on the Titanic).
  2. Flight delays - admittedly, flight delays would qualify as a regression problem: there are no finitely many buckets you want to categorise some flight delay into. In this scenario, however, I have transformed the regression problem into a classification problem, since I only want to know if a flight will have a delay of more than 15 minutes or less.
  3. Targeted marketing - categorising customers into likely bike buyers or no buyers based on demographic background information. In this last scenario, I will go through the steps of a typical machine learning process: get data - clean data - build model - evaluate model - publish as web service - use web service (see diagram below). I highly recommend going through the very thorough step-by-step guide, especially if you are new to machine learning as well as Azure Machine Learning.

Surviving the Titanic

Would you have survived if you were one of the many passengers on the Titanic? I highly encourage you to try out yourself: the experiment along with some documentation can be found in the AzureML Gallery. The dataset comes from Kaggle - a platform for data scientists worldwide to compete against each other on provided datasets. The Titanic experiment is a nice entry point into the world of machine learning since it's binary classification and fun.

Some more information can be found in my blog post Surviving the Titanic - Machine Learning in Azure or this TechTalk.

Back to top

Flight Delays

As mentioned above, I have transformed a typical regression problem of flight delays into a binary classification: predicting a flight delay of more or less than 15 minutes. It is heavily based on the binary classification - flight delay prediction experiment from the AzureML Gallery, and was the main demo in my Microsoft Virtual Academy course - for more information see here.

Back to top

Targeted Marketing

Who doesn't know this situation: You get loads of newsletters that you sometimes might have a look at - but let's face it: most end up in trash or the spam folder. From the other point of view, that is the scenario every marketing person wants to avoid. As a marketing guru, you want to send out more personalised newsletters that will be read and ideally clicked on as well.

In this scenario, we make use of the AdventureWorks database: We have a wide customer base for our bike shop and want to classify our customers as likely bike buyers or not, and thus run a more targeted marketing campaign.

In this blog post, I have gone through the steps on setting up the AdventureWorks Warehouse database in an Azure SQL Database. This is a prerequisite for running the following ML experiment.

  1. Prequisites
  2. Get Data
  3. Clean Data
  4. Build Model
  5. Evaluate Model
  6. Publish as Web Service
  7. Use Web Service

Back to top

1. Prerequisites

  • AdventureWorks Warehouse set up in an Azure SQL Database (see this blog post for a step-by-step guide)
  • SQL Server Management Studio
  • Excel
  • Live ID, e.g. @outlook.com, @yahoo.com, etc.
  • Azure subscription or free trial to set up the Azure SQL Database

2. Get Data

The database of interest to us is [dbo].[vTargetMail] contained in the AdventureWorks database. Running SELECT TOP 1000 * FROM [dbo].[vTargetMail] gives us some information on vTargetMail:

We are particularly interested in the last column BikeBuyer indicating if a customer ended up buying a bike or not (hence binary classification).

Moving on to the ML Studio and having already created a new experiment, we now drag the Reader module into the canvas of ML studio to read in the data directly from the view [dbo].[vTargetMail] contained in the Azure SQL Database AdventureWorksDW2014 (see here for how to set it up).

On the right hand side (in the Properties pane) we specify the connection string as well as the credentials to read from the Azure SQL Database called AdventureWorksDW2014. In the database query we specify which columns are of interest to us when building the ML model, thus dropping columns, such as Title, FirstName etc.:

SELECT [CustomerKey]
      ,[GeographyKey]
      ,[CustomerAlternateKey]
      ,[MaritalStatus]
      ,[Gender]
      ,cast ([YearlyIncome] as int) as SalaryYear
      ,[TotalChildren]
      ,[NumberChildrenAtHome]
      ,[EnglishEducation]
      ,[EnglishOccupation]
      ,[HouseOwnerFlag]
      ,[NumberCarsOwned]
      ,[CommuteDistance]
      ,[Region]
      ,[Age]
      ,[BikeBuyer]
  FROM AdventureWorksDW2014.[dbo].[vTargetMail]

Back to Targeted Marketing Overview

3. Clean Data

Now the part comes that usually is the most time consuming one: cleaning the data. It often takes up 80% of the time of the data scientists: figuring out what of your data is actually relevant. For the sake of focussing on merely building the ML model, cleaning the data here is kept very trivial: we simply drop some more columns within the ML studio, which is the use of the Project Columns module:

By clicking on Launch column selector in the Properties pane (on the right hand side), you can specify which columns you wish to drop or select (blacklist vs. whitelist). In this case, we only want to exclude two columns; thus, we start with all columns and exclude the particular two columns (that conveniently come up as a dropdown menu): CustomerAlternateKey and GeographyKey.

Note that there are plenty more modules for cleaning data within AzureML, all listed in the catalogue pane on the left hand side under Data Transformation.

Back to Targeted Marketing Overview

4. Build Model

Once the data has been cleaned (although in this case admittedly a very trivial step), the model can be built based on the given data. Thinking a few steps ahead, how can we tell if a model is well performing or not? Hence, we split the data in 80-20: we use 80% to train a machine learning model but reserve the remaining 20% for testing the model:

Since we are dealing with a binary classification problem, the machine learning algorithms of interest to us are the "Two-Class ..." algorithms marked in red. Here, we choose the Two-Class Boosted Decision Tree (in green):

Consider these algorithms as some empty templates that are useless without any data to be trained on. Hence, the module Train Model explicitly specifies which ML algorithm to use (1st input) and which data to train the algorithm on (2nd input):

Notice the red exclamation mark in the "Train Model" module: We need to specify which column the model is supposed to "predict". Click on Launch Column Selector in the Properties pane:

Similar to the Column Selector in the Project Columns module, we can choose a column name from a dropdown menu. In our case, it is BikeBuyer (remember, we want to classify customers based on their demographic information if he/she is likely to buy a bike or not):

Now that the model has been trained on specific data (i.e. [dbo].[vTargetMail]), we can test how well the model is doing on test data, i.e. the remaining 20% that we have put aside when splitting the data. For this purpose, you use the module called Score Model - another way of saying "Apply trained model on data...".

Now we come to one of the killer features of AzureML: You can easily compare different machine learning algorithms with each other. So we just copy paste the modules Train Model and Score Model within the ML studio: Select the modules "Train Model" and "Score Model"...

...right-click on them to copy...

...paste these two modules anywhere within the ML Studio...

...and drag in another Two-Class classification algorithm: Two-Class Bayes Point Machine:

Lines need to be dragged the same way as with training and scoring the boosted decision tree on the left hand side:

Finally, the module Evaluate Model allows one to quickly compare the two trained models using various evaluation metrics, such as the ROC-curve, the ration between precision and recall, and the lift-curve - more in the next section:

Run the experiment!

Back to Targeted Marketing Overview

5. Evaluate Model

You can see that the experiment has finished running in the top right corner and the green ticks in each module:

When clicking on the little circle in the Evaluate Model module, you can visualise the evaluation metrics of the two ML trained models as well as compare them with each other:

The metrics provided are the ROC-curve, the precision-recall diagram and the lift curve:

The four values in the blue box represent the confusion matrix. It is a 2x2-matrix since it is a binary classification problem; it contains the absolute number of true positives (i.e. customer has been correctly predicted to be a bike buyer), true negatives (i.e. customer has been correctly predicted to be a no-buyer), false positives (i.e. customer has falsely been predicted to be a bike buyer) and false negatives. The values in the green box a typical evaluation measure: accuracy, precision, recall and F1.

Just by looking at the ROC-curve (as well as looking at the other two diagrams), the first model outperforms the second one. Let's assume we have found our best performing machine learning model for our given problem.

Let's see what the better performing model has actually predicted. Click on the button of the left hand side Score Model and then click on Visualize: (Note you can also save the output as a dataset)

What the module Score Model has done is attach two more columns to the test dataset (i.e. the reserved 20% of the input dataset): Scored Labels and Scored Probabilities.

The calculated probability indicates the the likelihood that a given customer is a bike buyer. For instance, the very first customer is predicted to be a bike buyer with a probability of 78%, whereas the third customer has a chance of being a bike buyer of merly 2,4% - making him a "no buyer". The threshold for the probability is set at 50% by default but can also be adjusted.

Comparing the columns BikeBuyer (so-called The Truth" and Scored Labels (the predictions), one can see how well a model has performed.

Back to Targeted Marketing Overview

6. Publish as Web Service

Save the say best performing model by clicking on button at Train Model and click on Save as Trained Model:

Give the trained model a name, e.g. Targeted Marketing Model:

Save the current experiment as a new experiment, since we are now undergoing preparations for a web service that will be called as we gather information about new customers who we want to classify as a likely buyer or not:

In the web service experiment, there is no need for the "loser model" anymore - we will only require the model we have saved as a trained model. Thus, delete the modules in the canvas associated with the "loser model":

Since we have saved the trained Two-Class Boosted Decision Tree, we can replace the two modules "Two-Class Boosted Decision Tree" and "Train Model" with the trained model Targeted Marketing Model. It is listed in the catalogue pane (on the left hand side) under Trained Models:

Here, we also delete the Split module since we are now only applying the trained module - no need to train and test it in the web service.

Let's take a step back: We want to have a web service that takes in all the demographic information that we can get on a new customer. What the web service is then supposed to give out is a prediction if we are dealing with a bike buyer or not. Hence, the BikeBuyer column needs to be excluded in the input of the web service. We accomplish this by clicking on the Project Columns module, launching the column selector and excluding the column BikeBuyer:

Now for the output of our web service, we only want to know the predictions along with the probabilities. Hence, add the module Project Columns...

...and only include Scored Labels and Scored Probabilities:

The experiment is almost finished. Only things to specify are the input and output of the web service. Expand Web Service in the catalogue pane (on the left hand side) and drag in Input and Output as follows:

It is finished - time to run the experiment and then deploy the web service:

Back to Targeted Marketing Overview

7. Use Web Service

The result is a web service with its associated API key. We will go through the three (or rather four) options of using the web service: 1) Manual test, 2) downloading the Excel workbook, and 3) request/response help page.

7.1. Test

Test your web service by typing in the values for all criteria manually:

7.2. Excel workbook

The advantage of the Excel workbook is, you can simply copy paste multiple data rows from the AdventureWorks database (no need for manual typing). When opening the workbook, first enable the content:

Back in the SQL Server Management Studio with the AdventureWorks database connected, run the same query as when reading the data in ML studio. The only difference is: exclude the columns [CustomerAlternateKey], [GeographyKey] and [BikeBuyer]. Also I selected the first 10 customers with a customer key higher than some random set number:

Here the script:

  SELECT Top 10 [CustomerKey]
        ,[MaritalStatus]
        ,[Gender]
        ,cast ([YearlyIncome] as int) as SalaryYear
        ,[TotalChildren]
        ,[NumberChildrenAtHome]
        ,[EnglishEducation]
        ,[EnglishOccupation]
        ,[HouseOwnerFlag]
        ,[NumberCarsOwned]
        ,[CommuteDistance]
        ,[Region]
        ,[Age]
        ,[BikeBuyer]
    FROM AdventureWorksDW2014.[dbo].[vTargetMail]
    WHERE CustomerKey > 11110

These are the first 10 customers then:

Copy all entries except for the last column (this is the column we aim to predict with the web service) into the Excel workbook. Give it a few minutes while the last two columns (Predicted Values in green) are being calculated by calling the web service:

It turns out that our web service has incorrectly predicted bike buyers in two cases (marked in red). In fact, these have been identified as bike buyers with a probability of less than 90% - maybe the threshold for classifying a customer as a buyer needs to be raised from 50% to 90%?

7.3. Request/Response API Documentation

And the final option (or more precisely the final two options) are documentation pages for request/response or batch execution manners. Hence, when you want to integrate your machine learning models in productive code, e.g. apps, dashboards, etc., these are the pages to refer to.

When scrolling down to the very end, you will also get sample code in C#, Python and R, that you can just paste into your application code.

Back to Targeted Marketing Overview

Back to top

Further Resources

Free online course in the Microsoft Virtual Academy: Building Recommendation Systems in Azure

2 | Targeted Marketing (Binary Classification)Olivia Klose

comments powered by Disqus