Follow the Data

A data driven blog

Archive for the category “Tutorial”

Practical advice for machine learning: bias, variance and what to do next

The online machine learning course given by Andrew Ng in 2011 (available here among many other places, including YouTube) is highly recommended in its entirety, but I just wanted to highlight a specific part of it, namely the “Practical advice part”, which touches on things that are not always included in machine learning and data mining courses, like “Deciding what do to do next” (the title of this lecture) or “debugging a learning algorithm” (the title of the first slide in that talk).

His advice here focuses on the concepts of the bias and variance  in statistical learning. I had been vaguely aware of the concepts of “bias and variance tradeoff” and “bias/variance decomposition” for a long time, but I had always viewed those as theoretical concepts that were mostly helpful for thinking about the properties of learning algorithms; I hadn’t thought that much about connecting them to the concrete tasks of model development.

As Andrew Ng explains, bias relates to the ability of your model function to approximate the data, and so high bias is related to under-fitting. For example, a linear regression model would have high bias when trying to model a quadratic relationship – no matter how you set the parameters, you can’t get a good training set error.

Variance on the other hand is about the stability of your model in response to new training examples. An algorithm like K-nearest neighbours (K-NN) has low bias (because it doesn’t really assume anything special about the distribution of the data points) but high variance, because it can easily change its prediction in response to the composition of the training set. K-NN can fit the training data very well if K is chosen small enough (in the extreme case with K=1 the fit will be perfect) but may not generalize well to new examples. So in short, high variance is related to over-fitting.

There is usually a tradeoff between bias and variance, and many learning algorithms have a built-in way to control this tradeoff, like for instance a regularization parameter that penalizes complex models in many linear modelling type approaches, or indeed the K value in K-NN. A lot more about the bias-variance tradeoff can be found in this Andrew Ng lecture.

Now, based on these concepts, Ng goes on to suggest some ways to modify your model when you discover it has a high error on a test set. Specifically, when should you:

- Get more training examples?

(Answer: When you have high variance. More training examples will not fix a high bias, because your underlying model will still not be able to approximate the correct function.)

- Try smaller sets of features?

(Answer: When you have higher variance. Ng says, if you think you have high bias, “for goodness’ sake don’t waste your time by trying to carefully select the best features”)

- Try to obtain new features?

(Answer: Usually works well when you suffer from high bias.)

Now you might wonder how you know that you have either high bias or high variance. This is where you can try to plot learning curves for your problem. You plot the error on the training set and on the cross-validation set as functions of the number of training examples for some set of training set sizes. (This of course requires you to randomly select examples from your training set, train models on them and assess the performance for each subset.)

In the typical high bias case, the cross-validation error will initially go down and then plateau as the number of training examples grow. (With high bias, more data doesn’t help beyond a certain point.) The training error will initially go up and then plateau at approximately the level of the cross-validation error (usually a fairly high level of error). So if you have similar cross-validation and training errors for a range of training set sizes, you may have a high-bias model and should look into generating new features or changing the model structure in some other way.

In the typical high variance case, the training error will increase somewhat with the number of training examples, but usually to a lower level than in the high-bias case. (The classifier is now more flexible and can fit the training data more easily, but will still suffer somewhat from having to adapt to many data points.) The cross-validation error will again start high and decrease with the number of training examples to a lower but still fairly high level. So the crucial diagnostic for the high variance case, says Ng, is that the difference between the cross-validation error and the training set error is high. In this case, you may want to try to obtain more data, or if that isn’t possible, decrease the number of features.

To summarize (using pictures from this PDF):

- Learning curves can tell you whether you appear to suffer from high bias or high variance.

- You can base your next step on what you found using the learning curves:

I think it’s nice to have this kind of rules of thumb when you get stuck, and I hope to follow up this post pretty soon with another one that deals with a relatively recent paper which suggests some neat ways to investigate a classification problem using sets of classfication models.

A little tutorial on mapreduce.

This is a short tutorial to explain the concept of map/reduce. This tutorial can be executed on a Unix system, like Linux or OS X. We’ll first process the data sequentially and then with parallel mapper tasks. As a simple example we will try to compile a list of prime numbers from some text files containing numbers (some prime, some not) and then calculate the sum of all the primes found. Finding primes can be parallelized and is thus on the map side of the algorithm but calculating the sum cannot and is therefore our reduce function. Let’s first start out with creating some test data that is easy to debug, and small, so it’ll run fast. We’ll do this in a terminal shell using ruby. The -e options tells ruby to evaluate the string, and the “>” redirects the output to the filename after.

$ruby -e "(1..10).each { |x| puts x }" > data_1..10.txt

We can look at the file with the “cat” utility:

$ cat data_1..10.txt 
1
2
3
4
5
6
7
8
9
10

Looks good – let’s make the mapper program. We’ll write it in Ruby without using any external math library. First we’ll write a function that determines whether a number is a prime or not, and then we’ll write a loop that handles one line at a time. We print out all numbers to make the mapper as generic as possible (we might want to combine it with a reduce function interested in the non-primes later on).

#!/usr/bin/env ruby

# try to find evidence of not a prime and return false 
# otherwise return true
def is_prime? n
  return false if n < 2
  (2..(n -1)).each do |d|
    return false if (n / d.to_f) % 1 == 0
  end
  true
end

# read each line and spit out the number and "true" or "false" 
# whether the number is a prime or not, separate the two columns 
# with a comma
ARGF.each_line do |l|
  number = l.to_i
  puts [number,is_prime?(number)].join(',')
end

Make it executable with chmod:

$ chmod +x mapper.rb

Let’s try it out. The “|” redirects the output of “cat” not to a file, but to another program, in this case our mapper program.

$ cat data_1..10.txt | ./mapper.rb 
1,false
2,true
3,true
4,false
5,true
6,false
7,true
8,false
9,false
10,false

Time to write something to compile the result; the reducer. It’ll sum up all the prime numbers and print out the result:

#!/usr/bin/env ruby

prime_sum = 0
ARGF.each_line do |l|
  arr = l.chomp.split(",")
  prime_sum += arr.first.to_i if arr.last == "true"
end

puts "The sum of the primes is #{prime_sum}"

Let’s try out the whole chain by piping everything in a chain:

$ cat data_1..10.txt | ./mapper.rb | ./reducer.rb 
The sum of the primes is 17

Seems to work fine! Let’s generate some more source data and make a speed test. This time we’ll generate several source files just to prepare the distribution of the data for once we go parallel:

$ mkdir src
$ ruby -e "(10000..20000).each { |x| puts x }" > src/10000-20000.txt
$ ruby -e "(20001..30000).each { |x| puts x }" > src/20001-30000.txt

$ time cat src/* | ./mapper.rb | ./reducer.rb
The sum of the primes is 39939468

real	0m19.718s
user	0m19.632s
sys	0m0.070s

So let’s see if we can speed this up a little by running it in parallel, first we’ll need to make a simple bash script to be able to spawn concurrent processes. Here’s the simplest possible script that has some measure of safety. I have 2 cores on this machine so I’ll limit this run to two concurrent processes. If we would spawn too many processes the machine might become overburdened and start processing very slowly.

#!/bin/bash

PARALLEL_JOBS=2

count=0
for item in src/*; do
  cat $item | ./mapper.rb  &
  let count+=1
  [[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
done

Let’s try it out:

$ time ./process_parallel.sh | ./reducer.rb 
The sum of the primes is 39939468

real	0m12.582s
user	0m19.779s
sys	0m0.115s

It’s almost twice as speedy! Good improvement. Notice that the “user” time which is time spent by the system is the same, but the “real” time is faster. With more processors you’ll gain more, and it’s pretty easy to just pipe this together in the shell.

A core idea in Unix is to make small utilities that do one thing (really well) and then combine their input and output with pipes. The map/reduce thinking is inherent in unix as we discussed on our upcoming issue #2 of Follow the Data podcast which we’ll soon release.

Want to run this on Hadoop? Since we wrote both the mapper and the reducer so that they work by reading and writing to streams we can just plug these into the Hadoop Streaming API. If you need to develop Hadoop streaming jobs, the process of doing that is pretty much outlined in this tutorial.

If you would be willing to save the data temporarily to disk it would also be possible to use the inherent parallelism support in the “make” utility on a Unix system and write a Makefile. If run with the -j option it processes whatever steps it can in parallel. However the Makefile syntax is kind of hard to read and I think that we would lose the possibility to pipe between a multiple mappers and a single reducer. If you can think of a way to do this with make, please chime in and drop a comment. A good practice when working with processing data is to make it as automatic and repeatable as possible, so I really like trying to make the process of compiling data as similar to compiling programs as possible, since there’s excellent tools developed for keeping the software builds consistent.

Post Navigation

Follow

Get every new post delivered to your Inbox.

Join 46 other followers