A little tutorial on mapreduce.

This is a short tutorial explaining the concept of map/reduce. It can be followed on any Unix system, such as Linux or OS X. We’ll first process the data sequentially and then with parallel mapper tasks. As a simple example we’ll compile a list of prime numbers from some text files containing numbers (some prime, some not) and then calculate the sum of all the primes found. Finding primes can be parallelized and is thus the map side of the algorithm; calculating the sum cannot, and is therefore our reduce function.

Let’s start out by creating some test data that is easy to debug, and small, so it’ll run fast. We’ll do this in a terminal shell using Ruby. The -e option tells Ruby to evaluate the string, and the “>” redirects the output to the file named after it.

$ ruby -e "(1..10).each { |x| puts x }" > data_1..10.txt

We can look at the file with the “cat” utility:

$ cat data_1..10.txt 
1
2
3
4
5
6
7
8
9
10

Looks good – let’s write the mapper program. We’ll write it in Ruby without using any external math library. First we’ll write a function that determines whether a number is prime, and then a loop that handles one line at a time. We print out all numbers, not just the primes, to keep the mapper as generic as possible (we might want to combine it with a reduce function interested in the non-primes later on).

#!/usr/bin/env ruby

# try to find evidence that n is not a prime and return false,
# otherwise return true
def is_prime? n
  return false if n < 2
  (2..(n - 1)).each do |d|
    # n is not prime if it is evenly divisible by d
    return false if n % d == 0
  end
  true
end

# read each line and print the number plus "true" or "false"
# depending on whether the number is prime, separating the two
# columns with a comma
ARGF.each_line do |l|
  number = l.to_i
  puts [number, is_prime?(number)].join(',')
end

Make it executable with chmod:

$ chmod +x mapper.rb

Let’s try it out. The “|” pipes the output of “cat” not to a file, but to another program, in this case our mapper.

$ cat data_1..10.txt | ./mapper.rb 
1,false
2,true
3,true
4,false
5,true
6,false
7,true
8,false
9,false
10,false

Time to write something to compile the result: the reducer. It’ll sum up all the prime numbers and print out the result:

#!/usr/bin/env ruby

prime_sum = 0
ARGF.each_line do |l|
  # each line looks like e.g. "7,true"
  arr = l.chomp.split(",")
  # add the number to the sum only if the mapper flagged it as prime
  prime_sum += arr.first.to_i if arr.last == "true"
end

puts "The sum of the primes is #{prime_sum}"

Let’s try out the whole thing by piping everything together in a chain:

$ cat data_1..10.txt | ./mapper.rb | ./reducer.rb 
The sum of the primes is 17

Seems to work fine! The primes up to 10 are 2, 3, 5 and 7, which indeed sum to 17. Let’s generate some more source data and run a speed test. This time we’ll generate several source files, to prepare for distributing the data once we go parallel:

$ mkdir src
$ ruby -e "(10000..20000).each { |x| puts x }" > src/10000-20000.txt
$ ruby -e "(20001..30000).each { |x| puts x }" > src/20001-30000.txt

$ time cat src/* | ./mapper.rb | ./reducer.rb
The sum of the primes is 39939468

real	0m19.718s
user	0m19.632s
sys	0m0.070s

So let’s see if we can speed this up a little by running it in parallel. First we’ll need a simple bash script, process_parallel.sh, that can spawn concurrent processes. Here’s the simplest possible script that has some measure of safety: I have 2 cores on this machine, so I’ll limit this run to two concurrent processes. If we spawned too many processes the machine might become overburdened and start processing very slowly.

#!/bin/bash

PARALLEL_JOBS=2

count=0
for item in src/*; do
  # start a mapper for this file in the background
  cat "$item" | ./mapper.rb &
  let count+=1
  # every PARALLEL_JOBS jobs, wait for the current batch to finish
  [[ $((count%PARALLEL_JOBS)) -eq 0 ]] && wait
done
# wait for any mappers still running
wait
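
As an aside, you don’t have to hard-code the number of jobs; on both Linux and OS X you can ask getconf for the number of online processors (a small sketch, assuming your getconf supports this variable):

PARALLEL_JOBS=$(getconf _NPROCESSORS_ONLN)

For this run we’ll stick with the hard-coded 2.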

Let’s try it out:

$ time ./process_parallel.sh | ./reducer.rb 
The sum of the primes is 39939468

real	0m12.582s
user	0m19.779s
sys	0m0.115s

It’s almost twice as fast – a good improvement! Notice that the “user” time, which is the total CPU time consumed, stays roughly the same, but the “real” (wall-clock) time drops. With more processors you’ll gain more, and it’s pretty easy to just pipe this together in the shell.

A core idea in Unix is to make small utilities that do one thing (really well) and then combine their input and output with pipes. The map/reduce way of thinking is inherent in Unix, as we discuss in the upcoming issue #2 of the Follow the Data podcast, which we’ll release soon.

Want to run this on Hadoop? Since we wrote both the mapper and the reducer to read from and write to streams, we can plug them straight into the Hadoop Streaming API. If you need to develop Hadoop streaming jobs, the process is pretty much the one outlined in this tutorial.
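
As a rough sketch (the exact path to the streaming jar varies between Hadoop versions and distributions, and the HDFS paths here are made up for illustration), the job submission might look something like this:

$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/numbers \
    -output /data/primes \
    -mapper mapper.rb \
    -reducer reducer.rb \
    -file mapper.rb \
    -file reducer.rb

The -file options ship the two Ruby scripts out to the worker nodes, and Hadoop takes care of splitting the input and feeding each mapper its share of the lines.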

If you were willing to save the data temporarily to disk, it would also be possible to use the inherent parallelism support in the “make” utility on a Unix system and write a Makefile. When run with the -j option, make processes whatever steps it can in parallel. However, the Makefile syntax is kind of hard to read, and I think we would lose the possibility of piping between multiple mappers and a single reducer. If you can think of a better way to do this with make, please chime in and drop a comment – a rough sketch follows below. A good practice when processing data is to make it as automatic and repeatable as possible, so I really like making the process of compiling data as similar to compiling programs as possible, since there are excellent tools for keeping software builds consistent.
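
Here is a minimal sketch of what such a Makefile might look like, assuming the mapper output is written to an intermediate mapped/ directory (note that the recipe lines must be indented with actual tabs):

SRC := $(wildcard src/*.txt)
MAPPED := $(SRC:src/%.txt=mapped/%.txt)

.PHONY: sum clean

# run the reducer over all of the intermediate mapper output
sum: $(MAPPED)
	cat $(MAPPED) | ./reducer.rb

# map each source file to an intermediate file on disk
mapped/%.txt: src/%.txt
	mkdir -p mapped
	./mapper.rb < $< > $@

clean:
	rm -rf mapped

Running “make -j2 sum” would then map the source files two at a time and finally pipe the combined intermediate output into the reducer, much like the bash script above, at the cost of the temporary files on disk.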
