MapReduce

October 6, 2009

MapReduce is a programming idiom that provides a convenient expression for programs that combine like items into equivalence classes. The idiom was developed by Google as a way to exploit large clusters of computers operating in parallel on large bodies of data, but is also useful as a way of structuring certain types of programs. Jeffrey Dean and Sanjay Ghemawat, in their paper MapReduce: Simplified Data Processing on Large Clusters, describe the idiom:

Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

Google uses MapReduce to automatically parallelize computations across the large sets of machines at their data centers, gracefully “partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication.” Our aspirations (and our budget) are more modest: build a framework for exploiting the mapreduce idiom in out day-to-day programs. Consider the following examples:

  • Count the frequencies of letters in a string or words in a text. The mapper associates the value 1 with each character or word as the key, and the reducer is simply the addition operator, adding all the 1s to count the words.
  • Produce a cross-reference listing of a program source text. The mapper associates each identifier with the line number where it appears, and the reducer collects the line numbers for each identifier, discarding duplicates.
  • Identify anagrams in a word list. The mapper “signs” each word by sorting its characters into alphabetical order, and the reducer brings together words with common signatures.

The map-reduce function takes four parameters: the mapping function, the reducing function, a less-than predicate that operates on keys, and the input list. The mapping function takes an item from the input list and returns a key/value pair, and the reducing function takes a key, a new value and an existing value and merges the new value into the existing value. A useful variant of the map-reduce function reads input from a file instead of a list; it replaces the input list parameter with a filename and adds a fifth parameter, a reading function that fetches the next input item from the file.

Your task is to write the map-reduce and map-reduce-input functions. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.

Advertisement

Pages: 1 2