MapReduce is a framework for performing structured computations on large amounts of dataโ€”so much data that compute and storage must be split across multiple machines. Thus, we need to solve problems with coordination, load balancing, etc.

For simplicity, lets abstract the data as key-value pairs . MapReduce, as its name suggests, processes the data in two stages:

  1. Map: takes and produces some intermediate pair(s), essentially organizing them into โ€œstacks.โ€
  2. Reduce: takes groups of pairs with the same key and produces some pairs for the output, essentially โ€œaggregatingโ€ the information for the same key โ€˜.

For example, to find the number of occurrences of each word in a set of documents, the map function reads each documents and maps each word to ; then, the reduce function adds the s together for each .

To design a system, we must consider the limitations and responsibility of each function:

  1. Map: only looks at individual key-value pairs, cannot use other key-value pairs. However, it can emit more than one intermediate key-value pair.
  2. Reduce: aggregates multiple values from the same intermediate key.

Implementation

To implement this system, we need:

  1. ๐Ÿ“ฆ Distributed Storage System to store inputs, outputs, and intermediate results. For example, ๐Ÿ”Ž Google File System (GFS).
  2. A driver program (master), which specifies the input and output locations, the mapper and reducer functions, and more. Notably, it should assign tasks to maximize locality, making nodes do work on data replicas itโ€™s already storing.
  3. The runtime system, which controls nodes and supervises execution.