Beyond Map/Reduce
Dean Wampler
Apache Hadoop is the current darling of the Big Data world. At its core is the Map/Reduce computing model for decomposing large data-analysis jobs into smaller tasks and distributing those tasks around a cluster. Map/Reduce itself was pioneered at Google for indexing the Web and for other computations over massive data sets.
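
To make the model concrete, here is a minimal sketch of the classic word-count job, written against Hadoop's org.apache.hadoop.mapreduce Java API (Hadoop 2.x; the class and field names are ours): the map tasks emit a (word, 1) pair for every word in their input split, and the reduce tasks sum the counts delivered for each word.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each line of text into words, emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit (word, 1)
        }
      }
    }
  }

  // Reduce phase: all counts for a given word arrive together; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));  // emit (word, total)
    }
  }

  // Driver: wire up the job; input and output paths come from the command line.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-sum on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this simple computation requires a page of boilerplate, which motivates the higher-level tools discussed below.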

The strengths of Map/Reduce are cost-effective scalability and relative maturity. Its weaknesses are its batch orientation, which makes it unsuitable for real-time event processing, and the difficulty of implementing many common data-analysis idioms in the Map/Reduce computing model.

We can address these weaknesses in several ways. In the near term, higher-level programming languages that provide common query and manipulation abstractions make Map/Reduce programs easier to implement. Longer term, however, we need new distributed computing models that are more flexible for different kinds of problems and that provide better real-time performance.
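
As one illustration of what such an abstraction buys you, the same word count can be expressed as a single HiveQL query and submitted over JDBC; Hive compiles the query into Map/Reduce jobs behind the scenes. In this sketch, the connection URL and the docs table with its line column are assumptions for illustration only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveWordCount {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC driver (must be on the classpath).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();
    // Split each line into words, then group and count: Hive turns
    // this query into one or more Map/Reduce jobs for us.
    ResultSet rs = stmt.executeQuery(
        "SELECT word, count(*) AS total "
        + "FROM (SELECT explode(split(line, '\\\\s+')) AS word FROM docs) w "
        + "GROUP BY word");
    while (rs.next()) {
      System.out.println(rs.getString("word") + "\t" + rs.getLong("total"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}

A few lines of query replace the page of hand-written mapper and reducer code above.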

We’ll review these strengths and weaknesses of Map/Reduce and the Hadoop implementation, then discuss several emerging alternatives, such as Cloudera’s Impala system for analytics, Google’s Pregel system for graph processing, and Storm for event processing. We’ll finish with some speculation about the longer-term future of Big Data.

This class is good for developers, data analysts, and managers, but people with Hadoop and/or programming experience will get the most out of it.

Level: Overview