Getting Started with R and Hadoop, Parts I and II
Jeffrey Breen
Increasingly viewed as the lingua franca of statistics, R is a natural choice for many data scientists seeking to perform Big Data analytics. And with Hadoop Streaming, the formerly Java-only Big Data system is now open to nearly any programming or scripting language. This two-part class will survey the options for working with Hadoop and R before focusing on the RMR package from the RHadoop project. We will cover the basics of downloading and installing RMR, then test the installation and demonstrate its use by walking through three examples in depth.

You will learn the basics of applying the Map/Reduce paradigm to your analysis, and how to write mappers, reducers, and combiners using R. We will submit jobs to the Hadoop cluster and retrieve results from HDFS. We will explore how the Hadoop infrastructure interacts with your code by tracing the input and output data for each step. Examples will include the canonical "word count" example, as well as the analysis of structured data from the airline industry.
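
As a rough illustration of the mapper, combiner, and reducer roles in the "word count" example, here is a minimal in-memory sketch. It is written in Python rather than R, does not use RMR or Hadoop, and is not taken from the class materials; the names mapper, reducer, group_by_key, run_job, and the sample input are all hypothetical. The grouping function only stands in for the shuffle step that Hadoop performs between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def reducer(word, counts):
    """Sum all counts seen for one word; reused here as the combiner."""
    return (word, sum(counts))


def group_by_key(pairs):
    """Stand-in for the shuffle step: collect values under each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def run_job(splits):
    """Run map and combine on each input split, then shuffle and reduce."""
    combined = []
    for split in splits:
        mapped = chain.from_iterable(mapper(line) for line in split)
        # Combiner: pre-aggregate within one split before the shuffle.
        combined.extend(reducer(k, v) for k, v in group_by_key(mapped).items())
    return dict(reducer(k, v) for k, v in group_by_key(combined).items())


# Hypothetical input: two splits, as if the file were stored in two blocks.
splits = [["the quick brown fox", "the lazy dog"], ["the end"]]
word_counts = run_job(splits)
```

Because summing is associative, the same function can serve as both combiner and reducer here; that shortcut does not hold for every aggregation (a mean, for example, needs a different combiner).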

No specific prerequisite knowledge is required, but familiarity with R and with Hadoop or Map/Reduce is helpful.

Level: Advanced