Cascading Tutorial
Paco Nathan
The tutorial begins with a quick pre-flight check: set up and test your environment, on either a laptop or the cloud. We'll cover a brief history of Cascading and related open-source projects (Cascalog, Scalding, etc.), plus an overview of typical use cases. Then we'll build and run the simplest possible Cascading app, using it to define the most commonly used components of data pipelines.

We'll explore some of the theory that supports the use of abstraction layers for Hadoop: deterministic vs. non-deterministic query planners, aspects of functional programming, pattern language, and literate programming, along with the software engineering considerations of Hadoop system integration, operationalizing apps, and design patterns for bringing enterprise teams together.

Then we'll work through a progression of sample apps, each building upon the last to show more sophisticated pipelines and explore more components of Cascading (Word Count, Customized Operation, Joins at scale), along with comparisons to similar constructs in Hive and Pig. We'll finish the progression with a full implementation of TF-IDF (search index) in Cascading, and show how to instrument and test the app.

Branching out into other languages, we will also compare Word Count in Cascalog and Scalding, then work through examples using ANSI SQL (Lingual) and PMML (Pattern). We'll conclude by reviewing a case study: using Cascalog on Open Data from the City of Palo Alto.

Prerequisites: Bash command line, some programming in Java, plus familiarity with Git/GitHub.

Note: This class is part lecture and part hands-on; you are required to bring a laptop.

Level: Intermediate