Oozie: A Workflow Scheduler for Hadoop
Boris Lublinsky
This class will start with a discussion of Oozie’s role in the Apache Hadoop ecosystem and its relationship to other Hadoop platform components. We will then describe the most straightforward use cases where Oozie can be helpful or even necessary. From there we move to the Oozie workflow specification and its components, illustrated with examples of various Oozie actions (java, map-reduce, hive, pig, sqoop).

Then we will describe the Oozie architecture. We will present the main Oozie components, the job life cycle, and retry and recovery. We will demonstrate the Oozie management console and how it complements other Hadoop GUI tools (MapReduce Administration, Task Tracker, NameNode viewer, Fair Scheduler Administration, and Log File View).

We will then discuss Oozie job parameterization, the expression language, job configuration, placement of runtime artifacts, and job submission using the Oozie command-line interface (CLI). We will demonstrate how to check the status of Oozie jobs and how to stop them from the CLI. We will also briefly describe the Oozie APIs (Java and REST) and provide code fragments. After that, we will present the Oozie coordinator and discuss how Oozie expresses dependencies between jobs and groups of jobs using “Synchronous Datasets.” We will touch on Oozie SLA support.

Then we will describe Oozie bundles and demonstrate the whole hierarchy of Oozie artifacts: from bundles to coordinators to workflows to actions to processes running on Hadoop clusters. In conclusion, we will discuss Oozie's limitations and ways to overcome some of them through extension and customization. We will also talk about new features in future versions.

Level: Intermediate