Pattern: A Machine-Learning Library for Cascading, Migrating PMML Models to Hadoop
Paco Nathan
Pattern is an open-source project that takes models trained in popular analytics frameworks such as SAS, R, SPSS, and MicroStrategy, and runs them at scale on Apache Hadoop. This machine-learning library works by translating PMML, an established XML standard for predictive model markup, into data workflows based on the Cascading API in Java.
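
To make the idea of translating PMML into a scoring step more concrete, here is a minimal sketch in Python. It is not Pattern's implementation, which builds Cascading workflows in Java. The PMML fragment is hand-written for illustration, and the field names (amount, age, score) and the helper functions load_regression and score are assumptions, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML fragment for a linear regression model.
# The field names and coefficients are made up for illustration.
PMML_DOC = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="age"/>
      <MiningField name="score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.0">
      <NumericPredictor name="amount" coefficient="0.5"/>
      <NumericPredictor name="age" coefficient="-0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

# PMML elements live in a versioned XML namespace.
NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def load_regression(pmml_text):
    """Read the intercept and per-field coefficients from the PMML text."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find(".//pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(row, intercept, coefficients):
    """Apply the linear model to one input record (a dict of field values)."""
    return intercept + sum(coef * row[name] for name, coef in coefficients.items())


intercept, coefficients = load_regression(PMML_DOC)
prediction = score({"amount": 10.0, "age": 4.0}, intercept, coefficients)
```

The point of the sketch is that the model definition is data: the same scoring function can be applied to every record in a large data set once the XML has been read, which is the kind of per-record step a workflow engine can parallelize.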

PMML models can be run in a pre-defined JAR file with no coding required. PMML can also be combined with other flows based on ANSI SQL (Lingual), Scala (Scalding), Clojure (Cascalog), etc. Multiple companies have collaborated to implement parallelized algorithms, including Random Forest, Logistic Regression, SVM, K-Means, and Hierarchical Clustering, with more machine-learning support being added. Benefits include greatly reduced development costs and lower licensing costs at scale, while leveraging a combination of Apache Hadoop clusters, existing intellectual property in predictive models, and the core competencies of analytics staff.

Sample code in the class will show apps using predictive models built in R for anti-fraud classifiers. In addition, examples will show how to compare variations of models for large-scale customer experiments. Portions of this material come from the book "Enterprise Data Workflows with Cascading."

You will learn how to migrate predictive models to run on Hadoop clusters at scale, how to leverage PMML for customer experiments, and how the notion of "ensembles" has enhanced predictive power, as seen in the Netflix Prize, Kaggle, KDD, etc.
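
As a rough illustration of the ensemble idea, the sketch below combines several weak classifiers by majority vote, which is the same basic mechanism a Random Forest uses across its trees. Everything here is hypothetical: the three rule-based models, the record fields (amount, country, card_country, tx_last_hour), and the thresholds are invented for the example and do not come from Pattern or from any real anti-fraud system.

```python
from collections import Counter


# Three hypothetical, deliberately simple classifiers. Each looks at a
# different signal and returns the label "fraud" or "ok".
def by_amount(row):
    return "fraud" if row["amount"] > 1000 else "ok"


def by_country(row):
    return "fraud" if row["country"] != row["card_country"] else "ok"


def by_velocity(row):
    return "fraud" if row["tx_last_hour"] > 5 else "ok"


def majority_vote(models, row):
    """Return the label predicted by the largest number of models."""
    votes = Counter(model(row) for model in models)
    return votes.most_common(1)[0][0]


# An odd number of binary voters avoids ties.
MODELS = [by_amount, by_country, by_velocity]

suspicious = {"amount": 1500, "country": "US", "card_country": "US", "tx_last_hour": 9}
ordinary = {"amount": 20, "country": "FR", "card_country": "US", "tx_last_hour": 1}

suspicious_label = majority_vote(MODELS, suspicious)  # two of three vote "fraud"
ordinary_label = majority_vote(MODELS, ordinary)      # two of three vote "ok"
```

No single rule decides the outcome. In the second record, one model flags the transaction but is outvoted by the other two, which is how an ensemble can smooth over the mistakes of its individual members.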

Level: Overview