Introduction to Apache Pig, Parts I & II
Daniel Eklund
This two-part class provides an intensive introduction to Pig for data transformations. You will learn how to use Pig to manage data sets in Hadoop clusters, using an easy-to-learn scripting language. The specific topics of the 120-minute class will be calibrated to your needs, but we will generally cover:
  • What is Pig and why would I use it?
  • Understanding the basic concepts of data structures in Pig
  • Understanding the basic language constructs in Pig. We'll also create basic Pig scripts.
Prerequisites: This class will be taught in a Linux environment, using the Pig command-line interface (CLI). Please come prepared with the following:
  • Linux shell experience; the ability to log into Linux servers and use basic Linux shell (bash) commands is required
  • Basic experience connecting to an Amazon EC2/EMR cluster via SSH
  • Windows users should be familiar with Cygwin and PuTTY
  • Basic knowledge of vi is helpful but not required
Also, bring your laptop with the following software installed in advance:
  • PuTTY (Windows only): You will log in to a remote cluster for this class. Mac OS X and Linux environments include SSH (Secure Shell) support; Windows users will need to install PuTTY. Download putty.zip from here.
  • A text editor: An editor suitable for editing source code, such as Pig scripts. On Windows, WordPad (but not Word) or Notepad++ (but not Notepad) is suitable.
Level: Overview