Introduction and Best Practices for Storing and Analyzing Your Data with Apache Hive
Mark Grover
This tutorial introduces Apache Hive and the best practices for storing and analyzing data in it. Hive is an open-source data-warehousing system built on top of Apache Hadoop that lets you query, mine and analyze the data stored in Hadoop clusters using familiar SQL-like queries.

The tutorial walks through a hands-on exercise in using Hive queries for data analysis. Because not all analysis can be expressed in SQL-like queries, it also covers how to write, test and use User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Hive. It then goes through some of the best practices for partitioning, bucketing and joining datasets in Hive.

You will also learn how to leverage other technologies in the Hadoop ecosystem, such as plugging Map/Reduce scripts from Hadoop directly into your Hive queries and integrating HBase with Hive to share data across the two systems. The tutorial will wrap up with a question-and-answer session.

Note: For this tutorial, you are required to bring a laptop with Apache Hadoop and Apache Hive installed. The best and easiest way to get started is to download a Demo VM with Hadoop and Hive already installed and configured. You may download such a Demo VM from ccp.cloudera.com/display/SUPPORT/Cloudera+QuickStart+VM. VMware, KVM and VirtualBox images are all available at that link. Also, please clone the Git repository at github.com/markgrover/bdtc-hive onto the demo VM before you come to the tutorial.

Level: Intermediate