Introduction and Best Practices for Storing and Analyzing Your Data with Apache Hive
Mark Grover
This tutorial on Apache Hive will introduce Hive and the best practices for storage and data analysis in Hive. Hive is an open-source data warehousing system built on top of Apache Hadoop that lets you query, mine and analyze the data stored in Hadoop clusters using familiar SQL-like queries.

This tutorial will go through a hands-on exercise on how to use Hive queries to perform data analysis. Because not all analysis can be expressed using SQL-like queries, the tutorial will cover how to write, test and use User Defined Functions and User Defined Aggregate Functions in Hive. This tutorial will then go through some of the best practices related to partitioning, bucketing and joining various datasets in Hive.

You will also learn how to leverage other technologies in the Hadoop ecosystem, such as plugging MapReduce scripts from Hadoop directly into your Hive queries, and how to integrate HBase with Hive to share the data across the two systems. The tutorial will wrap up with a question-and-answer session.
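As a rough illustration of what plugging a script into a Hive query can look like, here is a minimal sketch of a streaming script of the kind Hive can invoke through its TRANSFORM clause. Hive passes the selected columns to the script as tab-separated lines on standard input and reads tab-separated lines back from standard output. The column names (user_id, url), the host-extraction logic and the file name extract_host.py are assumptions made up for this example; they are not taken from the tutorial material.

```python
#!/usr/bin/env python
"""Hypothetical streaming script for use with Hive's TRANSFORM clause."""
import sys


def transform_line(line):
    """Map one tab-separated input row (user_id, url) to (user_id, host)."""
    user_id, url = line.rstrip("\n").split("\t")
    # Illustrative logic only: drop the scheme and the path, keep the host.
    host = url.split("//", 1)[-1].split("/", 1)[0]
    return "{0}\t{1}".format(user_id, host.lower())


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write one transformed row per input row."""
    for line in stdin:
        if line.strip():  # skip blank lines
            stdout.write(transform_line(line) + "\n")


if __name__ == "__main__":
    main()
```

Assuming a hypothetical table page_views with columns user_id and url, such a script could be registered with ADD FILE extract_host.py; and then called with SELECT TRANSFORM(user_id, url) USING 'python extract_host.py' AS (user_id, host) FROM page_views;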

Note: For this tutorial, you are required to bring a laptop with Apache Hadoop and Apache Hive installed on it. The best and easiest way to get started is to download a Demo VM with Hadoop and Hive installed and configured on it. VMware, KVM and VirtualBox images of such a Demo VM are available at the following link: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4

Also, please clone the git repository at https://github.com/markgrover/bdtc-hive *on the demo VM* before you come to the tutorial.

Level: Intermediate