Data Transfer Tools for Hadoop
Sriram Mohan
Hadoop has become the preeminent platform for storage and large-scale processing of data sets on clusters of commodity hardware. Organizations are increasingly seeking to take advantage of the batch processing capabilities of the Hadoop ecosystem for efficiency and direct cost savings. However, these same organizations are wrestling with moving data from their current data stores to Hadoop and back. In this hands-on half-day tutorial, you will be led through the following common use cases for moving data between existing data stores and Hadoop:
  • Moving event data and structured data to Hadoop clusters using Flume: This part will explain the capabilities of Apache Flume and provide examples of using it to move event data, such as web server access logs and other structured log files, into HDFS; a sample agent configuration is sketched after this list.
  • Moving relational data to and from Hadoop using Sqoop: This part will explain the capabilities of Apache Sqoop, a tool designed for efficient bulk transfer of data between Hadoop and structured data stores such as relational databases; a representative import command is shown after this list. The session will also highlight features of the upcoming Sqoop 2 platform.
  • Moving data to and from an enterprise NoSQL database: This part will provide a detailed overview of the capabilities of the MarkLogic Connector for Hadoop, such as: a) parallel loading from HDFS to MarkLogic, b) leveraging MarkLogic’s indexes for MapReduce processing, and c) parallel reads and writes between a MapReduce job and a MarkLogic database; a brief driver sketch follows this list.
  • Moving data to and from MongoDB: This part will provide a detailed overview of the MongoDB Connector for Hadoop, a plug-in that allows MongoDB to be used as an input source and/or an output destination for Hadoop jobs; a MapReduce driver sketch follows this list.
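To give a flavor of the Flume use case, the sketch below shows a minimal single-agent configuration that tails a web server access log and delivers the events to date-partitioned HDFS directories. Host names, file paths, and component names are placeholders, not the exact configuration used in the tutorial.

    # agent1: tail an Apache access log and deliver events to HDFS
    agent1.sources  = access-log
    agent1.channels = mem-channel
    agent1.sinks    = hdfs-sink

    # Source: follow the access log as new lines are appended
    agent1.sources.access-log.type     = exec
    agent1.sources.access-log.command  = tail -F /var/log/httpd/access_log
    agent1.sources.access-log.channels = mem-channel

    # Channel: buffer events in memory between source and sink
    agent1.channels.mem-channel.type     = memory
    agent1.channels.mem-channel.capacity = 10000

    # Sink: write events into date-partitioned HDFS directories
    agent1.sinks.hdfs-sink.type          = hdfs
    agent1.sinks.hdfs-sink.channel       = mem-channel
    agent1.sinks.hdfs-sink.hdfs.path     = hdfs://namenode:8020/flume/access/%Y-%m-%d
    agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true

An agent defined this way would be started with: flume-ng agent --name agent1 --conf-file access-log.conf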
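For the Sqoop use case, a representative import of a single relational table into HDFS might look like the following; the JDBC connection string, credentials, table, and target directory are all placeholders:

    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user \
      --password-file /user/etl/.dbpass \
      --table orders \
      --target-dir /data/sales/orders \
      --num-mappers 4

The complementary sqoop export command moves data in the opposite direction, from files in HDFS back into a relational table.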
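For the MarkLogic use case, the sketch below outlines a map-only MapReduce driver that pulls documents out of a MarkLogic database in parallel and writes them to HDFS as text. The class and property names follow the MarkLogic Connector for Hadoop documentation as best recalled here; the connection details are placeholders, and the exact input format and value types should be verified against the connector version used in the tutorial.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import com.marklogic.mapreduce.DocumentInputFormat;   // assumed connector class name

    public class MarkLogicExportDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connection details for the source MarkLogic database (placeholders).
        conf.set("mapreduce.marklogic.input.host", "marklogic-host");
        conf.set("mapreduce.marklogic.input.port", "8000");
        conf.set("mapreduce.marklogic.input.username", "hadoop-user");
        conf.set("mapreduce.marklogic.input.password", "secret");

        Job job = Job.getInstance(conf, "marklogic export");
        job.setJarByClass(MarkLogicExportDriver.class);
        // Each input split reads a range of documents directly from MarkLogic,
        // keyed by document URI, so mappers pull data in parallel.
        job.setInputFormatClass(DocumentInputFormat.class);
        job.setMapperClass(Mapper.class);      // identity mapper: pass documents through
        job.setNumReduceTasks(0);              // map-only job
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/marklogic-export"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }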
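For the MongoDB use case, the sketch below shows a small word-count job that reads documents from one MongoDB collection and writes aggregated counts back to another, using the connector's input and output formats (mongo-hadoop). The collection URIs and the "body" field are hypothetical; the MongoInputFormat/MongoOutputFormat wiring is the point of the example.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;
    import com.mongodb.hadoop.MongoOutputFormat;
    import com.mongodb.hadoop.util.MongoConfigUtil;

    public class MongoWordCount {
      // The connector hands each mapper the document _id as the key and the
      // full document as a BSONObject value.
      public static class TokenMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(Object id, BSONObject doc, Context ctx)
            throws IOException, InterruptedException {
          Object body = doc.get("body");          // hypothetical field name
          if (body == null) return;
          for (String token : body.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) ctx.write(new Text(token), ONE);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Source and destination collections (placeholder URIs).
        MongoConfigUtil.setInputURI(conf, "mongodb://mongo-host:27017/blog.posts");
        MongoConfigUtil.setOutputURI(conf, "mongodb://mongo-host:27017/blog.word_counts");

        Job job = Job.getInstance(conf, "mongo word count");
        job.setJarByClass(MongoWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setInputFormatClass(MongoInputFormat.class);   // read splits directly from MongoDB
        job.setOutputFormatClass(MongoOutputFormat.class); // write results back to MongoDB
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }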
Each use case will be reinforced with hands-on coding activities and will include best practices drawn from real-world experience using these tools in a variety of projects.

Level: Advanced