All sorts of things can go wrong in your data center. Ensuring that your Hadoop systems stay up through various types of threats, from individual node failures to full site failures, is vital to meeting SLAs and ensuring a high quality of experience for clients of your production distributed system. We will discuss the threat models that need to be handled and the elements of building highly available architectures for key system services.
In this class, we will shed light on what it takes to keep different system services alive, starting from the core Hadoop services of HDFS and MapReduce and extending to higher-level applications such as Hive. This understanding will help you evaluate the risk profile and the cost of providing different SLAs for your entire Hadoop system. You will learn about:
- Worker node versus master node failure characteristics and tolerance levels of various services in the Hadoop ecosystem
- Vital components of each service as they pertain to availability (metadata, data, databases, ZooKeeper quorum, etc.)
- What it takes to set up backup nodes versus backup clusters
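To make the master-node discussion concrete, here is a minimal sketch of an HDFS NameNode high-availability configuration using the Quorum Journal Manager, one common way to remove the NameNode as a single point of failure. The nameservice ID (`mycluster`) and all hostnames are placeholders, and a real deployment needs additional settings (fencing, and the ZooKeeper quorum in `core-site.xml` for automatic failover):

```xml
<!-- hdfs-site.xml (fragment) — illustrative only; hostnames are hypothetical -->
<configuration>
  <!-- Logical name for the HA nameservice -->
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <!-- The two NameNodes that back this nameservice -->
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>nn2.example.com:8020</value>
  </property>
  <!-- Shared edit log stored on a JournalNode quorum (typically 3+ nodes) -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
  </property>
  <!-- Lets HDFS clients fail over to whichever NameNode is active -->
  <property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
</configuration>
```

Note the pattern this illustrates: the service's vital metadata (the edit log) is replicated across a small quorum so that losing the active master does not lose state, which is exactly the kind of per-service analysis the class walks through.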