Cassandra + S3 + Hadoop = Quick Auditing and Analytics
The Cassandra database is an excellent choice when you need scalability and high availability without compromising performance. Its linear scalability, proven fault tolerance, and tunable consistency, combined with a write-optimized storage engine, make it an attractive choice for structured logging of application and transactional events. But using a column-family store like Cassandra for analytical needs poses its own problems, problems we solved through careful construction of column families combined with judicious use of Hadoop.
Our system needed to support both a high volume of structured, distributed writes and broad analytical capabilities. Unlike SQL databases, Cassandra does not support ad hoc queries; data typically must be structured and denormalized at write time to match the queries it will serve. At the same time, decisions must be made based on how often the data is queried, how stale it may be, and the allowable latency before results are returned. Our system handles these different use cases by delegating certain reporting tasks to Hadoop while keeping others in Cassandra itself.
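The write-time denormalization mentioned above can be sketched in plain Python, using in-memory maps as stand-ins for column families. The event fields and table names here are illustrative assumptions, not from the tutorial itself; the point is only that each event is written once per read path, so reads never need joins or ad hoc filtering:

```python
# Sketch of write-time denormalization. The dicts below stand in for
# two denormalized column families, one per query pattern.
from collections import defaultdict

events_by_user = defaultdict(list)  # serves "events for user X"
events_by_type = defaultdict(list)  # serves "events of type Y"

def record_event(user_id, event_type, payload):
    """Write the same event to every table that will serve a read path."""
    event = {"user": user_id, "type": event_type, "payload": payload}
    events_by_user[user_id].append(event)
    events_by_type[event_type].append(event)

record_event("alice", "login", {"ip": "10.0.0.1"})
record_event("alice", "purchase", {"sku": "B-42"})
record_event("bob", "login", {"ip": "10.0.0.2"})

print(len(events_by_user["alice"]))  # 2
print(len(events_by_type["login"]))  # 2
```

The cost is extra writes per event, which is exactly the trade-off Cassandra's write-optimized engine is well suited to absorb.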
This tutorial focuses on building a similar system from scratch, showing how to perform analytical queries in near real time while still getting the benefits of Cassandra's high-performance database engine. The key subjects are:
• The splendors and miseries of NoSQL
• Apache Cassandra use cases
• Difficulties of using Map/Reduce directly in Cassandra
• Amazon cloud solutions: Elastic MapReduce and S3
• “Real-enough” time analysis
In particular, the tutorial dives into ways of handling different kinds of semi-ad hoc queries in Cassandra, as well as the pitfalls of designing a schema around a specific analytics use case. Particular attention is paid to time-series data, which can present real problems in column-family or key-value store databases.
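One common answer to the time-series problem in a column-family store is to bucket the row key by a fixed time window so that no single row grows without bound. The sketch below is a minimal illustration under assumed conventions (the key format and 24-hour bucket size are hypothetical, not from the tutorial):

```python
# Time-bucketed row keys: combine the series id with the start of its
# time bucket so a series is spread across many bounded rows.
from datetime import datetime, timezone

def bucketed_row_key(series_id: str, ts: datetime, bucket_hours: int = 24) -> str:
    """Return 'series_id:<bucket start in epoch hours>' for the event time."""
    epoch_hours = int(ts.timestamp() // 3600)
    bucket_start = (epoch_hours // bucket_hours) * bucket_hours
    return f"{series_id}:{bucket_start}"

t1 = datetime(2012, 5, 1, 3, 0, tzinfo=timezone.utc)
t2 = datetime(2012, 5, 1, 22, 0, tzinfo=timezone.utc)
t3 = datetime(2012, 5, 2, 1, 0, tzinfo=timezone.utc)

# Same UTC day -> same row; next day -> new row.
print(bucketed_row_key("sensor-7", t1) == bucketed_row_key("sensor-7", t2))  # True
print(bucketed_row_key("sensor-7", t2) == bucketed_row_key("sensor-7", t3))  # False
```

A range query for a time window then touches only the handful of rows whose buckets overlap that window, keeping reads cheap and rows from becoming hotspots.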
Level: Advanced