What You Can Do with a Tall-and-Skinny QR Factorization on Hadoop
A common style of Big Data dataset has a large number of samples (millions to billions) and up to a few thousand features. One way to view such datasets from an algorithmic perspective is as a tall-and-skinny matrix: one with vastly more rows than columns.
In this class you will learn one of the most widely used matrix decomposition techniques: the QR factorization.
First, you'll see how to use the QR factorization to solve various statistical analysis problems on datasets with many samples. For instance, we'll see how to compute a linear regression on such a dataset, how to compute principal components, and how easily we can implement a new scalable Kernel K-Means clustering method.
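As a small preview of the regression use case, here is how a QR factorization solves a least-squares problem locally with NumPy. This is only an in-memory sketch to show the linear algebra, not the Hadoop version covered in the class; the matrix sizes and data are made up for illustration:

```python
import numpy as np

# A tall-and-skinny least-squares problem: many samples, few features.
rng = np.random.default_rng(0)
A = rng.standard_normal((10000, 5))   # samples x features
x_true = np.arange(1.0, 6.0)          # coefficients we hope to recover
b = A @ x_true                        # noiseless right-hand side

# QR-based least squares: factor A = QR, then solve R x = Q^T b.
Q, R = np.linalg.qr(A)                # thin QR: Q is 10000x5, R is 5x5
x = np.linalg.solve(R, Q.T @ b)
```

Because the right-hand side here is noiseless, `x` recovers `x_true` to machine precision; with noisy data the same two lines give the least-squares fit.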
Next, you will learn how to compute these QR factorizations in Hadoop. We use an algorithm that maps beautifully onto the MapReduce distributed computational engine: it boils down to computing independent QR factorizations in each mapper and reducer stage. One challenge we faced was improving the numerical stability of the routine in order to get a good estimate of the orthogonal matrix factor, which is most important for applications that need a tall-and-skinny matrix singular value decomposition.
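The map-and-reduce idea can be simulated locally in a few lines of NumPy. This is a minimal sketch of the communication pattern only (the function name and block count are illustrative), not the Dumbo/Hadoop-streaming implementation from the class:

```python
import numpy as np

def tsqr_r(A, nblocks=4):
    """Simulate the TSQR pattern: each 'mapper' gets a block of rows,
    computes a local QR, and emits only its small R factor; a single
    'reducer' stacks those R factors and performs one more QR."""
    blocks = np.array_split(A, nblocks, axis=0)
    local_Rs = [np.linalg.qr(block)[1] for block in blocks]  # map stage
    R = np.linalg.qr(np.vstack(local_Rs))[1]                 # reduce stage
    return R

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 4))
R_tsqr = tsqr_r(A)
R_direct = np.linalg.qr(A)[1]
```

The key point is that only the tiny R factors (here 4x4) ever move between stages, never the tall matrix itself. R is unique up to the signs of its rows, so `np.abs(R_tsqr)` matches `np.abs(R_direct)`.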
• You ought to have a rough handle on what a matrix represents.
• Those with a linear algebra background will probably get more out of this talk, but I'll motivate any linear algebra property with a statistical or data-oriented analog.
• You should know how to use MapReduce or Hadoop; most of the code examples will use Hadoop streaming via Dumbo in Python.
• If you attend my other talk "Matrix Methods with Hadoop," you'll get more out of this class, but it isn't required.
• If you are familiar with linear regression, principal components, or Kernel matrices, you'll get more out of this class.
Example slides: www.slideshare.net/dgleich/tallandskinny-qr-factorizations-in-mapreduce-architectures
Example code: github.com/dgleich/mrtsqr
Level: Advanced