Matrix Methods with Hadoop

David Gleich

Get a brief introduction to thinking about data problems as matrices, and then learn how to implement many of these algorithms in the Hadoop streaming framework. The data-as-matrix paradigm has had a rich history, and the point of this talk is to give folks some idea of which statistical algorithms are likely to be reasonably efficient in MapReduce, and which are probably not going to be so reasonable.

This will involve a few ideas:

• How to store matrix data, and the performance tradeoffs involved. The idea is to take natural ways of storing data for a method and describe them as ways of storing a matrix. This gives some insight into how a method could be made fast.

• How to implement some basic matrix operations that form the basis of many numerical and statistical algorithms.

• Problems we've run into working with matrix data, and how we've solved some of them.

• Ideas for future platforms that are ideal for this case.

• A concern about numerical accuracy for Big Data.

Most of the code samples will use Python interfaces to Hadoop streaming, such as Dumbo, mrjob, or Hadoopy.
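To give a flavor of what these streaming-style matrix operations look like, here is a minimal sketch (in plain Python, not actual code from the talk) of a sparse matrix-vector product y = Ax, with A stored as (row, column, value) triples. The function names and the assumption that x fits in memory as a broadcast side file are illustrative choices, not part of the talk itself.

```python
from collections import defaultdict

# Matrix A stored as one "row col value" triple per input line; the
# vector x is assumed small enough to ship to every mapper as a side file.

def map_triples(lines, x):
    """Map step: each nonzero A[i][j] emits the pair (i, A[i][j] * x[j])."""
    for line in lines:
        i, j, value = line.split()
        yield int(i), float(value) * x[int(j)]

def reduce_sums(pairs):
    """Reduce step: sum the partial products grouped by row index i."""
    y = defaultdict(float)
    for i, partial in pairs:
        y[i] += partial
    return dict(y)

if __name__ == "__main__":
    # A = [[1, 2], [3, 4]] as triples, x = [1, 1]  ->  y = [3, 7]
    triples = ["0 0 1", "0 1 2", "1 0 3", "1 1 4"]
    x = [1.0, 1.0]
    print(reduce_sums(map_triples(triples, x)))  # {0: 3.0, 1: 7.0}
```

In a real Hadoop streaming job the shuffle phase performs the grouping by row index that `reduce_sums` does here in memory.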

Prerequisites:

• You ought to have a rough handle on what a matrix represents.

• Those with a linear algebra background will probably get more out of this talk, but I'll explain any linear algebra property with a statistical or data-oriented analog.

• You should know how to use MapReduce or Hadoop; most of the code examples will use Hadoop streaming via Dumbo. Example slides: www.slideshare.net/dgleich/mapreduce-for-scientific-simulation-analysis

Example code: github.com/dgleich/mrmatrix

Level: Advanced