Practical Natural Language Processing with Hadoop
Dan Rosanova
Hadoop is a natural choice for storing large volumes of unstructured and semi-structured data, but most analysis performed on that data is still relational or set-based. Gaining deeper insight often requires natural language processing tools that can extract meaning from the text itself. This class focuses on using the Python Natural Language Toolkit (NLTK), an open source NLP framework, on top of Hadoop to analyze consumer sentiment in social media and email interactions.
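
One common way to run Python text processing on Hadoop is Hadoop Streaming, where the map and reduce steps are ordinary programs that exchange tab-separated key/value lines. The sketch below assumes that setup and is not taken from the class materials. The record layout (source, tab, message), the tiny LEXICON, and the function names are all hypothetical, and the toy scorer merely stands in for whatever Natural Language Toolkit based sentiment model a real job would call.

```python
from itertools import groupby

# Hypothetical toy lexicon; a real job would call an NLTK-based scorer instead.
LEXICON = {"great": 1, "love": 1, "helpful": 1,
           "slow": -1, "broken": -1, "terrible": -1}


def score(text):
    """Sum lexicon weights over whitespace-separated, lowercased tokens."""
    return sum(LEXICON.get(tok.strip(".,!?").lower(), 0) for tok in text.split())


def mapper(lines):
    """Map step: 'source<TAB>message' becomes 'source<TAB>score'."""
    for line in lines:
        source, sep, message = line.rstrip("\n").partition("\t")
        if sep:  # skip malformed records that have no tab separator
            yield f"{source}\t{score(message)}"


def reducer(lines):
    """Reduce step: input arrives sorted by key; sum the scores per source."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for source, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{source}\t{sum(int(value) for _, value in group)}"


# Local dry run that mimics the map -> sort -> reduce flow.
records = [
    "email\tGreat service, love it!",
    "twitter\tApp is slow and broken",
    "email\tTerrible wait",
]
totals = list(reducer(sorted(mapper(records))))
```

In a streaming job, each function would sit in its own script that reads sys.stdin and prints one line per result. Hadoop performs the sort between the two steps, which is why the reducer can rely on grouped keys.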

Natural language processing is a complex problem space that promises a deeper understanding of human actions and decisions. The promise of Big Data rests largely on the ability to understand unstructured data quickly, which has proven to be beyond the scope of most solutions. This class introduces the tools (NLTK, HDP, and Excel) needed to perform real-world NLP, as well as techniques for dealing with the richness of language, such as noun phrase chunking and context-free grammars. An insurance claims filtering example drawn from recent research shows how these tools help companies reduce costs by automating claims adjudication tasks that are normally manual.
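
To make noun phrase chunking concrete, here is a minimal hand-rolled sketch in plain Python. It is a stand-in for illustration, not the NLTK chunker the class uses. It assumes the tokens have already been part-of-speech tagged with Penn Treebank style tags, and it groups an optional determiner, any adjectives, and one or more nouns into a phrase. The claim sentence and its tags are invented for the example.

```python
def np_chunks(tagged):
    """Return noun phrases matching: optional DT, any JJ*, one or more NN*."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":              # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):   # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):   # one or more nouns
            k += 1
        if k > j:                             # at least one noun was found
            chunks.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:                                 # no phrase starts here
            i += 1
    return chunks


# Hypothetical, hand-tagged sentence from an insurance claim note.
claim = [
    ("The", "DT"), ("insured", "JJ"), ("driver", "NN"), ("reported", "VBD"),
    ("a", "DT"), ("cracked", "JJ"), ("rear", "JJ"), ("windshield", "NN"),
    ("after", "IN"), ("the", "DT"), ("hail", "NN"), ("storm", "NN"),
]
phrases = np_chunks(claim)
# phrases == ["The insured driver", "a cracked rear windshield", "the hail storm"]
```

Pulling out phrases such as "a cracked rear windshield" gives a filter something more specific to match on than individual words, which is the kind of signal a claims triage rule could use.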

Level: Advanced