Simple Yet Efficient Web Extraction with OXPath, Part I
Tim Furche
Big Data has already changed how we make decisions, whether on pricing, recommendations or investment. However, access to such Big Data is often expensive or limited to large organizations that collect it. Though much is available on the Web, it is often only available through Web Forms and HTML pages.

In this two-part class, we give a thorough overview of large-scale data extraction from the Web, as well as its challenges. The first part gives an overview of existing tools and walks through real-life examples of manual wrappers. In the second part, we delve deeper into data extraction and discuss common patterns and fallacies when creating, maintaining, and running large-scale data-extraction systems.

In the first part, we start with an overview of traditional approaches, outlining their strengths and limitations so that you can more easily decide which tools are most appropriate for your needs. We walk through real-life examples of manual wrapper creation with XPath and WebDriver, the emerging W3C standard for programmatic browser control.
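
To give a flavour of what a manual XPath wrapper looks like, here is a minimal sketch. It is an illustration only, not material from the class: the page snippet, the class names, and the `extract_listings` helper are all hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup. A real wrapper would typically obtain the live page through a browser (for example via WebDriver) and use a fuller XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page. Real pages are rarely this clean
# and usually need an HTML-tolerant parser or a browser.
PAGE = """
<html>
  <body>
    <div class="listing">
      <span class="title">Two-bedroom flat</span>
      <span class="price">1200</span>
    </div>
    <div class="listing">
      <span class="title">Studio</span>
      <span class="price">750</span>
    </div>
    <div class="ad">Sponsored</div>
  </body>
</html>
"""


def extract_listings(page):
    """Return one record per <div class="listing"> found in the page."""
    root = ET.fromstring(page)
    records = []
    # Select every listing container anywhere in the document.
    for node in root.findall(".//div[@class='listing']"):
        records.append({
            # Relative paths are evaluated against the current listing node.
            "title": node.findtext("span[@class='title']"),
            "price": int(node.findtext("span[@class='price']")),
        })
    return records


listings = extract_listings(PAGE)
for record in listings:
    print(record["title"], record["price"])
```

Note how the wrapper is tied to the page's exact structure: renaming a class or nesting the spans one level deeper would silently break it. This kind of fragility is one of the maintenance costs of manual wrappers.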

Finally, we show that extracting Big Data from the Web doesn’t have to be hard or costly. We will show you how to extract data with just a little knowledge of XPath. That’s all you need to get started with OXPath, a high-level, high-performance extension of XPath for efficient data extraction from any website. OXPath extends XPath with four well-defined extensions, including the ability to simulate user actions and to select elements of a Web page through their appearance. This allows for easy navigation through complex Web applications, and reduces maintenance in the face of structural page changes.

Level: Intermediate