Large-scale Entity Extraction, Disambiguation and Linkage
Dr. Flavio Villanustre
Large-scale entity extraction, disambiguation and linkage in Big Data can challenge the traditional methodologies developed over the last three decades. Entity linkage in particular is the cornerstone for a wide spectrum of applications, such as Master Data Management, Data Warehousing, Social Graph Analytics, Fraud Detection, and Identity Management. Traditional rules-based heuristic methods usually don't scale properly, are language-specific and require significant maintenance over time.

We will introduce the audience to the use of probabilistic record linkage (also known as specificity-based linkage) on Big Data, to perform language-independent large-scale entity extraction, resolution and linkage across diverse sources.

The benefit of specificity-based linkage is that it does not use hand-coded user rules. Instead, it determines the relevance (weight) of each field within the linking process from a mathematical model derived from the input data, which is key to the overall efficiency of the method.
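
To make the idea of data-derived field weights concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method presented in the session: it assumes a value's weight is log2(total / count), so rarer values count for more, and that two records are scored by summing the weights of the fields on which they agree exactly. The names field_value_weights and match_score, and the sample records, are hypothetical.

```python
import math
from collections import Counter

def field_value_weights(records, field):
    """Weight each value of `field` by its rarity: log2(total / count).

    Assumption for illustration: rare values are more specific, so they
    contribute more evidence that two records refer to the same entity.
    """
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return {value: math.log2(total / n) for value, n in counts.items()}

def match_score(a, b, weights_by_field):
    """Sum the weights of the fields on which the two records agree exactly."""
    score = 0.0
    for field, weights in weights_by_field.items():
        if a.get(field) and a.get(field) == b.get(field):
            score += weights.get(a[field], 0.0)
    return score

# Hypothetical sample data, already cleansed and lowercased.
records = [
    {"first": "john", "last": "smith", "city": "atlanta"},
    {"first": "john", "last": "villanustre", "city": "atlanta"},
    {"first": "mary", "last": "smith", "city": "atlanta"},
    {"first": "john", "last": "smith", "city": "boston"},
]
weights = {f: field_value_weights(records, f) for f in ("first", "last", "city")}
score = match_score(records[0], records[3], weights)
```

In this sketch no rule says that a surname matters more than a city; the weights come entirely from how often each value occurs in the input, so an agreement on a rare surname outweighs an agreement on a common one.
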
We will also present a live demonstration reviewing the different steps required during the data-integration process (ingestion, profiling, parsing, cleansing, standardization and normalization), and show the basic concepts behind probabilistic record linkage on a real-world application.
Record linking fits into a general class of data processing known as data integration, which can be defined as the problem of combining information from multiple heterogeneous data sources. Data integration can include data preparation steps such as profiling, parsing, cleansing, normalization and standardization of the raw input data prior to record linkage, to improve the quality of the input data and to make it more consistent and comparable. (These data preparation steps are sometimes referred to as ETL, or Extract, Transform, Load.)
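
As a rough illustration of the cleansing and normalization steps, the following Python sketch makes field values comparable before linkage. The specific rules (strip accents, lowercase, replace punctuation with spaces, collapse whitespace, drop empty fields) are assumptions chosen for the example, and the names normalize_text and prepare_record are hypothetical.

```python
import re
import unicodedata

def normalize_text(value):
    """Strip accents, lowercase, replace punctuation with spaces, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(cleaned.split())

def prepare_record(raw):
    """Normalize every non-empty string field; drop fields that are blank."""
    return {
        field: normalize_text(value)
        for field, value in raw.items()
        if value and value.strip()
    }

# Hypothetical raw input record.
prepared = prepare_record({"name": "  José   O'Neil, Jr. ", "city": "  "})
```

After this step, values that differ only in case, accents, punctuation or spacing compare as equal, which is what makes the later field-by-field comparison meaningful.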

This class is sponsored by LexisNexis.

Level: Overview