A Probabilistic Learning Approach to Attribute Value Inconsistency Resolution
Candidate: Ilya Pevzner
Advisor: Arthur Goldberg

Abstract

Resolving inconsistencies in data is a problem of critical practical importance. Inconsistent data arises whenever an attribute takes on multiple, inconsistent, values. This may occur when a particular entity is stored multiple times in one database, or in multiple databases that are combined.

We investigate Attribute Value Inconsistency Resolution (AVIR), the problem of semi-automatically resolving data inconsistencies among multiple database records that describe the same person or thing.

Our survey of the area shows that existing solutions are either limited in scope or impose a significant burden on their users. Either they do not cover all types of inconsistencies and attributes, or they require users to write or choose attribute resolution functions for each potentially conflicting attribute.

Our ML based approach applies to all types of inconsistencies and attributes, and automatically selects appropriate resolution functions based on the conflicting data. We have invented and developed a system, that uses a set of binary features that detect data properties and relationships and resolution functions that merge data. Many such features and resolution functions have been written. The system uses supervised learning with maximum likelihood estimation to determine which function(s) to apply, based on which feature(s) fire.

We have validated our system by comparing its error rate, decision rate and decision accuracy on a test data set to baseline values determined by a clairvoyant application of a standard approach where each potentially conflicting attribute is resolved by the best resolution function for the attribute.