Expert-Driven Validation of Set-Based Data Mining Results
Author: Gediminas Adomavicius
Advisor: Alexander Tuzhilin
Co-advisor: Ernest Davis

Abstract

This dissertation addresses the problem of dealing with large numbers of set-based patterns, such as association rules and itemsets, discovered by data mining algorithms. Since many discovered patterns may be spurious, irrelevant, or trivial, one of the main problems is how to validate them, i.e., how to separate the ``good'' rules from the ``bad.'' Many researchers have advocated the explicit involvement of a human expert in the validation process. However, scalability becomes an issue when large numbers of patterns are discovered, since the expert cannot validate them pattern by pattern in a reasonable period of time. To address this problem, this dissertation describes a new expert-driven approach to set-based pattern validation.

The proposed approach is based on validation sequences: we rely on the expert's ability to iteratively apply validation operators, each of which can validate multiple patterns at a time, thus making expert-based validation feasible. We identified a class of scalable set predicates, called cardinality predicates, and demonstrated how these predicates can be used effectively in the validation process as a basis for validation operators. We examined various properties of cardinality predicates, including their expressiveness. We have also developed and implemented the set validation language (SVL), which a domain expert can use to specify cardinality predicates manually. In addition, we have proposed and developed a scalable algorithm for set and rule grouping that can be used to generate cardinality predicates automatically.
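The idea of a validation sequence can be illustrated with a minimal sketch. The abstract does not define cardinality predicates, so the sketch assumes one plausible reading: a predicate bounds how many items a pattern shares with each of several reference sets. The names `CardinalityPredicate` and `apply_validation_sequence`, the "accept"/"reject" labels, and the sample itemsets are all hypothetical and are not taken from the dissertation or from SVL.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Itemset = FrozenSet[str]


@dataclass(frozen=True)
class CardinalityPredicate:
    # ASSUMPTION: each entry is (reference_set, lower, upper), and the
    # predicate holds when lower <= |itemset & reference_set| <= upper
    # for every entry. The dissertation's exact definition may differ.
    bounds: Tuple[Tuple[FrozenSet[str], int, int], ...]

    def __call__(self, itemset: Itemset) -> bool:
        return all(lo <= len(itemset & ref) <= hi for ref, lo, hi in self.bounds)


def apply_validation_sequence(
    patterns: List[Itemset],
    operators: List[Tuple[CardinalityPredicate, str]],
) -> Tuple[List[Itemset], List[Itemset], List[Itemset]]:
    """Apply (predicate, label) operators in order to the unvalidated patterns.

    Each operator labels every remaining pattern that satisfies its
    predicate, so one expert decision covers many patterns at once.
    Returns (accepted, rejected, still_unvalidated).
    """
    accepted: List[Itemset] = []
    rejected: List[Itemset] = []
    remaining = list(patterns)
    for predicate, label in operators:
        matched = [p for p in remaining if predicate(p)]
        remaining = [p for p in remaining if not predicate(p)]
        (accepted if label == "accept" else rejected).extend(matched)
    return accepted, rejected, remaining


# Hypothetical itemsets and expert decisions, for illustration only.
patterns = [
    frozenset({"milk", "bread"}),
    frozenset({"milk", "beer"}),
    frozenset({"beer", "chips"}),
    frozenset({"tea"}),
]
operators = [
    # Reject every pattern that contains "beer".
    (CardinalityPredicate(((frozenset({"beer"}), 1, 1),)), "reject"),
    # Accept every remaining pattern with 1 or 2 items from {"milk", "bread"}.
    (CardinalityPredicate(((frozenset({"milk", "bread"}), 1, 2),)), "accept"),
]
accepted, rejected, remaining = apply_validation_sequence(patterns, operators)
```

Under these assumptions, two operators settle three of the four patterns, and only the pattern matched by neither predicate is left for further validation.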

The dissertation also explores various theoretical properties of sequences of validation operators, facilitating a better understanding of the validation process. We have also addressed the problem of finding optimal validation sequences and have shown that certain formulations of this problem are NP-complete. In addition, we have provided some heuristics for addressing this problem.

Finally, we have tested our rule validation approach on several real-life applications, including personalization and bioinformatics applications.