Candidate: Gideon Berger
Advisor: Bud Mishra
Knowledge Discovery in Databases for Intrusion Detection, Disease Classification and Beyond
4:15 p.m., Monday, April 23, 2001
12th floor conference room, 719 Broadway

As the number of networked computers grows and the amount of sensitive information available on them grows as well there is an increasing need to ensure the security of these systems. The security of computer networks is not a new issue. We have dealt with the need for security for a long time with such measures as passwords and encryption. These will always provide an important initial line of defense. However, given a clever and malicious individual these defenses can often be circumvented. Intrusion detection is therefore needed as another way to protect computer systems. This thesis describes a novel three stage algorithm for building classification models in the presence of non-stationary, temporal, high dimensional data, in general, and for detecting network intrusion detections, in particular. Given a set of training data records the algorithm begins by identifying "interesting'' temporal patterns in this data using a modal logic. This approach is distinguished from other work in this area where frequent patterns are identified. We show that when frequency is replaced by our measure of "interestingness'' the problem of finding temporal patterns is NP-complete. We then offer an efficient heuristic approach that has proven effective in experiments. Having identified interesting patterns, these patterns then become the predictor variables in the construction of a Multivariate Adaptive Regression Splines (MARS) model. This approach will be justified, after surveying other methods for solving the classification problem, by its ability to capture complex nonlinear relationships between the predictor and response variables which is comparable to other heuristic approaches such as neural networks and classification trees, while offering improved computational properties such as rapid convergence and interpret-ability. After considering a variety of approaches to the problems of over-fitting which is inherent when modeling high dimensional data and non-stationarity, we describe our approach to addressing these issues through the use of truncated Stein shrinkage. This approach is motivated by showing the inadmissibility of the maximum likelihood estimator (MLE) in the high dimensional (dimension >= 3) data. We then discuss the application of our approach as participants in the 1999 DARPA Intrusion Detection Evaluation where we were able to exhibit the benefits of our approach. Finally, we suggest another area of research where we believe that our work would meet with similar success, namely, the area of disease classification.