Fast String Kernels for Discriminative Protein Classification
Wednesday, May 7, 2003
Host: Richard Cole, firstname.lastname@example.org, 212-998-3119
One of the central problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Traditional approaches to this task include alignment-based algorithms and probabilistic models like profile hidden Markov models for protein families. However, these methods are less successful for remote homology detection, where one wants to predict a structural relationship between sequences that have diverged over a long evolutionary distance.
Our approach to protein classification and remote homology detection is to use modern discriminative methods from machine learning, namely support vector machine classifiers (SVMs), combined with kernels specialized for biological sequence data. To this end, we introduce new families of string kernels that incorporate biologically motivated notions of inexact string matching. For example, we define mismatch kernels that measure sequence similarity based on shared occurrences of fixed length subsequences, counted with mismatches. In other models, we use restricted gaps, probabilistic substitutions, or wildcards. We show how to compute these kernels efficiently, so that the kernel evaluation scales linearly with sequence length, and how we can achieve linear time prediction on test sequences when the kernels are used with SVMs. Experimental results on a large benchmark data set show that our kernels, used with SVM classifiers, perform competitively with state-of-the-art methods for remote homology detection while achieving considerable computational savings.