Computer Science Department

Computer Science Colloquium

Fast String Kernels for Discriminative Protein Classification

Christina Leslie
Columbia University

Wednesday, May 7, 2003
11:00 a.m.
Room 1302 WWH
251 Mercer Street
New York, NY 10012-1185

Host: Richard Cole,, 212-998-3119
Colloquium Information:


One of the central problems in computational biology is the classification of protein sequences into functional and structural families based on sequence homology. Traditional approaches to this task include alignment-based algorithms and probabilistic models like profile hidden Markov models for protein families. However, these methods are less successful for remote homology detection, where one wants to predict a structural relationship between sequences that have diverged over a long evolutionary distance.

Our approach to protein classification and remote homology detection is to use modern discriminative methods from machine learning, namely support vector machine classifiers (SVMs), combined with kernels specialized for biological sequence data. To this end, we introduce new families of string kernels that incorporate biologically motivated notions of inexact string matching. For example, we define mismatch kernels that measure sequence similarity based on shared occurrences of fixed length subsequences, counted with mismatches. In other models, we use restricted gaps, probabilistic substitutions, or wildcards. We show how to compute these kernels efficiently, so that the kernel evaluation scales linearly with sequence length, and how we can achieve linear time prediction on test sequences when the kernels are used with SVMs. Experimental results on a large benchmark data set show that our kernels, used with SVM classifiers, perform competitively with state-of-the-art methods for remote homology detection while achieving considerable computational savings.