CSC2515, Fall 2005: Some project ideas
--------------------------------------

Regression & Classification Algorithms

-- In regression/classification, using a Gaussian prior on the weights is equivalent (in some sense) to adding noise to the *inputs* of the training data. Compare this approach with the approach of adding noise to the *outputs* of the training data. For example, compare using a quadratic ridge penalty on the weights in logistic regression to another way of regularizing it, namely adding an "epsilon ghost" of each training case: a new case with a very small weight, the same inputs and the OPPOSITE output. Or try the analogous thing for regression, by adding Gaussian noise to the outputs. Can you prove anything about this in the linear case?

-- Compare the performance of lasso, leaps-and-bounds for variable subset selection (Furnival-Wilson), regular ridge regression (with the ridge penalty chosen using cross validation) and one more regularization method of your choosing on several regression benchmarks.

-- Come up with a way to do variable selection when there are too many dimensions to use leaps and bounds efficiently. Try out your new method and compare it to lasso, and maybe to thresholding the size of the weights from logistic regression/LDA/linear least squares.

-- Investigate using an absolute value penalty on the weights for logistic regression and/or neural networks for classification. On what kinds of data does this improve/worsen performance relative to an L2 penalty?

-- Do a study of how to make decision trees robust to noise in the training data.

-- Investigate potential improvements or difficulties caused by preprocessing the inputs to a classification/regression task using PCA, geometric hashing, or other dimensionality reduction techniques.

Unsupervised Learning

-- Implement mixtures of fully observed trees and compare them to a single fully observed tree on modeling some data, e.g. word counts from documents.

-- Investigate a mixture-of-factor-analyzers-type model in which there are a total of M factors, each one a vector of length equal to the dimension of the observations. However, for each training case add a "lasso-type" regularization penalty on the coefficients z used to model that training case. This will force the model to set many latent variables to zero when explaining any given data case. Derive the inference rule for this model, and some way of doing learning. Apply it to at least one simple example (e.g. handwritten digits or faces). This is an interesting example of sparsifying factor analysis, which I don't think anyone has looked at before.

-- Implement "unobserved naive Bayes", which assumes that conditioned on a discrete, unobserved cluster variable all the observed variables are conditionally independent. Apply it to some clustering problems.

-- Compare various methods for selecting the number of clusters or mixture components in a discrete clustering setup. You can also look at choosing the number of experts in a conditional mixture of experts, or the number of dimensions for dimensionality reduction by continuous latent variable models. Be very careful to define how you will measure success or failure.

-- Investigate some methods for learning a distance metric to be used in clustering. As above, be very careful to define how you will measure success or failure.

-- Look into "semi-supervised" learning, in which you have a lot of unlabelled training data and a small amount labelled for classification/regression. How can the unlabelled data help you?
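As one concrete starting point for the semi-supervised item above, here is a minimal self-training sketch in Python: fit a classifier on the labelled cases, then repeatedly add the unlabelled cases it classifies most confidently as pseudo-labelled training data. The logistic regression, the synthetic two-Gaussian data and the 0.9 confidence threshold are illustrative assumptions, not part of the project description.

import numpy as np

def fit_logreg(X, y, lr=0.1, iters=500):
    """Plain batch gradient descent on the logistic loss (bias via an appended column of 1s)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def self_train(X_lab, y_lab, X_unlab, rounds=5, threshold=0.9):
    """Grow the labelled set with confidently pseudo-labelled unlabelled cases, then refit."""
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(rounds):
        if len(pool) == 0:
            break
        w = fit_logreg(X, y)
        p = predict_proba(w, pool)
        confident = (p > threshold) | (p < 1.0 - threshold)
        if not confident.any():
            break
        # Move the confidently classified cases into the labelled set with pseudo-labels.
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, (p[confident] > 0.5).astype(float)])
        pool = pool[~confident]
    return fit_logreg(X, y)

# Tiny synthetic demo: two Gaussian classes, 10 labelled and 200 unlabelled cases.
rng = np.random.RandomState(0)
X0 = rng.randn(105, 2) + [-2.0, 0.0]
X1 = rng.randn(105, 2) + [2.0, 0.0]
X_lab = np.vstack([X0[:5], X1[:5]])
y_lab = np.array([0.0] * 5 + [1.0] * 5)
X_unlab = np.vstack([X0[5:], X1[5:]])
print("weights with self-training:", self_train(X_lab, y_lab, X_unlab))
print("weights from labelled data only:", fit_logreg(X_lab, y_lab))

Whether this actually beats training on the labelled cases alone, and on what kinds of data, is exactly the sort of question the project should answer.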
-- Do a study of outlier detection methods. This is in some sense the opposite of clustering: you want to find the examples in your dataset which lie in regions of particularly *low* density. Compare some methods on various kinds of data.

-- Consider methods which account for missing data, either for classification/regression or for clustering/dimensionality reduction. Can you demonstrate examples where dealing with missing data properly improves your performance on the final task?

Optimization/Data Structures

-- Quadratic Programming: Implement either the multiplicative updates described in F. Sha, L. K. Saul, and D. D. Lee (2002), "Multiplicative updates for nonnegative quadratic programming in support vector machines", Technical Report MS-CIS-02-19, University of Pennsylvania, or the LARS algorithm described in Efron, B., Johnstone, I., Hastie, T. and Tibshirani, R. (2003), "Least angle regression", Annals of Statistics, and compare these to at least one other fast way of solving the QP problem for support vector machines and Lasso (e.g. SMO, or "Simple SVM"). If you want to do more theoretical work, you could also derive the extension of the updates to the case where the data is not linearly separable in the feature space.

-- Implement a "limited memory BFGS" optimizer, and compare it to gradient descent with momentum and to conjugate gradients on two fronts: (a) the speed at which it minimizes the training error. It should crush gradient descent on speed, and might be a bit faster than CG. (b) the generalization quality of the local minima that it finds. Here you may discover that it is wildly overfitting and loses horribly to gradient descent, and perhaps mildly to CG.

-- Investigate the "split-and-merge" heuristic for improving the local minimum found by K-means, and compare it to the furthest-first heuristic and to random initializations.

-- Investigate the "look-ahead-one-step" heuristic for online optimization of K-means, and compare it to the Lloyd-Max updates and to soft updates (as in EM for mixtures).

-- Investigate speeding up KNN classification algorithms or Locally Weighted Regression algorithms and reducing their memory load. Compare the speed/memory of using a KD-tree or Ball-tree with using either naive storage/search or linear projections of the original input data down to a low-dimensional subspace. You can compare against random projections, LDA or PCA.

Applications

-- Think of a way to formulate "denoising" as an unsupervised, semi-supervised or supervised machine learning problem. Do some experiments with denoising speech and/or images and compare to more traditional signal processing approaches such as Wiener filtering or spectral subtraction.

-- Look into clustering as a way of automatically organizing emails or bookmarks into folders. Compare with folders chosen by humans and with those generated by simple rule-based (e.g. keyword) systems.

-- Do vector quantization on a few adjacent columns of spectrograms made from clean speech of various speakers. Train one VQ per person, and use these to do speaker identification by seeing which VQ best reconstructs a test spectrogram. See me if you want some multi-person speech data.

-- Train a simple HMM on character sequences from text. Train one HMM per language and use them to do language identification. Compare with a simple Markov Model.

-- Train a simple HMM on amino acid sequences from proteins. Train one HMM per protein family and use them to do protein identification. Compare with a simple Markov Model.
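For the HMM items above and below, the "simple Markov Model" baseline is cheap to put together. Here is a minimal Python sketch of that baseline for language identification: one first-order character bigram model per language, with classification by the log-likelihood of the test string. The toy training sentences, the add-one smoothing, and the alphabet size of 27 (lowercase letters plus space) are illustrative assumptions.

import math
from collections import Counter

def train_markov(text):
    """First-order (bigram) character counts: transition counts and per-character totals."""
    pairs = list(zip(text, text[1:]))
    return Counter(pairs), Counter(a for a, _ in pairs)

def log_likelihood(model, text, alphabet=27):
    """Add-one smoothed log-probability of the character transitions in `text`."""
    bigrams, totals = model
    ll = 0.0
    for a, b in zip(text, text[1:]):
        ll += math.log((bigrams[(a, b)] + 1.0) / (totals[a] + alphabet))
    return ll

# Toy demo: "train" one model per language on a single sentence, then classify a test string.
training_text = {
    "english": "the quick brown fox jumps over the lazy dog and then runs away",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego huye",
}
models = {lang: train_markov(text) for lang, text in training_text.items()}
test = "the dog runs over the fox"
scores = {lang: log_likelihood(m, test) for lang, m in models.items()}
print(max(scores, key=scores.get), scores)

A real comparison would of course train on much more text per language; the HMMs you build should be judged against exactly this kind of baseline.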
-- Train an HMM on some symbolic representation of computer network traffic. Train one HMM for normal activity and another one for security break-ins, and use them to do hacker detection. Compare with a simple Markov Model.

-- Train some HMMs on a MIDI representation of music, and use them to classify the genre of music. Compare with a simple Markov Model.

-- Use a "profile HMM" to perform multiple alignment on some discrete sequences.

Kernel Methods/Boosting

-- Look into the use of boosting/bagging/stacking for feature selection. The basic idea here is to make weak models which each focus on only one (or a small number of) features, combine them with boosting or bagging or stacking, and look at which models receive high weight. The features corresponding to those models are the ones which are selected. (See the sketch at the end of this list.)

-- Consider methods for adapting the kernel when training a support vector machine, e.g. learning the variances of a Gaussian kernel.

-- Investigate a "kernel lasso" in which there is an L1 penalty on the effective weights in the high-dimensional feature space. This is different from an L1 penalty on the weights on the data cases.
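To make the boosting-for-feature-selection item concrete, here is a minimal Python sketch of one version of it: AdaBoost with single-feature decision stumps, followed by a tally of the total weight (alpha) assigned to stumps on each feature. The synthetic data, in which only the first two features carry any signal, and the choice of 20 boosting rounds are illustrative assumptions, not part of the project description.

import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the single-feature threshold stump with lowest weighted error."""
    best = (0, 0.0, 1.0, np.inf)               # (feature, threshold, sign, weighted error)
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] > thresh, 1.0, -1.0)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thresh, sign, err)
    return best

def adaboost_feature_weights(X, y, rounds=20):
    """Run AdaBoost with stumps on labels in {-1,+1}; return the total alpha per feature."""
    n, d = X.shape
    w = np.ones(n) / n
    feature_weight = np.zeros(d)
    for _ in range(rounds):
        j, thresh, sign, err = best_stump(X, y, w)
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thresh, 1.0, -1.0)
        # Reweight the cases: misclassified cases get more weight in the next round.
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        feature_weight[j] += alpha
    return feature_weight

# Toy demo: 6 features, but the labels depend only on features 0 and 1.
rng = np.random.RandomState(0)
X = rng.randn(200, 6)
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)
print(adaboost_feature_weights(X, y))

Features 0 and 1 should receive nearly all of the weight here; a real project would compare this ranking with the ones produced by bagging or stacking, and with simply thresholding the weights of a linear model.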