Statistical Analysis of Scientific and Other Datasets
Astrophysics and Astrometry
I collaborate with
David Hogg and others at
NYU on analysis and modeling of astrophysical data as well as on a
project to do automated astrometry.
- We have a paper
on modeling the velocities of stars (using
Hipparcos data) based only on measurements of their tangential velocities.
- We have an NSF ITR grant to explore the possibility of doing fully
- We are looking at using various non-negative machine learning
models for learning the basis functions of galaxy spectra and star
I collaborate with Brendan Frey's group in ECE, with Radford Neal in
Statistics and with Tim Hughes' and Andrew Emili's laboratories in
Banting and Best on the analysis of biological data, including DNA
microarrays, mass-spec/HPLC traces, protein-protein interaction
networks, discrete sequences of amino acids or base pairs, etc. This
is a new area for me, and it is pretty busy with other machine
learning people, but so far it has been fun.
- Continuous Profile Model Code
Jennifer Listgarten has posted some code
for the CPM model here.
- ECCB paper on biomarker discovery,
Difference detection in LC-MS data for protein biomarker
with Jenn Listgarten, Radford Neal
and our experimental collaborators Peter Wong and Andrew Emili.
(See also Jenn's page
here for more
details and the data from the paper.)
- Latent Trace Modeling of Simultaneous LC-MS Data
- Classification of Yeast Mutants based on Microarray Data
- Our bioinformatics group has
collaborated with Tim Hughes' lab in Toronto
by providing some results on classifying yeast mutants
based on microarray data. This analysis
led to identification of several previously uncharacterized
yeast proteins involved in the biogenesis of noncoding RNA.
A Panoramic View of Noncoding RNA Processing,
appeared in volume 113 of Cell, with Wen-Tao Peng as first author.
Congratulations to Wen and Tim's lab for an excellent job collecting
- Other Bioinformatics Interests
- Andrew Emili has gotten me, Nati Srebro and Ben Marlin interested
in simultaneously analyzing both physical (TAP-TAG) interactions and
functional (double knockout) interactions between proteins to discover
the complexes they form and the functional pathways those complexes
are involved in.
- Tim Hughes, Quaid Morris and others have got me interested in
and protein-protein interaction networks. We are looking at data
(mostly TAP-TAG stuff so far) from Tim's lab and others including
data obtained during the assembly of the ribosome.
- Discrete Sequence Alignment
I've played a bit with Profile HMMs for multiple sequence
alignment. Here is a rough ASCII or
PS output of some toy results.
In these listings, the X's on the top indicate template positions
learned by the model, the colon (:) indicates deletion,
capital letters indicate insertion, lowercase letters are template
matches (or substitutions), and the dash (-) is just a null symbol
to make the template matches line up nicely in the same column.
The training data was a set of 60 completely unannotated DNA
sequences, which you could recover by removing all colons and dashes
and downcasing each row.
- Analysis of DNA microarrays from mouse organs and yeast
- In the future, possibly with my student Jennifer Listgarten,
I am interested in looking at SNP data, preferably from humans.
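The recovery step described above for the Profile HMM listings (remove all colons and dashes, then downcase each row) can be sketched in a few lines of Python. This is only an illustration: the function name `recover_sequence` and the two example rows are made up here and are not taken from the actual listings, and the sketch assumes the row of X's marking template positions has already been set aside.

```python
def recover_sequence(row: str) -> str:
    """Recover an original training sequence from one aligned row.

    Colons (deletions) and dashes (null padding) carry no sequence
    content, so they are dropped; capital letters (insertions) are
    lowercased so that they match the unannotated input.
    """
    return row.replace(":", "").replace("-", "").lower()


# Hypothetical aligned rows, invented for illustration only.
example_rows = ["ac:Gt-a", "a-cT:ta"]
recovered = [recover_sequence(row) for row in example_rows]
```

Applied to every sequence row of a listing, this should give back the set of unannotated training sequences.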
CVS Code Repositories
In collaboration with Greg Wilson, Kevin Laven and an amazing summer
undergrad research assistant, Keir Mierle, we analyzed the
data from CVS logs of students working on undergrad CS course
assignments. The results show surprisingly little correlation between
CVS habits and performance in the course.
Sam Roweis, Vision, Learning and Graphics Group,