Statistical Analysis of Scientific and Other Datasets

Astrophysics and Astrometry

I collaborate with David Hogg and others at NYU on analysis and modeling of astrophysical data as well as on a project to do automated astrometry.
  • We have a paper on modeling the velocities of stars (using Hipparcos data) based only on measurements of their tangential velocities.
  • We have an NSF ITR grant to explore the possibility of doing fully Automated Astrometry.
  • We are looking at using various non-negative machine learning models for learning the basis functions of galaxy spectra and star formation histories.

Bioinformatics

I collaborate with Brendan Frey's group in ECE, with Radford Neal in Statistics and with Tim Hughes' and Andrew Emili's laboratories in Banting and Best on the analysis of biological data, including DNA microarrays, mass-spec/HPLC traces, protein-protein interaction networks, discrete sequences of amino acids or base pairs, etc. This is a new area for me, and it is pretty busy with other machine learning people, but so far it has been fun.

  • Continuous Profile Model Code
    Jennifer Listgarten has posted some code for the CPM model here.
  • ECCB paper on biomarker discovery, Difference detection in LC-MS data for protein biomarker discovery.
    with Jenn Listgarten, Radford Neal and our experimental collaborators Peter Wong and Andrew Emili.
    (See also Jenn's page here for more details and the data from the paper.)
  • Latent Trace Modeling of Simultaneous LC-MS Data
  • Classification of Yeast Mutants based on Microarray Data
    • Our bioinformatics group has collaborated with Tim Hughes' lab in Toronto by providing some results on classifying yeast mutants based on microarray data. This analysis led to identification of several previously uncharacterized yeast proteins involved in the biogenesis of noncoding RNA.
      The paper, A Panoramic View of Noncoding RNA Processing, appeared in volume 113 of Cell, with Wen-Tao Peng as first author. Congratulations to Wen and Tim's lab for an excellent job collecting this data.
  • Other Bioinformatics Interests
    • Andrew Emili has gotten me, Nati Srebro and Ben Marlin interested in simultaneously analyzing both physical (TAP-TAG) interactions and functional (double knockout) interactions between proteins to discover the complexes they form and the functional pathways those complexes are involved in.
    • Tim Hughes, Quaid Morris and others have got me interested in protein-protein interaction networks. We are looking at data (mostly TAP-TAG stuff so far) from Tim's lab and others, including data obtained during the assembly of the ribosome.
    • Discrete Sequence Alignment
      I've played a bit with Profile HMMs for multiple sequence alignment. Here is a rough ASCII or PS output of some toy results.
      In these listings, the X's on the top indicate template positions learned by the model, the colon (:) indicates deletion, capital letters indicate insertion, lowercase letters are template matches (or substitutions), and the dash (-) is just a null symbol to make the template matches line up nicely in the same column.
      The training data was a set of 60 completely unannotated DNA sequences, which you could recover by removing all colons and dashes and downcasing each row.
    • Analysis of DNA microarrays from mouse organs and yeast mutants.
    • In the future, possibly with my student Jennifer Listgarten, I am interested in looking at SNP data, preferably from humans.
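
As a small illustration of the listing format described under Discrete Sequence Alignment above, here is a minimal Python sketch that recovers a raw training sequence from one aligned row (remove colons and dashes, then downcase) and tallies the symbol types. The function names and the example row are hypothetical and made up for illustration; they are not taken from the actual listings or from any code used to produce them.

```python
from collections import Counter


def recover_sequence(aligned_row: str) -> str:
    """Recover the unannotated sequence from one aligned row.

    Colons (deletions) and dashes (null padding) carry no residue,
    so they are dropped; everything left is lowercased.
    """
    return aligned_row.replace(":", "").replace("-", "").lower()


def classify_symbols(aligned_row: str) -> Counter:
    """Count how many symbols of each kind appear in an aligned row."""
    counts = Counter()
    for ch in aligned_row:
        if ch == ":":
            counts["deletion"] += 1      # template position skipped
        elif ch == "-":
            counts["null"] += 1          # padding so columns line up
        elif ch.isupper():
            counts["insertion"] += 1     # residue outside the template
        elif ch.islower():
            counts["match"] += 1         # template match or substitution
    return counts


if __name__ == "__main__":
    row = "ac:gT-t"  # hypothetical aligned row, not real data
    print(recover_sequence(row))
    print(dict(classify_symbols(row)))
```

Only the insertions and matches contribute characters to the recovered sequence, which is why the number of those two symbol types adds up to its length.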

CVS Code Repositories

In collaboration with Greg Wilson, Kevin Laven, and an amazing summer undergrad research assistant, Keir Mierle, we analyzed the data from CVS logs of students working on undergrad CS course assignments. The results show surprisingly little correlation between CVS habits and performance in the course.
  • See our paper here.


Sam Roweis, Vision, Learning and Graphics Group, NYU, www.cs.nyu.edu/~roweis