Marcin Mejran and Bud Mishra
(1) Stuyvessant High School
(2)Professor of Computer Science & Mathematics, Courant Institute, NYU
(3) Professor, Watson School, CSHL
Abstract
In this abstract, we explore the power of a novel single molecule "optical sequencing" method (Schwartz & Mishra, US Patent, 2001) in the context of a simple application, where a large number of cDNAs on a surface are "sequenced" in parallel and rapidly. The analyses, computational experiments and simulations show that the underlying technology is quite robust in the following sense: Even when the basic steps of "optical sequencing" are assumed to be imperfect and involve only a reasonably small number of "cycles," one can still accurately identify a known RNA (stored in a database) and reliably detect fragments of as-yet-unidentified RNA sequences. The results reported here suggest that the "optical sequencing" method can be successfully used for gene-expression analysis by effectively parallelizing the classical SAGE techniques at a massive scale, while providing many significant advantages and flexibility over the traditional "gene-chips."
Most innovatively, the underlying research develops a mathematical model of this novel single-molecule sequencing device by incorporating realistic error sources, which this device encounters in reading a sequence of nucleotides. These error models include the classical error sources such as insertion, deletion and substitution (misincorporation) as well as the errors caused by failure to read a "run of identical bases" (e.g., poly-A's). This probabilistic model describes the quality of data required from the optical sequencing device in order that it can be used to study gene expression products from a cell and thus create an accurate measurement of the transcriptional state of the cell. We tested and verified this model by experiment using real-life and simulated data. Furthermore, based on the above model, an efficient algorithm is described with pre-processing for querying a genomics database and we checked it against available mRNA database. The results of the tests confirmed the validity of the model and algorithm.
We ran several experiments to verify the theoretical predictions (Theorem 2 in the full paper), using a publicly available nucleotide mRNA database [National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/]. All the experiments were run on a 600 MHz PIII, with 512MB of RAM under Windows 2000 Pro. We converted the above database from FASTA format to MS SQL Server database format and "collapsed" the database. The database consists of 21725 elements and the average length of an element after "collapsing" is 1655. The real data experiment convincingly demonstrated that the collapsed database could be used for practical molecular sequence matching. We have also explored alternative models where an intensive preprocessing step is used to speed up the subsequent steps.
While the current analysis is focused primarily on a single-molecule technology that we are most familiar with, the basic techniques are rather general and apply equally well to other similar approaches: e.g., "polony" sequencing, one proposed by Solexa and the techniques based on measuring single polymerase's activities.