Optical Map Based Sequence Validation of Microbes

Marco Antoniotti, Thomas Ananthraman, Violet Chang, David Schwartz and Bud Mishra

Abstract

The research activity of NYU bioinformatics groups is centered on the algorithms for mapping and sequencing projects and has been focused on the specification languages, environments and systems for bioinformatics. Over the last eight months, we have used the bioinformatics system to develop a suite of mathematical models and associated algorithms for the Validation, Alignment, and Restriction Fragments Translocation Detection and Correction using publicly available optical mapping data and related sequence data. These problems are ideal for testing out the suitability of our software system as it addresses new challenges posed by the availability of vast amount of biological data, especially in the form of DNA sequences. In particular, the Ordered Restriction Mapping bioinformatics manipulation tools, we have developed, serve the dual purpose of improving the DNA sequencing efforts and providing new analysis capabilities that can be derived from the maps themselves directly.

The Validation algorithm we have developed for Ordered Restriction Maps serves to establish the quality of an assembled DNA sequence by comparing it with to an Ordered Restriction Map (e.g. a map obtained via the Optical Mapping Process). The core procedure of the Validation algorithm takes a DNA sequence (retrievable from a variety of sources) and an Ordered Restriction Map (called the "consensus" map). From the DNA sequence, an "in silico" ordered restriction map is obtained (called the "sequence" map). The sequence map is aligned against the original map using a sophisticated dynamic programming formulation. The procedure constructs several alignments of the sequence map against the original map. Each such alignment is ranked according to the value of the computation of a Maximum Likelihood Estimate of a statistical model that takes into account several error sources of the underlying biochemical process. For the Optical Mapping process the error sources considered are

The (Multiple) Alignment algorithm takes as input an Ordered Restriction (Consensus) Map and a set of (small) sequence contigs. Each sequence contig is run through the Validation subsystem and its possible alignments are computed and set aside for future post-processing. Subsequently the Alignment algorithm constructs a putative anchoring of every contig on the consensus map, by selecting one alignment per contig. The selection procedure represents a trade-off among the following criteria Objective 2 and 3 are contradictory; hence we developed a Lagrangian-like approximation scheme that weighs one or the other criteria under user control. The first criteria may be relaxed in order to look for overlaps that were possibly missed by the contig sequencing and assembly procedure. As our preliminary analysis led us to believe that the problem of construction is likely to be computationally infeasible, we developed two procedures that approximate its construction. The first is a simple Greedy approach that has worked well in our experiments, and the second is an iterative 1D Dynamic Programming procedure that minimizes a weighted cost function subject the constraints 1 to 3 above.

The Restriction Fragment Translocation (RFT) Detection and Correction algorithm reuses the basic Validation algorithm by considering a consensus map, a sequence map and all the sub-sequences of the sequence map. The validation algorithm is run on the N^2 sub-maps of the sequence map, in order to determine whether any of the sub-maps can be anchored in a different position than the one assigned by the sequence map alignment. The result of the RFT algorithm is an ordered set of rearrangements of the sequence sub-maps.

The three Ordered Restriction Maps algorithms (Validation, Alignment, and RFT Detection/Correction) do not work in isolation. While we developed the mathematical and statistical models that constitute the core of the three algorithms, we also developed a software infrastructure integrating the components and based on a DataBase. To achieve the software integration, we developed the specification of file exchange formats and several auxiliary programs used in a variety of ways (e.g. a sequences and maps "simulator" which can be used to generate in silico sequences and maps of various complexity and structure). The three algorithms produce large data sets. In order to navigate the data sets in a more interactive way, we developed two specialized viewers called "CONVex" and "genscape." "genscape" is an evolution of "CONVex." The viewers interact with the underlying infrastructure. The main idea behind the two viewers is to provide a zoomable view of the set of alignment, and to enable the user to inspect the displayed maps (consensus and sequence) at a fine detail level. The viewers are also available as libraries and have been integrated in the VALIS system. All of this software infrastructure has been made available publicly through the Internet and has also been made specifically available to University Wisconsin.

The three algorithms have been tested in a variety of ways. In particular we concentrated on analyzing the P. falciparum parasite (the Malaria agent): an organism for which there are both published sequences and published Ordered (Optical) Restriction Maps. We downloaded the known sequences of the P. falciparum parasite's 14 Chromosomes from the PlasmoDB online database (www.plasmodb.org). Only Chromosome 2 and 3 have been fully assembled so far. For the remaining 12 we only have sets of contigs, for which no known position along the respective chromosome is published. We ran the Validation algorithm on the P. falciparum chromosome 2 and 3 sequence data, against the Ordered (Optical) Restriction Maps. The results show very good agreement between the consensus map and sequence map. Hence we can conclude that both consensus and sequence maps are correct. For the remaining 12 P. falciparum chromosomes, we ran the Alignment algorithm and we were able to propose anchoring positions for all the contigs along the respective consensus maps. All these results are viewable at http://bioinformatics.cat.nyu.edu/valis under the "Projects" link.