Yi Zhou, Archisman Rudra, Salvatore Paxia and Bud Mishra
Abstract
Modern biology is driven by large-scale processing of heterogeneous data, which may come from diverse sources. This could be anything from a Genbank sequence to the result of a microarray experiment. The interfaces used to access these different sources vary widely, so a biologist needs to be an expert in very different areas of computer science: databases, networking, programming languages and so on. Furthermore, the algorithms used to extract biologically significant information tend to be developed in an ad hoc manner. This leads to very little code sharing between data analysis algorithms, with a concomitant increase in code complexity. Instead of developing each tool ab initio, our bioinformatic system VALIS defines low-level building blocks and uniform APIs that make these blocks usable from high-level scripting languages. This enables biologists to write very simple scripts that perform fairly involved bioinformatic processing in a flexible fashion.
As an example we use the VALIS system to investigate DNA evolution properties. In particular, we measure the long-range correlation (LRC) of DNA sequences. Our approach is to estimate the degree of DNA sequence LRC in the hope of distinguishing between different models of DNA evolution.
To study DNA sequence LRC, we view DNA sequences as being generated by a random walk model. We first map a whole genomic sequence using a purine-pyrimidine binary rule: purines (A/G) become +1 and pyrimidines (C/T) become -1. This creates a 'DNA walk' along the genome. If there is no LRC, the walk is a realization of a Brownian motion. Otherwise, we observe a 'walker' with long-term memory and thus a Fractional Brownian motion. These two processes can be characterized by different values of the Hurst exponent (H), which can be estimated using different methods (for example, R/S analysis and detrended fluctuation analysis). In the limit of infinite sequence length, H=0.5 for Brownian motion, and 0<H<1, H != 0.5 for Fractional Brownian motion. A higher H (>0.5) suggests the presence of a positive feedback mechanism, whilst a lower H (<0.5) suggests a negative feedback mechanism.
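The mapping and estimation steps above can be sketched as follows. This is a minimal illustration in plain Python, not VALIS code: the function names (`dna_steps`, `dna_walk`, `hurst_rs`), the choice to skip ambiguous bases such as N, and the window-doubling scheme are all assumptions made for the example. It uses classical R/S analysis, one of the two estimators named above, and fits H as the slope of log(R/S) against log(window size).

```python
import math

PURINES = set("AG")
PYRIMIDINES = set("CT")

def dna_steps(seq):
    """Map purines (A/G) to +1 and pyrimidines (C/T) to -1.
    Assumption: any other symbol (e.g. N) is skipped."""
    steps = []
    for base in seq.upper():
        if base in PURINES:
            steps.append(1)
        elif base in PYRIMIDINES:
            steps.append(-1)
    return steps

def dna_walk(steps):
    """Cumulative sum of the steps: the walker's position along the genome."""
    walk, pos = [], 0
    for s in steps:
        pos += s
        walk.append(pos)
    return walk

def rescaled_range(window):
    """R/S statistic of one window: range of the cumulative deviations
    from the window mean, divided by the window standard deviation."""
    n = len(window)
    mean = sum(window) / n
    dev, lo, hi, sq = 0.0, 0.0, 0.0, 0.0
    for x in window:
        dev += x - mean
        lo = min(lo, dev)
        hi = max(hi, dev)
        sq += (x - mean) ** 2
    std = math.sqrt(sq / n)
    return (hi - lo) / std if std > 0 else None

def hurst_rs(steps, min_window=16):
    """Estimate H as the least-squares slope of log(mean R/S) vs log(window size).
    Window sizes double from min_window up to half the sequence length."""
    xs, ys = [], []
    size = min_window
    while size <= len(steps) // 2:
        values = []
        for start in range(0, len(steps) - size + 1, size):
            rs = rescaled_range(steps[start:start + size])
            if rs is not None:  # constant windows have no defined R/S
                values.append(rs)
        if values:
            xs.append(math.log(size))
            ys.append(math.log(sum(values) / len(values)))
        size *= 2
    if len(xs) < 2:
        raise ValueError("sequence too short for an R/S fit")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

For example, `hurst_rs(dna_steps(sequence))` returns a single estimate for a whole sequence. On finite sequences R/S estimates are known to be biased (typically somewhat above 0.5 for uncorrelated data), so values are best compared between regions of similar length rather than read as absolute.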
We have analyzed various genomes using VALIS: bacterial, invertebrate and vertebrate. We observe a consistently lower H value in the coding regions than in the non-coding regions. Thus, the DNA walk along the coding region sequences behaves more like a Brownian motion (H ~ 0.5), while in the non-coding regions it acts more like a Fractional Brownian motion (H>0.5). The higher H values in non-coding regions indicate that the sequences in the non-coding regions possess stronger positive LRC than those in the coding regions. In addition, the H values in different regions increase with the evolutionary positions of the corresponding organisms. This suggests that there are some cellular events that tend to make DNA sequences more positively correlated as evolution proceeds.
Based on our observations, we hypothesize that the differences in the strength of LRC in DNA sequences are caused by the counteraction of two sets of biological events. One set includes insertion and deletion events caused by DNA polymerase stuttering, transposons and recombination, which tend to increase DNA long-range correlation. The other set includes natural selection and DNA repair mechanisms, which tend to eliminate the LRC caused by the former events. The coding regions, in addition, possess the transcription-coupled DNA repair mechanism that is unique to them, and are under higher natural selection pressure. The cellular processes mentioned above form a spectrum of events impacting DNA evolution. Differences in the distribution of this spectrum between coding and non-coding regions, and between organisms, cause the differences in the degree of LRC in DNA sequences. The hypothesis can be tested 'in silico' using simulation following Pólya's urn model, and verified 'in vivo' using microbial mutants in the corresponding cellular processes.
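As a rough illustration of the kind of positive feedback a Pólya's urn model provides, the sketch below draws +1/-1 steps from a classic two-colour urn in which each drawn colour is reinforced. How the authors map urn draws onto insertion, deletion or repair events is not specified here, so the function name `polya_urn_steps`, its parameters and the interpretation of the draws as walk steps are assumptions for the example only.

```python
import random

def polya_urn_steps(n, reinforcement=1, initial=(1, 1), rng=None):
    """Draw n steps (+1 or -1) from a two-colour Polya urn.

    initial       -- starting counts of (+1 balls, -1 balls)
    reinforcement -- balls of the drawn colour added after each draw;
                     0 gives independent draws with fixed probabilities
    rng           -- optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    plus, minus = initial
    if plus + minus <= 0:
        raise ValueError("the urn must start with at least one ball")
    steps = []
    for _ in range(n):
        # Probability of drawing +1 is the current fraction of +1 balls.
        if rng.random() < plus / (plus + minus):
            steps.append(1)
            plus += reinforcement   # positive feedback on +1
        else:
            steps.append(-1)
            minus += reinforcement  # positive feedback on -1
    return steps
```

Setting `reinforcement=0` gives a memoryless baseline against which a reinforced run can be compared, for instance by estimating the Hurst exponent of both step sequences with the same estimator.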
The 'in silico' evolution simulation can be realized using our simulation tool 'Genome Grammar'. 'Genome Grammar' is a stochastic grammar with primitives for many kinds of probability distributions. We can even generate a sequence with the same probability distribution as that measured from biological data. Furthermore, there are tools that let one apply hypothesized processes to sequences obtained from the grammar. This enables us to test any model and conduct evolutionary experiments 'in silico'.
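The idea of generating a sequence that follows a distribution measured from biological data can be illustrated with the simplest possible case: single-base frequencies. The sketch below is not the Genome Grammar interface, which is not described here; `base_frequencies` and `sample_sequence` are hypothetical helper names, and an independent, per-base model is assumed purely for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"

def base_frequencies(seq):
    """Measure the relative frequency of A, C, G and T in a sequence.
    Assumption: symbols other than A/C/G/T are ignored."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return {b: counts[b] / total for b in BASES}

def sample_sequence(freqs, length, rng=None):
    """Generate a random sequence whose bases are drawn independently
    with the given frequencies."""
    rng = rng or random.Random()
    bases = list(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))
```

A sequence produced this way has no long-range correlation by construction, so it can serve as a null model before any hypothesized process is applied to it.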