Yi Zhou, Archisman Rudra, Salvatore Paxia and Bud Mishra
Abstract
Modern biology is driven by large-scale processing of heterogeneous data from diverse sources, ranging from a GenBank sequence to the results of a microarray experiment. The interfaces to these sources vary so widely that a biologist needs to be an expert in very different areas of computer science: databases, networking, languages, etc. Furthermore, the algorithms used to extract biologically significant information tend to be developed in an ad hoc manner. As a result, very little code is shared between data analysis algorithms, with a concomitant increase in code complexity.
Instead of developing each tool ab initio, our bioinformatics system VALIS defines low-level building blocks and uniform APIs that make them usable from high-level scripting languages. This enables biologists to write very simple scripts that perform fairly involved bioinformatics processing in a flexible fashion.
As an example, we use the VALIS system to investigate the consequences of various cellular events on the evolution of genomic DNA sequences. How genomes evolve is a very important problem in biology. Solving it will lead to a better understanding of the mechanisms of cancer development and to more accurate analyses of phylogenetic data.
We approach the study of sequence evolution through the statistical properties of DNA sequences, in particular their long-range correlation properties. We estimate a few of these statistical parameters in the hope of distinguishing between different models of DNA evolution in coding and non-coding regions.
To study the scale-invariant long-range correlation of DNA sequences, we view them as being generated by a random walk model. We first map the whole genomic DNA sequence using the purine-pyrimidine binary rule: purines (A/G) become +1 and pyrimidines (C/T) become -1. This creates a 'DNA walk' along the genome. The 'DNA walker' moves either up or down at every base pair according to the binary map of the DNA sequence. If there is no long-range correlation, the walk is a realization of a Brownian motion. Otherwise, we observe a 'walker' with long-term memory, and thus a fractional Brownian motion. The two processes are characterized by different values of the Hurst exponent (H): H = 0.5 for Brownian motion and H > 0.5 for a persistent (positively correlated) fractional Brownian motion, i.e. higher H values suggest the presence of stronger long-range correlation. We estimate H using many different methods, for example R/S analysis and detrended fluctuation analysis (DFA).
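The mapping and one of the estimators named above can be sketched in a few lines of Python. This is a minimal illustration, not the VALIS implementation: the function names (`dna_steps`, `dna_walk`, `dfa_exponent`), the use of NumPy, the choice of first-order (linear) detrending, and the window sizes are all assumptions made for the example. Symbols other than A, C, G and T are simply skipped, which is also an assumption.

```python
import numpy as np

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def dna_steps(seq):
    """Purine-pyrimidine rule: A/G -> +1, C/T -> -1 (other symbols skipped)."""
    steps = []
    for base in seq.upper():
        if base in PURINES:
            steps.append(1.0)
        elif base in PYRIMIDINES:
            steps.append(-1.0)
    return np.array(steps)


def dna_walk(seq):
    """Position of the 'DNA walker' after each base pair."""
    return np.cumsum(dna_steps(seq))


def dfa_exponent(steps, window_sizes):
    """Estimate the scaling exponent by first-order DFA (hypothetical helper).

    The mean-removed walk is cut into non-overlapping windows of n points,
    a straight line is fitted and removed in each window, and the RMS
    residual F(n) is recorded. The slope of log F(n) against log n is the
    exponent: about 0.5 for uncorrelated steps, larger for persistent ones.
    """
    profile = np.cumsum(steps - steps.mean())
    fluctuations = []
    for n in window_sizes:
        n_windows = len(profile) // n
        # One column per window, n rows each.
        segments = profile[: n_windows * n].reshape(n_windows, n).T
        x = np.arange(n)
        slope, intercept = np.polyfit(x, segments, 1)
        residuals = segments - (np.outer(x, slope) + intercept)
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    exponent, _ = np.polyfit(np.log(window_sizes), np.log(fluctuations), 1)
    return exponent


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic, uncorrelated sequence; real genomic data would be read from a file.
    random_seq = "".join(rng.choice(list("ACGT"), size=20000))
    print(dfa_exponent(dna_steps(random_seq), [16, 32, 64, 128, 256, 512]))
```

For an uncorrelated synthetic sequence the estimate should fall close to 0.5; values clearly above 0.5 on real sequences would indicate the long-range correlation discussed above. R/S analysis follows the same pattern (a statistic computed per window size, then a log-log fit) and is not shown here.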
We have analyzed various genomes using VALIS: bacterial, invertebrate and vertebrate. We observe a consistent difference in H between the coding and the non-coding regions: the H values tend to be higher in the non-coding regions. Thus, the DNA walk along the bacterial coding sequences behaves as a Brownian motion (H ~ 0.5), while it acts as a fractional Brownian motion in the non-coding regions (H > 0.5). For other organisms, such as yeast, the difference persists: yeast has H ~ 0.54 in the coding regions versus H ~ 0.61 in the non-coding regions. The higher H values in the non-coding regions indicate that sequences there possess much stronger long-range correlation than those in the coding regions. In addition, the H values in the different regions increase with the evolutionary position of the corresponding organism. This suggests that some cellular events tend to make DNA sequences more correlated as evolution proceeds.
Based on our observations, we hypothesize that the differences in the strength of long-range correlation in DNA sequences are caused by the counteraction of two sets of biological events. One set includes insertion and deletion events caused by DNA polymerase stuttering and transposons, which tend to increase long-range correlation. The other set includes natural selection and DNA repair mechanisms, which tend to eliminate the long-range correlation caused by the former events. The coding regions are under higher natural selection pressure and possess a transcription-coupled DNA repair mechanism that is unique to them. Thus, the stronger correlation-eliminating forces in the coding regions can explain why the long-range correlation observed there is weaker than in the non-coding regions. The greater flexibility offered by the larger genomes of higher organisms, in turn, allows long-range correlation in DNA sequences to increase along the evolutionary tree.
To test our hypothesis, we designed a 'Genome Grammar'. This is a stochastic grammar with primitives for many kinds of mathematical probability distributions. We can even generate a sequence with the same probability distribution as that measured from biological data. Furthermore, there are tools that let one apply hypothesized processes to sequences obtained from the grammar. This enables biologists to apply any model and conduct evolutionary experiments 'in silico'.
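The idea of such an experiment can be illustrated with a small Python sketch. It is not the Genome Grammar itself, whose syntax and primitives are not described here. It assumes the simplest possible model (bases drawn independently from measured single-base frequencies) and one hypothetical process, tandem duplication, standing in for an insertion event; all function names and parameters are invented for the example.

```python
import random
from collections import Counter

BASES = "ACGT"


def empirical_base_distribution(seq):
    """Single-base frequencies measured from a sequence (other symbols ignored)."""
    counts = Counter(b for b in seq.upper() if b in BASES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}


def generate_sequence(distribution, length, rng):
    """Draw `length` independent bases from the given distribution."""
    bases = list(distribution)
    weights = [distribution[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


def apply_tandem_duplications(seq, n_events, max_len, rng):
    """Hypothetical insertion process: copy a random segment right after itself.

    Assumes len(seq) >= max_len so that a segment of any allowed length fits.
    """
    for _ in range(n_events):
        length = rng.randint(1, max_len)
        start = rng.randrange(0, len(seq) - length + 1)
        end = start + length
        seq = seq[:end] + seq[start:end] + seq[end:]
    return seq


if __name__ == "__main__":
    rng = random.Random(42)
    # Toy input standing in for measured biological data.
    measured = empirical_base_distribution("ATGCGATACGCTTGAGGCTA")
    ancestor = generate_sequence(measured, 10000, rng)
    evolved = apply_tandem_duplications(ancestor, n_events=200, max_len=50, rng=rng)
    print(len(ancestor), len(evolved))
```

Comparing a correlation estimate such as H on `ancestor` and `evolved` is the kind of in silico test of a hypothesized process that the paragraph above describes.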
Our observations also have potential significance for biotechnology applications. Combined with the highly efficient statistical algorithms in VALIS, the discovery of statistical differences between DNA coding and non-coding regions may lead to in vitro biochemical technologies that can efficiently distinguish coding from non-coding regions without the effort of DNA sequencing.