Dennis Shasha
Courant Institute of Mathematical Science
New York University
shasha@cs.nyu.edu
Chien-I Liao
Courant Institute of Mathematical Science
New York University
cil217@nyu.edu
Phylogeny is a tool biologists use to represent the evolutionary relationship between different species. Historically, most believe that this evolutionary relationship could be represented using trees. With the advent of genomics, numerous computer algorithms have been used to build phylogenetic trees. However, discrepancies observed in phylogenetic trees lead some (including us) to believe that evolution does not actually occur in a tree-like fashion. Moreover, events such as hybridization or horizontal gene transfer occur in nature and may play an important role in evolution. Trees are incapable of representing such events. Therefore, to represent evolution accurately, we need to use a network representation.
In this research we introduce the Graph of Life to model the history of evolution by allowing interbreeding events (also known as species recombinations). The idea is to examine the DNA sequences in each species, generate the top few most possible phylogenetic trees for each gene family, then pick one from each gene and combine them into a directed acyclic graph. In the final graph, some species may have multiple parents which indicate ancient interbreeding events. The interbreedings, however, should be rare compared to gene mutation. Therefore our goal is to pick trees as similar as possible and combine them with a minimum number of interbreeding events.
* Some previous work can be find here.
The software is not yet published but you may submit your data using the following form. Both input format and parameter explanation could be found in the next section. For questions about the format, to send any feedback to our software or to download the source code, please contact Prof. Dennis Shasha or Chien-I Liao
If you are not familiar with phylogenetics, here is a brief introduction to our problem.
A sample gene tree:
(Calb ((Scas ((Spar Scer Sbay) Smik) Skud) Sklu)) [geneA:457]
Note: we allow missing orthologs in species, i.e., the gene trees need not all share the complete set of species.Parameters are passing by UNIX-like command line flags in this software. It supports two forms: "-x arg" or "--argname=arg". Please refer to the full description of parameters.
Sample commands:
The output graph will be written in the dot language. dot is a free graph drawing program originally developed under UNIX platform. For Unix users, run the following commands:
python graph_of_life.py -d 0 [other optional flags] > filename.dot
dot -Tps filename.dot -o filename.ps
(for .ps output)
dot -Tjpg filename.jpg -o filename.jpg
(for .jpg output)
Note the "-d 0" is necessary for a clean dot output. The default debug level is 1 thus the default output contains extra messages and can't be parsed by the dot program.
For other operating systems, please visit www.graphviz.org to download dot.
Here is very good documentation for the dot language.
The phylogenetic tree is built based on a wide variety of information, including genetic analysis, biochemical analysis and analysis of morphology. It it quite possible that different trees are suggested by different sources. And our goal is to preserve all information available. Therefore instead of choosing one consensus tree and dropping all the others, we combine all trees into a single graph.
We start by introducing some simple notation. The ortholog of gene g is the variant in each species having gene g. That means each species appeared in trees suggested by g has an ortholog of g. Since orthologs in different species are all slightly different, we use subscript strings to identify them. Therefore for this gene tree:
(Calb ((Scas ((Spar Scer Sbay) Smik) Skud) Sklu)) [geneA:457]
The corresponding orthologs in each species:
Calb: {geneA_1}
Scas: {geneA_211}
Spar: {geneA_21211}
Scer: {geneA_21212}
Sbay: {geneA_21213}
Smik: {geneA_2122}
Skud: {geneA_213}
Sklu: {geneA_22}
Generally, a species contains ortholog X_ijk if gene X suggests a tree where this species is the kth child of the jth child of the ith child of the root.
Assumptions:
We modeled the history of evolution using only gene mutation and interbreeding events. Therefore there are 3 assumptions in this study:
Our goal is to output a graph with a minimum number of interbreeding events. If there is a tie, choose the one which minimizes the total tree score/rank. However, it has been shown that even merging two trees into a graph with the fewest interbreeding events is NP-hard[3]. Therefore some good heuristics are needed to solve it.
We have divided our problem into two stages:
We have developed two algorithms: Consensus Method and Simulated Annealing for the tree selecting problem, and we have associated the tree combining stage with a hitting set problem. Details are described in the algorithm page.
All input data was adapted from Rokas et al [1]. Each tree has the same 8 sequenced yeast species.
In this case, only tree combining was performed. The input consists of 9 different gene trees given in [1]. The recombined graph has 5 interbreeding events and 3 hybrid species:
Chien-I Liao