The Graph of Life

Chandni Rajan Valiathan
Howard Hughes Honors Summer Institute
Summer 2004
chandni@brandeis.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu

Contents:

Introduction
Problems with trees
Phylogenetic Networks
Project Goals
Algorithm 1
Algorithm 2
Results
References

Introduction:

Phylogeny is a tool biologists use to represent the evolutionary relationship between different species. Historically, most believed that this evolutionary relationship could be represented using trees. With the advent of genomics, numerous computer algorithms have been used to build phylogenetic trees. However discrepancies observed in phylogenetic trees lead some (including us) to believe that evolution does not actually occur in a tree-like fashion. Moreover, events such as hybridization or horizontal gene transfer occur in nature and play an important role in evolution. Trees are incapable of representing such events. Therefore, to represent evolution accurately, we need to use a network representation.

Back to top

Problems with trees:

One reason to think that trees do not give an accurate representation of evolution is that different gene trees for a particular set of species vary. The reasons for these discrepencies could be that:

The input data was erroneous
Reticulation events such as horizontal gene transfer and hybridization has taken place
Convergent evolution has taken place

Phylogenetic Networks:

Reticulation events can be represented using Directed Acyclic Graphs (DAGs) of a certain form. There are two kinds of nodes in these special DAGs: tree nodes and reticulation nodes. Tree nodes have one edge entering them and two or more edges leaving them. Reticulation nodes have many edges entering them and one edge leaving them. Reticulation nodes give phylogenetic networks the ability to depict reticulation events.
At present there is no universally accepted network reconstruction algorithm.

Back to top

Project Goals

The goals of my project were to study various proposed algorithms and to compare these algorithms and their results.
Back to top

Algorithm 1:Galled network reconstruction (Nakhleh et. al)

Definitions:

Reticulation Cycle: Suppose there is a node w in a network from which two directed paths come out, and that the two paths meet at a node x. This node x is called a reticulation node, and the 2 directed paths (that start at w and end at x) together form a reticulation cycle.
Gall: A reticulation cycle that shares no nodes with any other reticulation cycle is known as a gall
Galled Network: A network in which all the reticulation cycles are galls is called a galled network
Partition: Every edge e in a tree has two sets of nodes associated with it: the first set contains all those nodes below the edge and the second set contain all the remaining nodes. These two sets of nodes together constitute a partition of e
Incompatible Edges: An edge e1 in tree T1 is said to be incompatible with tree T2 if there are no edges in tree T2 with the same partition as e1
Ends of a path: These are the subtrees attached to the first or last nodes of a path.

See the example to see these relationships. In the diagram, there are two reticulation cycles: one made up of the edges (1,2)(2,4)(1,3)(3,4) and the second made up of the edges (5,8)(8,10)(5,9)(9,10). Both of these are galls since none of the nodes in the first reticulation cycle are in the second and vicce versa. Since there are only two reticulation cycles in the graph, and both of them are galls, the network in the example is a galled network.
The partition of the dashed edge (3, 6) is {{g,h},{1,2,4,5,7,8,9,10,11,a,b,c,d,e,f}}

Basic concept:

Given two different gene trees T1 and T2, we first find the incompatible edges in the two trees.
For each path p made up of the incompatible edges, we find the subtree X in one of the ends of p such that when X is removed from both trees T1 and T2, we get a common tree T3.
Once you find such a subtree X which is always attached to one of the ends of p, we can build the network by adding a new edge from the other end of p to the root of X.

Assumptions:

An ortholog of a gene can only appear once in a gene tree (i.e there is no convergence)
The true Phylogenetic nework is a galled network

Reasons given to justify assumption 2

True evolutionary history is likely to be a galled network if:

there are moderate levels of recombination
most recombination events are recent
we are studying a single recombination event

To download this algorithm, please contact chandni@brandeis.edu
Back to top

Algorithm 2: Ancestral Network Reconstruction (Dennis Shasha)

This algorithm takes as input a set of different gene trees for a particular set of species and constructs a network. Notation: If A is a gene then A_x is a variant or ortholog of A, and A_xy is a descendant of A_x.
Suppose the gene tree for a set of species {Sp1, Sp2, Sp3} for gene A is (A1, (A21, A22)) where
Sp1:A₁
Sp2:A₂₁
Sp3:A₂₂
Suppose also that the gene tree for the same set of species for gene B is (B₁₁, (B₂₁,B₂₂)) where
Sp1: B₂₂
Sp2:B₂₁
Sp3:B₁₁
Combining this data we get that
Sp1:A₁B₂₂
Sp2:A₂₁B₂₁
Sp3:A₂₂B₁₁
We can already see a conflict in that Sp1 and Sp2 must have an ancestor AB₂ (which means that gene B must have changed before gene A) whereas Sp2 and Sp3 must have an ancestor A₂B (which means that gene A must have changed before B). This algorithm constructs the maximal base tree containing as many nodes as possible that satisfy tree conditions from the given data. Once this base tree is obtained, the algorithm adds the remaining nodes and edges to incorporate all the evolutionary information available.

Assumptions:

An ortholog of a gene can only appear once in a gene tree (i.e the phylogeny of a gene is a tree)
For each gene A in the phylogenetic network for species, the ortholog of A must be the direct descendent of its parent/s or equal to the ortholog of the parent

To download this program, please contact shasha@cs.nyu.edu
Back to top

Results

I ran the two algorithms on the two gene trees for genes YBL015W and YCL054W.
Here are the results for algorithm 1 and algorithm 2 Back to top

References:

B.M.E. Moret, L. Nakhleh, T. Warnow, C.R. Linder, A. Tholse, A. Padolina, J. Sun, and R. Timme, "Phylogenetic networks: modeling, reconstructibility, and accuracy." To appear in the IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 2004. PDF

L. Nakhleh, T. Warnow, and C.R. Linder, "Reconstructing Reticulate Evolution in Species -- Theory and Practice." Proceedings of the Eighth Annual International Conference on Research in Computational Molecular Biology (RECOMB 2004), 337-346.PDF

D. Gusfield, S. Eddhu, and C. Langley. "Efficient Reconstruction of Phylogenetic Networks with Constrained Recombination." Proceedings of Computational Systems Bioinformatics (CSB 03), 2003. PDF

W.P. Maddison. "Gene Trees in Species Trees" systematic biology, 46(3):523-536, 1997.

Genome-scale approaches to resolving incongruence in molecular phylogenies

Deriving the genomic tree of life in the prescence of Horizontal Gene Transfer: Conditioned reconstruction

Lake 1994 - Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances

NeighborNet: An agglomerative method for the construction of planar phylogenetic networks