# The Graph of Life

Chandni Rajan Valiathan
Howard Hughes Honors Summer Institute
Summer 2004
chandni@brandeis.edu

Dennis Shasha
Courant Institute of Mathematical Sciences
Department of Computer Science
New York University
shasha@cs.nyu.edu

Contents:

## Introduction:

Phylogeny is a tool biologists use to represent the evolutionary relationship between different species. Historically, most believed that this evolutionary relationship could be represented using trees. With the advent of genomics, numerous computer algorithms have been used to build phylogenetic trees. However discrepancies observed in phylogenetic trees lead some (including us) to believe that evolution does not actually occur in a tree-like fashion. Moreover, events such as hybridization or horizontal gene transfer occur in nature and play an important role in evolution. Trees are incapable of representing such events. Therefore, to represent evolution accurately, we need to use a network representation.

## Problems with trees:

One reason to think that trees do not give an accurate representation of evolution is that different gene trees for a particular set of species vary. The reasons for these discrepencies could be that:
1. The input data was erroneous
2. Reticulation events such as horizontal gene transfer and hybridization has taken place
3. Convergent evolution has taken place

## Phylogenetic Networks:

Reticulation events can be represented using Directed Acyclic Graphs (DAGs) of a certain form. There are two kinds of nodes in these special DAGs: tree nodes and reticulation nodes. Tree nodes have one edge entering them and two or more edges leaving them. Reticulation nodes have many edges entering them and one edge leaving them. Reticulation nodes give phylogenetic networks the ability to depict reticulation events.
At present there is no universally accepted network reconstruction algorithm.

## Project Goals

The goals of my project were to study various proposed algorithms and to compare these algorithms and their results.

## Algorithm 1:Galled network reconstruction (Nakhleh et. al)

### Definitions:

• Reticulation Cycle: Suppose there is a node w in a network from which two directed paths come out, and that the two paths meet at a node x. This node x is called a reticulation node, and the 2 directed paths (that start at w and end at x) together form a reticulation cycle.
• Gall: A reticulation cycle that shares no nodes with any other reticulation cycle is known as a gall
• Galled Network: A network in which all the reticulation cycles are galls is called a galled network
• Partition: Every edge e in a tree has two sets of nodes associated with it: the first set contains all those nodes below the edge and the second set contain all the remaining nodes. These two sets of nodes together constitute a partition of e
• Incompatible Edges: An edge e1 in tree T1 is said to be incompatible with tree T2 if there are no edges in tree T2 with the same partition as e1
• Ends of a path: These are the subtrees attached to the first or last nodes of a path.
See the example to see these relationships. In the diagram, there are two reticulation cycles: one made up of the edges (1,2)(2,4)(1,3)(3,4) and the second made up of the edges (5,8)(8,10)(5,9)(9,10). Both of these are galls since none of the nodes in the first reticulation cycle are in the second and vicce versa. Since there are only two reticulation cycles in the graph, and both of them are galls, the network in the example is a galled network.
The partition of the dashed edge (3, 6) is {{g,h},{1,2,4,5,7,8,9,10,11,a,b,c,d,e,f}}

### Basic concept:

1. Given two different gene trees T1 and T2, we first find the incompatible edges in the two trees.
2. For each path p made up of the incompatible edges, we find the subtree X in one of the ends of p such that when X is removed from both trees T1 and T2, we get a common tree T3.
3. Once you find such a subtree X which is always attached to one of the ends of p, we can build the network by adding a new edge from the other end of p to the root of X.

### Assumptions:

1. An ortholog of a gene can only appear once in a gene tree (i.e there is no convergence)
2. The true Phylogenetic nework is a galled network

### Reasons given to justify assumption 2

True evolutionary history is likely to be a galled network if:
• there are moderate levels of recombination
• most recombination events are recent
• we are studying a single recombination event

## Algorithm 2: Ancestral Network Reconstruction (Dennis Shasha)

This algorithm takes as input a set of different gene trees for a particular set of species and constructs a network. Notation: If A is a gene then Ax is a variant or ortholog of A, and Axy is a descendant of Ax.
Suppose the gene tree for a set of species {Sp1, Sp2, Sp3} for gene A is (A1, (A21, A22)) where
Sp1:A1
Sp2:A21
Sp3:A22
Suppose also that the gene tree for the same set of species for gene B is (B11, (B21,B22)) where
Sp1: B22
Sp2:B21
Sp3:B11
Combining this data we get that
Sp1:A1B22
Sp2:A21B21
Sp3:A22B11
We can already see a conflict in that Sp1 and Sp2 must have an ancestor AB2 (which means that gene B must have changed before gene A) whereas Sp2 and Sp3 must have an ancestor A2B (which means that gene A must have changed before B). This algorithm constructs the maximal base tree containing as many nodes as possible that satisfy tree conditions from the given data. Once this base tree is obtained, the algorithm adds the remaining nodes and edges to incorporate all the evolutionary information available.

### Assumptions:

1. An ortholog of a gene can only appear once in a gene tree (i.e the phylogeny of a gene is a tree)
2. For each gene A in the phylogenetic network for species, the ortholog of A must be the direct descendent of its parent/s or equal to the ortholog of the parent