By examining corresponding gene variants (orthologs) in different species, we may infer the history of evolution. The most common way to demonstrate the result is to use the notation of the phylogenetic tree: a rooted, leaf-labeled tree. Each leaf has a label representing the name of the current species. For example:(root)
In phylogenetics, trees are usually represented by well parenthesized strings which is also known as the Newick tree format. Therefore the above tree will be recorded as:
(((((humans,monkeys),dogs),birds),frogs),sharks)Generally, phylogentic trees are unordered. Meaning, the order between siblings is unimportant. So we can feel free to rewrite the above tree to:
All internal nodes (those numbered in above tree) are hypothetically extinct species, which is a common assumption in phylogenetics. For example, we assume we will not find the common ancestor of humans and monkeys (species 4) because we believe it is extinct.
Another interpretation is through the relationship between species. From the example above, we can say humans and monkeys have a closer relationship than humans and frogs. Similarly, sharks and dogs have a farther relationship than birds and dogs. In fact, gene analysis usually give us only relationships between species and trees are reconstructed accordingly. There are many methods and software available for this purpose but that is independent to our problem. Here we assume trees are already generated. The problem is that we might get different trees by analyzing different genes. Suppose we get the following two trees:
/ \ birds
/\ / \
/ \/ \
humans monkeys dogs
This is not a new idea in biology. In this case monkey is said to be a hybrid species or a species from an ancient interbreeding event. Of course, these events should be rare compared to gene mutations, which cause branching in phylogenetic trees. So our goal is to combine all trees into a graph with a minimum number of interbreeding events.