Our goal is to choose one tree from each gene, then combine those trees into a directed acyclic graph with the fewest interbreeding events. The algorithm can be described in two stages: tree selecting and tree combining.

Tree Selecting

  • Consensus Method

  • Our goal is to select one tree from each gene. Therefore we would like to select popular trees, since each of them immediately "covers" many genes. If there is a single tree suggested by all genes, then it suffices to select only that tree. But usually we will see a situation like the following:

    gene g1 suggests trees t1, t3, t5
    gene g2 suggests trees t1, t2, t4
    gene g3 suggests trees t1, t6, t4
    gene g4 suggests trees t1, t6, t5
    gene g5 suggests trees t2, t4, t7
    gene g6 suggests trees t3, t5, t8

    In this example we can select trees {t1,t2,t3} to cover all genes (at least one tree from each gene). This kind of problem is called a hitting set problem. The fewer trees we select, the easier it is to combine them with few interbreeding events. (This statement is questionable, since it could be harder to combine two totally different trees than to combine many similar trees. In practice, however, popular trees usually share common structures and turn out to be similar after all.) Therefore we would like to pick the smallest hitting set, so in the above example {t4,t5} is the right choice. Such a problem is called the minimum hitting set problem. In this software, we have developed a hitting set solver to solve it efficiently. See the algorithms of the hitting set solver for more details.
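
    For an instance this small, the minimum hitting set can be found by brute force. The Python sketch below is illustrative only (the software itself uses the dedicated hitting set solver); it reproduces the example instance above:

```python
from itertools import combinations

# The example instance: each gene suggests a set of candidate trees.
genes = {
    "g1": {"t1", "t3", "t5"},
    "g2": {"t1", "t2", "t4"},
    "g3": {"t1", "t6", "t4"},
    "g4": {"t1", "t6", "t5"},
    "g5": {"t2", "t4", "t7"},
    "g6": {"t3", "t5", "t8"},
}

def minimum_hitting_set(sets):
    """Try all subsets in order of increasing size; fine for tiny inputs only."""
    universe = sorted(set().union(*sets))
    for k in range(1, len(universe) + 1):
        for cand in combinations(universe, k):
            if all(s & set(cand) for s in sets):
                return set(cand)
    return set(universe)

picked = minimum_hitting_set(list(genes.values()))
print(len(picked))  # 2 -- the pair {t4, t5} covers all six genes
```
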

    Another thing we might want to do is ignore trees that rarely appear in the input data, since they are supported by very little evidence. Therefore we set a cutoff frequency F in the software to discard trees that appear fewer than F times in the data. If a gene is so eccentric that every tree it suggests is discarded, then we ignore that gene as well. The resulting algorithm is called the Consensus Method:

    1. U <-- the set of all gene trees (duplicates removed)
    2. U <-- U - {t | tree t appears fewer than F times in the input}
    3. S <-- empty set
    4. for each gene g do
    5.     R <-- {t | t belongs to U and is suggested by g}
    6.     Append R to S
    7. for i in [1..MAX_ITERATION] do
    8.     Ci <-- Hitting_Set_Solver(S, U)
    9. return the collection Ci that requires the fewest interbreeding events to be combined

    The results can be found on the software webpage.

  • Simulated Annealing Method

  • The consensus method selects trees based on their popularity. However, if all trees appear only once (or very few times) in the data, this method is of little use; in that case it works better to select similar trees from each gene instead of looking for identical trees. For this purpose we first define the similarity between trees:


    1. The distance between two trees:
      distance(T1, T2) = the smallest number of interbreeding events needed to combine T1 and T2.
    2. Similarity: two trees are said to be similar if the distance between them is small.

    Once we define the similarity, we can easily come up with some greedy heuristics for tree selecting.

    We develop the following algorithm by combining the above two heuristics, and we further randomize it so that it does not always run into the same dead end. Even so, these heuristics are nearly deterministic and therefore easily fall into local minima. A popular method for escaping local minima is simulated annealing. We first give some definitions before writing down the whole algorithm:



      Simulated_Annealing(T0, T, alpha):
    1. Pick a starting collection C using randomized hybrid heuristics
    2. t <-- T0
    3. while t > T do
    4.     Choose a neighbor C' of C
    5.     d <-- Evaluation(C) - Evaluation(C')
    6.     if d > 0 then
    7.         C <-- C'
    8.     else with probability e^(d/t):
    9.         C <-- C'
    10.    t <-- t * alpha
    11. return the best tree collection seen through the whole process

    Here T0 and T stand for the initial and final temperatures, respectively, and the temperature decreases exponentially by a factor alpha every iteration. Note that even if the neighbor is worse than the current solution, there is still some chance for the algorithm to switch to it (especially when t is large, since the probability of switching is e^(d/t)). That means it is possible to get out of a local minimum, which is an important feature. The spirit of this algorithm is that the initial high temperature makes the whole system unstable; then the temperature cools down slowly, making it harder and harder to switch without benefit. See an introduction to simulated annealing for more insight.
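
    As a concrete illustration, here is a minimal Python sketch of the same annealing loop applied to a toy minimization problem. The `evaluate` and `neighbor` arguments stand in for the tree-collection evaluation and neighborhood used by the software; the toy objective below is hypothetical:

```python
import math
import random

def simulated_annealing(start, neighbor, evaluate, T0=10.0, T=0.01, alpha=0.95):
    """Sketch of the annealing loop above (minimization)."""
    current = start
    best, best_score = start, evaluate(start)
    t = T0
    while t > T:
        candidate = neighbor(current)
        d = evaluate(current) - evaluate(candidate)
        # always accept improvements; accept worse moves with probability e^(d/t)
        if d > 0 or random.random() < math.exp(d / t):
            current = candidate
        if evaluate(current) < best_score:
            best, best_score = current, evaluate(current)
        t *= alpha
    return best

# Toy stand-in problem: minimize (x - 3)^2 over the integers,
# with neighbors x - 1 and x + 1.
f = lambda x: (x - 3) ** 2
step = lambda x: x + random.choice((-1, 1))
result = simulated_annealing(0, step, f)
```

Because the best solution seen so far is tracked separately, the returned value is never worse than the starting point, even when the walk wanders uphill at high temperature.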

    In this problem, a single bad tree can seriously poison the solution, so in actual experiments the original greedy solution was rarely improved much. Therefore, instead of running one long annealing process, we might also want to restart the algorithm several times to explore the search space as much as we can. This is the final algorithm we developed. The experimental results can be found on the software webpage.

    Tree Combining

    We use the following example (trees are given in Newick format):

    geneA: ((s1 (s2 s3) s4 s5) (s6 s7 s8))
    geneB: ((s6 (s1 s2) s7 (s3 s4)) (s5 s8))

    s1: {geneA_11, geneB_121}
    s2: {geneA_121, geneB_122}
    s3: {geneA_122, geneB_141}
    s4: {geneA_13, geneB_142}
    s5: {geneA_14, geneB_21}
    s6: {geneA_21, geneB_11}
    s7: {geneA_22, geneB_13}
    s8: {geneA_23, geneB_22}

    All trees here are unordered; that is, the order of siblings is unimportant, so the Newick representation is not unique. For example, the following two trees are equivalent to the two above:

    geneA: ((s6 s7 s8) (s1 (s2 s3) s4 s5))
    geneB: ((s6 s7 (s1 s2) (s3 s4)) (s5 s8))

    Also, in Newick format, each pair of parentheses corresponds to an internal node of the tree, namely the nearest common ancestor of all species enclosed by those parentheses. For a better illustration, see our introduction.
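
    The ortholog subscripts in the table above are exactly the root-to-leaf paths of child indices in each gene tree. The following sketch derives them from the Newick strings (an illustration, assuming at most nine children per node so each index is a single digit):

```python
def leaf_paths(newick):
    """Derive each species' subscript as its root-to-leaf path of child
    indices (assumes at most 9 children per node, one digit each)."""
    tokens = newick.replace("(", " ( ").replace(")", " ) ").split()
    paths = {}
    counters = []   # running child counter for each open internal node
    prefix = []     # child-index path leading to the current node
    for tok in tokens:
        if tok == "(":
            if counters:              # a non-root node is itself a child
                counters[-1] += 1
                prefix.append(counters[-1])
            counters.append(0)
        elif tok == ")":
            counters.pop()
            if counters:
                prefix.pop()
        else:                         # a species name (leaf)
            counters[-1] += 1
            paths[tok] = "".join(map(str, prefix)) + str(counters[-1])
    return paths

geneA = "((s1 (s2 s3) s4 s5) (s6 s7 s8))"
geneB = "((s6 (s1 s2) s7 (s3 s4)) (s5 s8))"
print(leaf_paths(geneA)["s2"], leaf_paths(geneB)["s2"])  # 121 122
```

Running this on geneA and geneB reproduces the subscript table listed above, e.g. s2: {geneA_121, geneB_122}.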




    1. If s1 and s3 have a closer relationship than s1 and s2 do in some gene tree, then any group containing both s1 and s2 must also contain s3.
    2. Assume G is the optimum combined graph, i.e., the one with the fewest interbreeding events. If a subset of species S forms a group, then G has a subtree T whose leaf set is exactly S. The root of T is the common ancestor of species in S. In other words, the descendant set of the root of T is S.
    3. Let S be the descendant set of species A. In that case, the subscript of each ortholog in A will be the longest common prefix of the corresponding orthologs in the species of S.

    Therefore, the algorithm can be described as follows:

    Main algorithm

    1. Grouping: Find all groups in the input data.
    2. Find a minimal group; denote its species set by S.
    3. For every gene tree, extract the subtree containing only S. Combine these subtrees into a small graph G with the fewest interbreeding events.
    4. In the original trees, replace S by its common ancestor, which is the root of G.
    5. Insert G into our final graph.
    6. Repeat steps 2-5 until there is only one species left. That species will be the root of the final graph.


    The next step is to solve the tree combining problems from the previous stage. The basic strategy is to remove a minimum number of problematic species that cause conflicts, so that all trees become mutually consistent and can be combined into one big tree. After that, we insert intermediate missing species to complete the gene deriving steps. Finally, we pair up the removed species with proper parents and transform the big tree into a graph.




    Therefore our approach can be described as follows:

    1. Choose a minimal set of species S and remove them so that no conflict triple remains
    2. Combine all trees into one tree
    3. Insert the removed species back with proper parents

    Here we assume the removed species came from interbreeding events, so in the last step we will actually associate each of them with multiple parents.

    Clearly, if at the beginning we have a conflict triple (A,B,C), then at least one of A, B, and C must be in S; otherwise the removal of S would not eliminate this triple. So again we get a hitting set problem:



    We model each conflict triple as a set of size 3 and let K be the set of all species. Then the species removal problem in step 1 becomes a hitting set problem. A hitting set solver was developed to solve this NP-complete problem. (The hitting set problem remains NP-complete even under the constraint that every set has size 3.) Once we find a hitting set, we remove those species from all trees; now we can combine the rest into a big tree (see the theorem above). In fact this can be done by the same steps as in the main algorithm (see the following example).
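
    As a small illustration of the modeling (with hypothetical conflict triples, not the ones from the running example), the species-removal step can be phrased directly as a hitting set instance and, for tiny inputs, solved exactly:

```python
from itertools import combinations

# Hypothetical conflict triples among species; each has size exactly 3.
triples = [{"s1", "s2", "s5"}, {"s2", "s4", "s8"}, {"s5", "s7", "s8"}]
species = sorted(set().union(*triples))

def min_removal(triples, species):
    """Smallest species set hitting every conflict triple (brute force)."""
    for k in range(1, len(species) + 1):
        for cand in combinations(species, k):
            if all(t & set(cand) for t in triples):
                return set(cand)

removed = min_removal(triples, species)
print(len(removed))  # 2 -- removing two species eliminates every triple
```
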


    After combining the subtrees, we complete gene deriving steps by inserting intermediate hypothetical species.


    Finally, we infer the parents of removed species by their orthologs.


    Following this algorithm, assume we remove s5 and s8 from combining problem 3; then two more intermediate species will be inserted:

    miss11 {geneA_, geneB_1}
    miss12 {geneA_, geneB_2}

    The complete combined graph:


    We start with a set of two species and recursively bring in species that are close to some species in the set. The algorithm can be described as follows:

    1. Arbitrarily pick species s1 and s2
    2. s+ <-- {s1, s2}
    3. repeat until s+ does not change:
    4.     for each species t outside s+ :
    5.         for each gene g and any two species x, y in s+ :
    6.             if x and t have a closer relationship than do x and y then
    7.                 Add t into s+
    8.                 break
    9. return s+

    Repeat this procedure for all possible combinations of s1 and s2 so that we obtain all possible groups. (There are some minor issues to be careful about; for more details, see the paper.)
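
    Below is an illustrative Python sketch of this grouping procedure using the ortholog subscripts from the example above. As an assumption of the sketch (one way to realize the relationship test), "closeness" of two species in a gene is measured as the common-prefix length of their subscripts, i.e. the depth of their nearest common ancestor:

```python
# Ortholog subscripts from the example (species -> gene -> subscript).
subs = {
    "s1": {"geneA": "11",  "geneB": "121"},
    "s2": {"geneA": "121", "geneB": "122"},
    "s3": {"geneA": "122", "geneB": "141"},
    "s4": {"geneA": "13",  "geneB": "142"},
    "s5": {"geneA": "14",  "geneB": "21"},
    "s6": {"geneA": "21",  "geneB": "11"},
    "s7": {"geneA": "22",  "geneB": "13"},
    "s8": {"geneA": "23",  "geneB": "22"},
}
gene_names = ["geneA", "geneB"]

def depth(g, a, b):
    """Depth of the nearest common ancestor of a and b in gene g's tree,
    measured as the common-prefix length of their subscripts."""
    x, y = subs[a][g], subs[b][g]
    n = 0
    while n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return n

def grow_group(s1, s2):
    """Grow {s1, s2} by pulling in any outside species t that is closer
    to a member x than some other member y is (the closure rule above)."""
    group = {s1, s2}
    changed = True
    while changed:
        changed = False
        for t in set(subs) - group:
            if any(depth(g, x, t) > depth(g, x, y)
                   for g in gene_names
                   for x in group for y in group if x != y):
                group.add(t)
                changed = True
    return group

print(sorted(grow_group("s2", "s3")))  # ['s1', 's2', 's3', 's4']
```

Note how the group grown from {s2, s3} also pulls in s1 and s4, because geneB places (s1 s2) and (s3 s4) together; the full procedure in the paper handles the remaining subtleties.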

    Hitting Set Solver

    Here we introduce three approximation/heuristic algorithms for the general hitting set problem.

    Algorithm 1 - 3-approximation algorithm:
    1. H <-- empty set
    2. while H does not yet hit every set in C do
    3.     Pick a set not yet hit and add every element of that set to H
    4. return H

    Note that in our case every set in C has size 3, so this algorithm has a theoretical guarantee: if the minimum hitting set has size k, then the size of the returned hitting set will not exceed 3k. In computer science jargon, we say this algorithm has an approximation factor of 3. In practice, however, it does not work very well; it might not even report a minimal hitting set. For the above example, the output would be {1,4,5,2,6,7}, whose size is exactly 3 times the optimum. In our program, redundant elements are removed from the solution so that the final output is minimal, but the performance is still not great in general. We kept this implementation merely because it achieves the best known approximation ratio.
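
    A Python sketch of this 3-approximation, together with the pruning step mentioned above that removes redundant elements (the instance here is a made-up toy, not the example from the text):

```python
def approx3_hitting_set(sets):
    """For each set not yet hit, add all of its elements to H."""
    H = set()
    for s in sets:
        if not (set(s) & H):
            H |= set(s)
    return H

def prune(H, sets):
    """Drop elements that are not needed, so the result is minimal."""
    for e in sorted(H):
        if all(set(s) & (H - {e}) for s in sets):
            H = H - {e}
    return H

C = [{1, 2, 3}, {1, 4, 5}, {2, 4, 6}]   # every set has size 3
H = approx3_hitting_set(C)              # {1, 2, 3}: hits everything, not minimal
print(sorted(prune(H, C)))              # [1, 2]: minimal after pruning
```
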

    Algorithm 2 - Frequency heuristic (greedy) algorithm:
    1. H <-- empty set
    2. while H does not yet hit every set in C do
    3.     Choose the element hitting the most un-hit sets and add it to H
    4. return H

    This algorithm also has a theoretical guarantee: an approximation factor of O(ln N), where N is the total number of sets. However, this algorithm is popular not because of its theoretical properties but because it is the most natural thing to do. For the above example, the algorithm picks 1 first; after that, {2,6,7} becomes the only un-hit set, and any one of its elements will be picked in the second iteration. The final output is the minimum solution. Unfortunately, this algorithm is almost deterministic and therefore easily falls into local minima, which prevents further improvement of the solution.
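
    A Python sketch of the greedy heuristic, on the same kind of hypothetical toy instance:

```python
from collections import Counter

def greedy_hitting_set(sets):
    """Repeatedly add the element that hits the most un-hit sets."""
    sets = [set(s) for s in sets]
    H = set()
    while True:
        unhit = [s for s in sets if not (s & H)]
        if not unhit:
            return H
        counts = Counter(e for s in unhit for e in s)
        H.add(counts.most_common(1)[0][0])

C = [{1, 2, 3}, {1, 4, 5}, {2, 4, 6}]
H = greedy_hitting_set(C)
print(len(H))  # 2 -- greedy happens to be optimal on this instance
```
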

    Algorithm 3 - Random hitting algorithm:
    1. H <-- K
    2. P <-- a random permutation of K
    3. for e in P do
    4.   if H-e is still a hitting set then
    5.     H <-- H-e
    6. return H

    This strategy is quite different. Instead of starting with an empty set, we initially take all elements of K, which is guaranteed to be a hitting set. Then the algorithm tries to discard elements in a random order. The good news is that it always has a chance to reach the optimum solution: if we repeat the algorithm enough times, the probability of obtaining a nearly optimum solution is high. However, if the problem size is huge, we need many repetitions to get a good solution, which makes the software inefficient.
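
    A sketch of the random hitting algorithm; repeating it and keeping the smallest result is how it is used in practice (the instance is again a hypothetical toy):

```python
import random

def random_hitting_set(sets, K):
    """Start from all of K and discard elements in random order while
    the remainder still hits every set; the result is always minimal."""
    sets = [set(s) for s in sets]
    H = set(K)
    order = list(K)
    random.shuffle(order)
    for e in order:
        if all(s & (H - {e}) for s in sets):
            H.discard(e)
    return H

C = [{1, 2, 3}, {1, 4, 5}, {2, 4, 6}]
K = {1, 2, 3, 4, 5, 6}
best = min((random_hitting_set(C, K) for _ in range(50)), key=len)
```

Every run returns a minimal (irreducible) hitting set, but not necessarily a minimum one, which is why repetition matters.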

    Our hitting set solver combines all three strategies, since each has its pros and cons. The final solver runs the first two deterministic algorithms once and the third, randomized algorithm MAX_ITERATION times, then reports the best solution seen.


    Chien-I Liao