/ April 2004: to do. Figure out which edge brings which genes. / Feb 2004: / The last output tells you the distance from root to each leaf. / You can incorporate the information about which species should be / part of the basic tree in preferredorder. / As of now preferredorder is hard-coded. yech. / January 1, 2004: / Program capabilities: given a bunch of gene family trees with species / at the leaves, / we can / 1. input those trees (gene family names must not have underscores "_"; / as those are used as specifiers, e.g. A_121). / 2. derive dependency edges based on the analysis in the puzzle / 3. determine which groups of species fit into single evolutionary trees. / 4. Produce those evolutionary trees when possible. / 5. determine which subsets of / families allow all species to fit into a single / evolutionary tree. (See graphprepare.k which calls this routine / on different gene family arguments.) / 6. Produce interbreeding models given a base evolutionary tree. / If a species S is a proper ancestor of species S' then we say / S < S' (it comes early in the derivation tree or the derivation / directed graph). / If the earliest (most ancestral) / instance of a symbol Z must always appear in an ancestor / species of a symbol / Z', then we say Z < Z'. / Data preparation: each open paren should have at / least one space before it at least. / At the end, to generate a tree, just do a topological sort. / Root is all the trees up to _ / Then start adding other elements based on edge relationships. / For example if tree2_1 precedes tree9_1 then descend by / copying a parent node and then adding to it. / For example, if we have / S1: A1 B1 / S2: A2 B2 / then there are no cycles. / Start with A B / then go on to A1 B1, A2 B2 immediately / If we have: / S1: A1 B1 / S2: A2 B2 / S3: A3 B1 / then we get B1 < A1, A3 so break down to B1 first and then the As. / If several derive from the same source species then they can all be processed / simultaneously. / Let's see if we can explain this topological sort better. / Start with tree names, e.g. tree1_, tree2_, ... / List all the edges. / A < all suffixes / B < all suffixes / B1 < A1 / B1 < A3 / If we have Bi < Aj, Ak / then advance the B to length 1 before advancing A to length 1. / Otherwise, advance them together. / For example, / A_1111 B_1111 / A_1112 B_1111 / A_1121 B_1211 / A_1122 B_1211 / Edges: A_11 < B_11, B_12; B_1111< A_1111, A_1112; / B_1211 < A _1121, A_1122 / Start with A_ and B_ / Then advance together to A_1 B_1 / Now since A_11 < B_12 and B_11, advance A_11 first. / So, get / A B / A_1 B_1 / A_11 B_1 / A_11 B_11 / A_11 B_12 / Now under A_11 B_11, we advance them both so we get / A B / A_1 B_1 / A_11 B_1 / A_11 B_11 / A_111 B_111 / A_11 B_12 / But now B_1111 < A_1111 and A_1112 in that context, so / we get / A B / A_1 B_1 / A_11 B_1 / A_11 B_11 / A_111 B_111 / A_111 B_1111 / A_1111 B_1111 / A_1112 B_1111 / A_11 B_12 / So, we get the following algorithm. / In the current context (meaning a current leaf e.g. A_111, B_111), / a tree (e.g. the A_ tree) can advance to the next level if it is not / the target of any edge. If it is, then it must wait. / Once the decision is made to advance, then any source of an edge that / matches a leaf tree must be removed. / Let's review the example above. / Initial leaf is A_ B_ / Edges: A_11 < B_11, B_12; / B_1111< A_1111, A_1112; / B_1211 < A _1121, A_1122 / Neither A_i nor B_j is the target of an edge so we advance / them both in the direction of the variant of species they have to go to. / In this case, this means: / A_ B_ / A_1 B_1 / Now, we consider that leaf A_1 B_1. / We notice that A_11 < B_11 and A_11 < B_12, so we can't advance the B, / but we can advance the A_. / A_ B_ / A_1 B_1 / A_11 B_1 / and we eliminate the edges: A_11 < B_11, B_12 / and we expand the B_1 (so this is part of processing of that edge). / It's possible that several edges must be processed tha way if there / are more than two trees. / A_ B_ / A_1 B_1 / A_11 B_1 / A_11 B_11 / A_11 B_12 / Now we consider the A_11 B_11 leaf. / There is no edge whose target is A_111 or B_111 so they both advance / A_11 B_11 / A_111 B_111 / However, there is an edge that stops the next advance for A_. / B_1111< A_1111, A_1112; / So, we get / A_11 B_11 / A_111 B_111 / A_111 B_1111 / A_1111 B_1111 / A_1112 B_1111 / and eliminate the edge / More comments are above constructevtree. / To show where cycles come from, start from a 1 in the diagonal / of the transitive closure matrix and show the steps of the cycle, / annotating with the species from which each edge comes from, / forgetting about edges that derive from ancestor descendant relationships. / Dec. 30, 2003: to get more insight, we want to do two things: / first see which families would give us a tree for all species / (family subset). / Second show how the cross-hybridization could work given a whole set / of families (cross-hybridization). / March 15, 2004: naturetree is a flag that, when set, says we are not on the / original data set. / Family subset: select a family then keep building up and see when / you get all species still in a tree. / Cross-hybridization: To see how that works, for each outspecies, find / a small set of labeled parents whose union contains all the family / specifiers of the outspecies. / e.g. if outspecies is A_22 and B_21 and their already exists two inspecies / (species in a tree) that have A_11 B_21 and A_22 B_11, then combine those. / Now it is also possible that the outspecies has elements that no / inspecies has. For example, the inspecies have / A_11 B_21 C_1 / A_22 B_11 C_2 / whereas the outspecies has A_22 B_11 C_22. In that case, / we put together the parents and then derive further. / It is also possible that the outspecies has A_22 B_11 C_3. / In that case, we must derive from a node that has only C_. / I'm not sure how to guarantee consistency. / Let's try this example / S1: A_11 B_21 C_1 / S2: A_22 B_11 C_2 / S3: A_22 B_21 C_3 / So edges are A_22 < C_2, C_3 / B_21 < C_1, C_3 / B_21 < A_1, A_2 / So, we have a cycle. / Now, we do the following: / A_ B_ C_ / A_1 B_2 C_1 / A_11 B_21 C_1 / A_2 B_1 C_2 / A_22 B_11 C_2 / Now to put these together to get S3, we have a problem. / We seem to need three species: A_11 B_21 C_1, A_22 B_11 C_2, and A_ B_ C_. / What we should do therefore is include all outdependencies / i.e., from the inspecies to the outspecies. / This (including B_21 < C_1, C_3) / would give us the following derivation: / A_ B_ C_ / A_1 B_2 C_ / A_11 B_21 C_ / A_11 B_21 C_1 / A_2 B_1 C_2 / A_22 B_11 C_2 / and then we could derive from the missing species A_11 B_21 C_ / and A_22 B_11 C_2 to get S3 (A_22 B_21 C_3). / What are the precise criteria for determining whether a set S of / species from the evolutionary tree (inspecies) can derive an outspecies t. / 1. For each family-specifier f in t / (e.g. A_2132 is the family specifier and A_ is the family), / there exists an s in S such that either f / is in s or a prefix pf of f is in s and no proper suffix of pf / that is a prefix (proper or improper) / of f is anywhere else in the evolutionary tree. / 2. I think it is also desirable to ensure that no species in S is a prefix / of any other species in S (i.e. each family specifier in the first one / is a prefix of a family specifier in the second with one such prefix / being proper). / We must explore this notion of outdependency further given this. / Say we have: / S1: A1 B1 / S2: A1 B2 / S3: A2 B1 / S3 could be the outspecies, but we don't need to do anything special / because its family specifiers are already S1 and S2 / (though there may be an optimization so we don't need too many inspecies). / But we do need something in this case. / S1: A1 B1 C1 / S2: A1 B2 C1 / S3: A2 B1 C2 / For S1 and S2, we get / A_ B_ C_ / A1 B_ C1 / A1 B1 C1 / A1 B2 C1 / I think that inherently we need an interbreeding with a top level guy / in this case. / / So, let's see whether we can figure out what / our algorithm (unimplemented as of Jan. 1, 2004) for finding the / graph of life is: / 1. take the species we want (call it t) with all its family specifiers. / 2. for each family specifier in t, find the species in the evolutionary / tree with the maximum prefix of that family specifier. / There could be several. / 3. So we have a hitting set kind of problem. Associated with each / species s in the evolutionary tree, we associate the family specifiers in / t that s covers. We choose the smallest such set of species. / / BASIC ROUTINES / find difference between list[0] and list[1] listdiff:{[list] :differ[list[0]; list[1]] } / returns one if x is a subset of y subset:{[x; y] i: y ?/: x : ~ (#y) _in i } / returns one if x is a subset of y subset:{[x;y] (#y) > |/ y ?/: x} differ:{[x;y] x,: () y,: () i: y ?/: x j: & i = #y :?x[j] } / A faster difference, yielding indexes in x that differ from y differindexes:{[x;y] i: y ?/: x j: & i = #y :j } /finds intersection of two lists / fastest of all intersect: {[x;y] x,: () y,: () i: x ?/: y :x[(?i) _dv #x] } /finds intersection of two lists / fastest of all intersect: {[x;y] x,: () y,: () if[(#x) < (#y) i: x ?/: y j: & i < #x :x[?i[j]] ] i: y ?/: x j: & i < #y :y[?i[j]] } /finds intersection of two lists / fastest of all hasintersect: {[x;y] x,: () y,: () if[0 = (#x) & (#y); :0] i: x ?/: y : (&/i) < #x } / x is a proper subset of y propersubset:{[x;y] x,: () y,: () if[~ (#x) < (#y); :0] / must be smaller :(#x) = (#intersect[x;y]) } / x is a proper subset of y propersubset:{[x;y] x,: () y,: () if[~ (#x) < (#y); :0] / must be smaller :subset[x;y] } /finds indexes in x and y that intersect / If x and y are both sets, then the results will be of the same length / fastest of all intersectindexes: {[x;y] i: x ?/: y / where each y hits j: & i < #x / those ys that hit :(i[j];j) } /finds indexes in x that intersect with y intersectleftindexes: {[x;y] i: x ?/: y / where each y hits j: & i < #x / those ys that hit :i[j] } / allows duplicates on left / and on the right intersectleftindexes: {[x;y] i: y ?/: x / where each x hits j: & i < #y / those xs that hit :j } /finds intersection of two lists / and returns index pairs of matches. Assumes no duplicates / in either list intersectbothindexes: {[x;y] x,: () y,: () i: x ?/: y pairs: (i ,' (!#y)) k: & pairs[;0] < #x :pairs[k] } / finds intersection of two lists that may have duplicates bagintersectbothindexes:{[x;y] alreadyused: out: () i: 0 while[i < #x my: x[i] jj: & y ~\: my jj: differ[jj; alreadyused] if[0 < #jj out,: ,(i;*jj) alreadyused,: *jj ] i+: 1 ] :out } / intersect many lists multiintersect:{[lists] size: #lists if[2 > size; :lists] first: lists[0],() jj: ,/ ?:' first (?/:)/: lists[1+ !(size-1)] / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry kk: & x = size - 1 :first[kk] } / this is a set intersection so we remove duplicates multiintersect:{[lists] size: #lists if[2 > size; :lists] first: lists[0],() jj: first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry kk: & x = size - 1 :first[kk] } avg:{(+/ x) % # x} var:{avg[_sqr x] - _sqr avg[x]} std:{_sqrt var[x]} cov:{avg[x * y] - avg[x] * avg[y]} corr:{ (cov[x;y])%((std[x]) * (std[y]))} / delay based search corrdelay:{[delay;x;y] x: (-delay) _ x y: delay _ y (cov[x;y])%((std[x]) * (std[y]))} / END BASICS / DUMP TABLE dumptable / formstring takes a list and makes a string formstring:{[list] list,: () : (-1) _ ,/ ($list) ,\: (" ") } formstringvertbar:{[list] list,: () : (-1) _ ,/ ($list) ,\: ("|") } formstringcomma:{[list] list,: () : (-1) _ ,/ ($list) ,\: (",") } / Output a table (a variable) to a text file outfile (string) / e.g. output[`guide; guide; "foobar"] dumptable:{[tablename; table; outfile] out: ,("# "), ($tablename), ("|"), formstringvertbar[!table] first: *!table numofelements: . ("#"), ($tablename), ("."), ($first) i: 0 while[i < numofelements list: table[;i] x: formstring'list / out,: , ($x[0]), (": "), (-1) _ ,/(1 _ x) ,\: (" ") out,: ,formstringvertbar[x] i+: 1 ] outfile 0: out } dumpmatcsv:{[mat; outfile] if[`disregard _in mat[0] mat: (-1) _' mat ] othernums: 1 _ !#mat[;0] i: 1 while[i < #mat[;0] mat[i]: (` $ $ i),mat[i] i+: 1 ] mat[0]: (`num), mat[0] out: , formstringcomma[mat[0]] / first: *!table numofelements: #mat[;0] i: 1 while[i < numofelements list: mat[i] x: formstring'list out,: , (-1) _ ,/x ,\: (",") i+: 1 ] outfile 0: out } / END DUMP TABLE / DATA / these are the output trees evolvetree.treeid: () evolvetree.parentid: () evolvetree.childrenid: () evolvetreelabel.treeid: () evolvetreelabel.nodeid: () evolvetreelabel.label: () evolvetreelabel.endspecies: () / end species below this one / contracted tree so that if a node has only one child, then / its children become the node's children contractevolvetree.treeid: () contractevolvetree.parentid: () contractevolvetree.childrenid: () contractevolvetreelabel.treeid: () contractevolvetreelabel.nodeid: () contractevolvetreelabel.label: () contractevolvetreelabel.endspecies: () / end species below this one tree.treeid: () tree.parentid: () tree.childrenid: () treelabel.treeid: () treelabel.nodeid: () treelabel.label: () / for each node give its preorder number and postorder number and level / where 0 is the root treestats.treeid: () treestats.nodeid: () treestats.preid: () treestats.postid: () treestats.level: () treestats.parent: () treestats.label: () treestats.outline: () / root is treeid, then treeid_0 / then treeid_121 for example alltree.treeid: () alltree.parentid: () alltree.childrenid: () alltreelabel.treeid: () alltreelabel.nodeid: () alltreelabel.label: () alltreestats.treeid: () alltreestats.nodeid: () alltreestats.preid: () alltreestats.postid: () alltreestats.level: () alltreestats.parent: () alltreestats.label: () alltreestats.outline: () / fill the treestats table. / May not need this because we are going to try a different strategy / whereby we look at roots and see whether the paths between them / are the right ones. fillall:{[treeid] i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] j: & treelabel.treeid = treeid labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] treestats.treeid,: (#j) # treeid z: treelabel.nodeid[j] treestats.label,: labellabels["), ($mytarg),(" for "),mspit[yy[1]] i+: 1 ] :(z; edgeswithcredit) } spitedge:{[pair] (-4) _ ,/($pair) ,\: (" -> ")} / END OF DISPLAY. / APPLICATION SPECIFIC / code to parse phylogenetic tree info. / For example, ( (A B) (C D) E) is in outline format / blank / E / blank / A / B / blank / C / D / So, we have the following algorithm: each left paren starts a new / node that has label blank. The children are those until the balancing / right paren. phyloparse:{[file] lines: 0: file bracketmode: 0 / bracket mode is for reading the name of the tree foundstring: "bad string" bracketstring: "bad treename" treenum: 0 treecurloc: 0 / index in the tree table treelabelcurloc: 0 / index in the treelabel table stringflag: 0 / not currently processing a string i: 0 level: -1 while[i < #lines myline: lines[i] hasbracket:0 j: 0 while[j < #myline / brackets occur at the end of the line if[myline[j] = ("[") bracketmode: 1 bracketstring: "" hasbracket: 1 ] if[myline[j] = ("]") ii: & tree.treeid = treename / tree.treeid[ii]:: ` $ ("tree"), (bracketstring) tree.treeid[ii]:: ` $ (bracketstring) ii: & treelabel.treeid = treename / treelabel.treeid[ii]:: ` $ ("tree"), (bracketstring) treelabel.treeid[ii]:: ` $ (bracketstring) bracketmode: 0 bracketstring: "bad treename" ] if[myline[j] = ("(") newtreeflag: level = -1 level+: 1 / brings it to level 0 from -1 if[newtreeflag nodenum: 0 treenum+: 1 treename: :[hasbracket ` $ bracketstring ` $ ("tree"),($treenum)] treecurloc: ,#tree.treeid tree.treeid,: treename tree.parentid,: nodenum tree.childrenid,: ,() treelabelcurloc: ,#treelabel.treeid treelabel.treeid,: treename treelabel.nodeid,: nodenum treelabel.label,: ` nodenum+: 1 ] if[~ newtreeflag tree.childrenid[treecurloc[(#treecurloc)-1]],: nodenum treecurloc,: #tree.treeid treelabelcurloc,: #treelabel.treeid tree.treeid,: treename tree.parentid,: nodenum tree.childrenid,: ,() treelabel.treeid,: treename treelabel.nodeid,: nodenum treelabel.label,: ` nodenum+: 1 ] ] if[stringflag & (myline[j] _in (") ")) / either right paren or blank tree.childrenid[treecurloc[(#treecurloc)-1]],: nodenum treelabel.treeid,: treename treelabel.nodeid,: nodenum treelabel.label,: ` $ foundstring nodenum+: 1 stringflag: 0 / no longer processing a string foundstring: "bad string" / reinitialize ] if[myline[j] = (")") / important that it goes after string level-: 1 treecurloc: (-1) _ treecurloc treelabelcurloc: (-1) _ treelabelcurloc ] x: _ic[myline[j]] if[(((x > 47) & (x < 58)) | ((x > 64) & (x < 91)) | ((x > 96) & (x < 123))) / we have a character if[stringflag = 1 foundstring,: myline[j] ] if[(stringflag = 0) & (bracketmode = 0) foundstring: ,myline[j] stringflag: 1 ] if[(bracketmode = 1) bracketstring,: myline[j] ] ] j+: 1 ] i+: 1 ] } / give an outline label to a node in the derivation of a particular / gene family. laboutline:{[treeid] ii: & treestats.treeid = treeid labels: ($treeid),("_") j: 0 while[j < #ii k: ii[j] if[treestats.parent[k] > (-1) par: treestats.parent[k] kk: & (tree.treeid = treeid) & (tree.parentid = par) kids: tree.childrenid[*kk] num: kids ? treestats.nodeid[k] / position among kids if[num = #kids; !-11] / error should find that kid kk: & (treestats.nodeid = par) & (treestats.treeid= treeid) if[~ 1 = #kk; !-12] / error should be one kk: *kk treestats.outline[k]: delend[treestats.outline[kk]],($1+num) ] j+: 1 ] } / get rid of blanks at either end of the string delend:{[string] if[0 = #string; :""] if[string ~ ,"" ; :""] i: & ~ string = " " if[(#string) = (#i); :string] if[0 = (#i); :""] string: (- ((#string) - (1 + *|i))) _ string :(*i) _ string } / characterize the species by what they contain in each gene family charspecies:{[] ii: & ~ treestats.label = ` part: = treestats.label[ii] uniqs: ?treestats.label[ii] outlabs: ` $' treestats.outline[ii[part]] :uniqs,' outlabs } / sp1 and sp2 are two species. / We do this as follows: we look for pairs of families / that they both have (a family is characterized by a tree). / So we are looking for trees that they both have. / We look at the values of the two species for the first tree / then we look at the values for the second tree. / sp1: X Y / sp2: X' Y' / We have two relationships: / last(X,X') < first(Y,Y') / last(Y,Y') < first(X, X') / We care however only if the "last" function returns something other than / the basic tree (e.g. tree12_2 not just tree12_). / We return these two species as well as the edges we found. / Provenance means that this pair proves the point. / Why can we go pair or species by pair of species? / Suppose we have / S1: A1 B1 C1 / S2: A1 B1 C2 / S3: A2 B2 / This doesn't imply that A1 < B2. We can simply have the derivation / A B C / A1 B1 C / A1 B1 C1 / A1 B1 C2 / A2 B2 / On the other hand let's say we have / S1: tree2_11 tree12_221 / S2: tree2_2221 tree12_2222 / For tree12_22 to evolve just once, we must have tree2_1 and tree2_2 / evolve later. processpair:{[pair] sp1: pair[0] sp2: pair[1] t1: findtrees[sp1] t2: findtrees[sp2] inboth: intersect[t1; t2] pairs: ,/inboth ,\:/: inboth if[2 > #pairs; :()] pairs@: & pairs[;0] < pairs[;1] :,/findedges[sp1; sp2]'pairs } / find the trees in a species (these correspond to traits) / Just the tree names themselves findtrees:{[sp] i: species[;0] ? sp :trimmedtrees[i] } findedges:{[sp1; sp2; treepair] i: species[;0] ? sp1 kk: 1 + intersectleftindexes[trimmedtrees[i]; treepair] t1: species[i;kk] i: species[;0] ? sp2 kk: 1 + intersectleftindexes[trimmedtrees[i]; treepair] t2: species[i;kk] out: () last1: lastalike[t1[0]; t2[0]] first2: firstdiffer[t1[1]; t2[1]] if[~ ("_" = *|$last1 ) / we are interested only if last in common is beyond `tree3_ say. xx: last1 ,/: first2 yy: ((#xx) # sp1),' ((#xx) # sp2) out,: xx,'yy ] last2: lastalike[t1[1]; t2[1]] first1: firstdiffer[t1[0]; t2[0]] if[~ ("_" = *|$last2) xx: last2 ,/: first1 yy: ((#xx) # sp2),' ((#xx) # sp1) out,: xx,'yy ] / if[0 < #out; !-4] :out } / given two family members, determine whether the first is a prefix / of the second. isprefix:{[f1; f2] x1: $f1 x2: $f2 if[(#x1) > (#x2); :0] :x1 ~ (x2[!#x1]) } isproperprefix:{[f1; f2] x1: $f1 x2: $f2 if[((#x1) > (#x2)) | ((#x1) = (#x2)); :0] :x1 ~ (x2[!#x1]) } / get previous family member prev:{[f] x: $f x: (-1) _ x : ` $ x } / given two family specifiers, e.g. tree1_222 and tree1_232, / (here, tree1 represents one gene family and _222 is a derivation / from that) / lastalike will find the greatest common prefix of the two. / For this example, that would be tree1_2 lastalike:{[f1; f2] x1: $f1 x2: $f2 x1@: !((#x1) & (#x2)) x2@: !((#x1) & (#x2)) :` $ x1[! ((x1=x2) ? 0)] } / find the first place where they differ / and print out the pair firstdiffer:{[f1; f2] x1: $f1 x2: $f2 x1@: !((#x1) & (#x2)) x2@: !((#x1) & (#x2)) i: ((x1=x2) ? 0) out: () if[i < #x1; out,: ` $ x1[!i+1]] if[i < #x2; out,: ` $ x2[!i+1]] :out } trimtotree:{[symb] x: $symb k: x ? "_" :` $ x[!k] } / given a bunch of edges / add the implicit edges based on subscripts, / e.g. if you see the symbol `tree10_112, then add / `tree10_1 `tree10_11 / `tree10_11 `tree10_112 expand:{[pairs] if[0 = #pairs;:()] x: ? pairs[;0],pairs[;1] out: ,/expandone'x :?out } / e.g. if you see the symbol `tree10_112, then add / `tree10_1 `tree10_11 / `tree10_11 `tree10_112 expandone:{[symb] out: () x: $symb i: 2 + (x ? "_") / e.g. for "tree10_21" i would be 8 if[(#x) < (i+1); :()] / of the form tree10_2 cur: ` $ x[!i] while[i < #x next: ` $ x[!i+1] out,: ,(cur; next) cur: next i+: 1 ] :out } / transitive closure / build matrix and then do multiplication as needed findcycles:{[edges] mat: buildmatrix[edges] tmpmat:: mat :transclos[mat] } / find transitive closure of adjacency matrix transclos:{[mat] flag: 1 while[flag newmat: () tmat: +mat i: 0 while[i < #mat yy: |/'mat[i] */: tmat newmat,: ,yy |' mat[i] i+: 1 ] flag: ~ mat ~ newmat mat: newmat ] :mat } / build edge matrix buildmatrix:{[edges] els: ?edges[;0],edges[;1] num: #els mat: (num; num) # 0 i: 0 while[i < #edges pair: edges[i] j1: els ? pair[0] j2: els ? pair[1] mat[j1;j2]: 1 i+: 1 ] :mat } findfirst:{[listofspecies] : *:' listofspecies } preferredorder: `soma `long `herb `turn `raim `hirsD `davi `tril `schw `goss / ??? TO BE USED TO SPECIFY PREFERENCE FOR SPECIES / TO BE IN A TREE (FIRST ONES GET PREFERENCE) naturetree: 1 / if doing results for nature paper if[naturetree; preferredorder: ()] / used only for species rearrange:{[list] if[0 = #preferredorder; :list[(#list) _draw -#list]] ii: preferredorder ?/: list[;0] :list @ < ii / ii: list[;0] ?/: preferredorder / :list[ii[& ii < #list]] } / LAY OUT TREE / So, we get the following algorithm. / In the current context (meaning a current leaf e.g. A_111, B_111), / a family (e.g. the A_ family can advance to / A_1112) can advance to the next level if the result (A_1112) is not / the target of any dependency edge. If the result / is a target of a dependency edge, then the family must wait. / When the source of an edge is removed, then the edge is removed. / However there is a further constraint: / We associate with each current derivation the set of species / that it will go to. / That is called the derivation set. / Associated with each of those species we have a vector for each / family saying what the progression needed is for that family. / If a family is not part of a species we establish a wildcard for that family / Then we evaluate which species are served by an advance of family f / to a particular family specifier say f_i. / The property we want is that when we partition the species into / subsets P1 and P2, no family specifier appears in both. (KEY PROPERTY) / (Note that it is allowed for incompatible specifiers to appear in / one partition. Neither will be expressed now.) / The reason is that such a situation would lead to duplicate evolution. / For each possible family specifier (non-wild card specifier / that is allowed as far as dependency edges) / get the sets of species it goes to (goodsets). / Also find the sets it is forbidden to go to (badsets). / To compute badsets, take all species and subtract the sets that the species / specifier belongs to union the wildcards. / So, now I have specifier -- goodsets, badsets. / A partition should either cover goodsets and badsets in which case / we don't advance that one in that partition. / Or it covers only the goodsets in which case we do advance. / / Let's try an example: / S1: A11 B1 C2 / S2: A1 B2 D1 / S3: A2 B32 D2 / Then we start with A B C D and associate that with S1, S2, S3 the / derivation set (curdestinies holds the indexes to these) / S1 has advances: A1 A11; B1; C2; D* / S2: A1; B2; C*; D1 / S3: A2; B3 B32; C*; D2 / Now the dependencies are A1 < B1, A1 < B2. / So, neither B1 nor B2 can be tried. / Let's consider A1: goodset is S1, S2; badset is S3. / C2: goodset is S1; badset is empty. / B3: goodset is S2; badset is S1, S2. / Because there is a specifier (C2) that has an empty badset, we advance / that one and take the species to be {S1, S2, S3, S4}. / We advance nothing else because they have both goodsets and badsets in / that partition. / A B C2 D / Now, / S1 has advances: A1 A11; B1; D* / S2: A1; B2; D1 / S3: A2; B3 B32; D2 / So, A1 has goodset: S1, S2 and badset: S3 / B1 and B2 are still not possible. B3 has goodset: S3 and badset: S1, S2 / D1 has goodset S2 and badset S3. / D2 has goodset S3 and badset S2. / If we try to partition into {S1, S2} and {S3} / A1 advances in the first partition as does D1. / A1 B C2 D1 / and eliminate the dependency from A1 to B1 and B2 (it is no longer / necessary once A1 appears). / Eventually B3 and D2 can advance in S3 / so advance to S3: A2 B3 C2 D2 / Now, continuing with A1 B C2 D1 / / we get S1: A11; B1 / S2: B2 / So break these apart to give usthe derivation tree. / A11 goodset is S1 and badset is S2 (note that there is no wildcard for / A in S2, just no element left). Similarly B1 has a goodset in S1 and / a badset in S2. / Let's review another example: . / S1: A_1111 B_1111 / S2: A_1112 B_1111 / S3: A_1121 B_1211 / S4: A_1122 B_1211 / Initial leaf in evolutionary is A_ B_ / Dependency Edges: A_11 < B_11, B_12; / B_1111< A_1111, A_1112; / B_1211 < A _1121, A_1122 / curadvances: / S1: A_1 A_11 A_111 A_1111 / B_1 B_11 B_111 B_1111 / S2: A_1 A_11 A_111 A_1112 / B_1 B_11 B_111 B_1111 / S3: A_1 A_11 A_112 A_1121 / B_1 B_12 B_121 B_1211 / S4: A_1 A_11 A_112 A_1122 / B_1 B_12 B_121 B_1211 / No dependency for A_1 or B_1 and these are everywhere so no bad sets / A_ B_ / A_1 B_1 / and all species are part of that partition. / Now, only A_11 is possible and it is everywhere / A_ B_ / A_1 B_1 / A_11 B_1 / We still have all species in a single paritition. / The dependency edges: A_11 < B_11, B_12 / no longer apply. / Now we can advance B_11 and B_12 but we do these in groups so / S1 and S2 stay together as do S3 and S4. / A_ B_ / A_1 B_1 / A_11 B_1 / A_11 B_11 (S1, S2) / A_11 B_12 (S3, S4) / We get two children here because we are going in two different directions. / Now we consider the A_11 B_11 leaf. / There is no edge whose target is A_111 or B_111 so they both advance / A_11 B_11 / A_111 B_111 / However, there is an edge that stops the next advance for A_. / B_1111< A_1111, A_1112; / So, we get / A_11 B_11 / A_111 B_111 / A_111 B_1111 / Eventually, this will allow us to advance A_ too. / A_1111 B_1111 / A_1112 B_1111 / and eliminate the edge. / We will generate species names miss0 etc and give their / species entries if we reach the bottom. / Some other notes: basic procedure is to go down breadth first. / So we keep a list of all stages taken (e.g. A_11 B_2) currentevolve / and the indexes of the currently active leaves. / For filtering purposes we associate with each place in currentevolve / the remaining dependency edges that remain (lots of space) in curmyedges / and the end indexes that this position will become in curdestinies. constructevtree:{[fulldisplay; curspecies; alledges] myedges: ? alledges evtreeid: ` $ spitunder[curspecies] nodenum: 0 ends: findendpoints'curspecies / find all the family results we want / e.g. tree12_2113 alltrees: () i: 0 / ends are the list of families with their suffixes. while[i < #ends x: ends[i] y: trimtotree'x jj: < y ends[i]@: jj alltrees,: y i+: 1 ] alltrees?: / all the possible trees endall: ,/ends starts: ?,/findtrees'curspecies / root node starts@: < starts missnum: 0 x: ` $ ("miss"),($missnum) missnum+: 1 currentevolvelab: :[fulldisplay ` $ spitcomma[ x, starts] x] starts: addunder'starts / add underbars currentevolve: ,starts curmyedges: ,myedges curadvances: constructdervec[alltrees]'ends / gives vectors of derivations / For each species, get a list of lists. / each of which is a tree and the derivation / of that tree to the end: T T_1 T_12 T_121 curdestinies: ,!#ends / these are indexes into curadvances leaves: ,0 / indexes in currentevolve where the trees are / initializing these local variables evolvetree_treeid: evtreeid, () evolvetree_parentid: nodenum, () evolvetree_childrenid: ,() / place holder evolvetreelabel_treeid: evtreeid, () evolvetreelabel_nodeid: nodenum, () evolvetreelabel_label: currentevolvelab, () nodenum+: 1 / get the next leaf, consider the dependencies and consider the / ancestor vecs. / If it is approaching only one end, then have it go that way as before. / Otherwise, first find the subset of advances that are possible / given the edges. / Then evaluate the advance of each family to see which species it / covers. Take only maximal ones and advance those. / The others become a separate leaf with their own advances. / We then see whether any of those is the target of an edge. foundend: () while[0 < #leaves first: *leaves leaves: 1 _ leaves my: currentevolve[first] / the family elements of that leaf / e.g A_12 B_2 C_121 firstedges: curmyedges[first] / dependency edges / tells us which ends (final species) my is prefix of / returns 1 for those that this is prefix of / returns 2 if this is equal to an end zz: curdestinies[first] / which ends we are near kk: & subset[; my]'ends if[hasintersect[kk; foundend] ` 0: ,"Unusual event: an existing species is not a leaf" ] / if[0 < #kk / change its label if[(0 < #kk) & (~ subset[kk;foundend]) / change its label foundend,: kk / ends[kk] should match my iqlabel: evolvetreelabel_nodeid ? first / it must already be in x: curspecies[(*kk)] currentevolvelab: :[fulldisplay ` $ spitcomma[ x, my] x] evolvetreelabel_label[iqlabel]: currentevolvelab ] ii: differ[curdestinies[first]; foundend] / do not reprocess these if[0 < #ii / Now we process advance and take care of dependencies. / This node will have a child. iq: evolvetree_parentid ? first if[iq = #evolvetree_parentid / need to add this leaf evolvetree_treeid,: evtreeid evolvetree_parentid,: first evolvetree_childrenid,: ,() ] allnext: ,/ *:' ,/curadvances[ii] / ??? What happens when these lists go empty? allnext: differ[allnext; trimtotree'allnext] / Should be all next steps for these end species. tmpfirstedges: firstedges / Now determine which of the next is allowed. / allnextall: ?,/allnext x: () if[0 < #tmpfirstedges x: () xjj: 0 while[xjj < #allnext xmm: & (tmpfirstedges[;1]) = (allnext[xjj]) if[0 < #xmm others: allnext _di xjj xqq: 0 isbad: 0 while[(isbad = 0) & (xqq < #others) xz:|/isprefix[others[xqq]]'tmpfirstedges[xmm;0] / if xz=1 then others[xmm] is dependent on / others[xqq] that is also an advance isbad|: xz xqq+: 1 ] if[isbad; x,: allnext[xjj]] ] xjj+: 1 ] ] xnext: differ[allnext; x] / which families can advance e.g. A_1 to A_12 / as far as dependencies are concerned pair: findmaximal[curadvances; ii;xnext] / get the most ubiquitous legal advances / and say which of the elements of ii they belong to xnext: pair[0] nextii: ii[pair[1]] / nextii are the species that will advance. / Now figure out what remains. oldii: differ[ii; nextii] if[0 < #oldii / don't change anything except number of endpoints curdestinies,: ,oldii curmyedges,: ,tmpfirstedges / these haven't changed xnum: (#curdestinies) - 1 leaves,: xnum currentevolve,: ,my / now place in data structures of record evolvetree_childrenid[iq],: xnum evolvetreelabel_treeid,: evtreeid evolvetreelabel_nodeid,: xnum x: ` $ ("miss"),($missnum) missnum+: 1 currentevolvelab: :[fulldisplay ` $ spitcomma[ x, my] x] evolvetreelabel_label,: currentevolvelab ] / Now take care of dependency edges for branch that advances / Remove anything that is hereby advanced if[0 < #tmpfirstedges qq2: intersectleftindexes[tmpfirstedges[;0]; xnext] tmpfirstedges: tmpfirstedges _di qq2 ] / Now take care of new stuff if[0 = #nextii; !-15] / this would mean no advance at all? y: mixandmatch[my; xnext] / figure out where we should / be for each family y@: < y alreadyin: y _in currentevolve / if[y _in currentevolve; yy: currentevolve ? y; !-15] if[alreadyin !-16 / should never create a node with same family / members except for the unchanging part yy: currentevolve ? y if[~(#curmyedges[yy])=#intersect[tmpfirstedges;curmyedges[yy]] alreadyin: 0 / condition fires when curmyedges[yy] is a / proper superset of tmpfirstedges. / This means edges haven't all been handled / in previous one (the yy one), / probably because of missing families / So in this case, add this as a new member of / currentevolve. ] if[(~ yy _in leaves) & alreadyin; !-14; leaves: yy,leaves] ] if[~ alreadyin curmyedges,: ,tmpfirstedges / these edges will no longer apply / for descendants of this leaf. currentevolve,: ,y curdestinies,: ,nextii curadvances[nextii]: trimadvance[xnext]'curadvances[nextii] curmyedges,: ,tmpfirstedges / these haven't changed xnum: (#curdestinies) - 1 leaves,: xnum / now place in data structures of record evolvetree_childrenid[iq],: xnum evolvetreelabel_treeid,: evtreeid evolvetreelabel_nodeid,: xnum x: ` $ ("miss"),($missnum) missnum+: 1 currentevolvelab: :[fulldisplay ` $ spitcomma[ x, y] x] evolvetreelabel_label,: currentevolvelab ] ] ] evolvetree.treeid,: evolvetree_treeid evolvetree.parentid,: evolvetree_parentid evolvetree.childrenid,: evolvetree_childrenid evolvetreelabel.treeid,: evolvetreelabel_treeid evolvetreelabel.nodeid,: evolvetreelabel_nodeid evolvetreelabel.label,: evolvetreelabel_label x: findend[evolvetree_parentid; evolvetree_childrenid; evolvetreelabel_nodeid; evolvetreelabel_label; curspecies] evolvetreelabel.endspecies,: x } findend:{[parent; children; node; label; ends] slabel: shorten'label out: (#node) # ,() ii: slabel ?/: ends out[ii]: ,:' ends partodo: !(#parent) marks: () j: 0 while[j < #children xx: (#children[j]) # 0 qq: intersectleftindexes[children[j]; nodes[ii]] xx[qq]: 1 marks,: ,xx j+: 1 ] while[0 <#partodo j: 0 todelete: () while[j < #partodo k: partodo[j] if[1 = &/marks[k] todelete,: k out[parent[k]],: ?,/out[children[k]] qq: & (parent[k]) _in/: children if[1 < #qq; !-13] / not a tree if[0 < #qq r: (children[*qq]) ? (parent[k]) marks[(*qq);r]: 1 ] ] j+:1 ] partodo: differ[partodo;todelete] ] :out } / alltrees has all the families trimmed / endsig has the definition of a species e.g. A_12 B_23 C_1 / We try to form the advance vectors for this, starting from alltree / and extending to the last each element of endsig. / If some element of alltrees is not represented in endsig then / represent that one as a wildcard. / e.g. if alltrees is A_ B_ C_ D_ / then get A_ A_1 A_12 / B_ B_2 B_23 / C_ C_1 / D_ constructdervec:{[alltrees;endsig] out: () sigtrees: trimtotree'endsig wilds: differ[alltrees;sigtrees] out,: ,:'wilds out,: splayout'endsig :out } / given a specifier e.g. A_1231 print out A_1 A_12 A_123 A_1231 splayout:{[mysig] x: $mysig i: x ? "_" i+: 2 :` $ x @ !:'[i+ !(1 + (#x) - i)] } / look at curadvances[destinyindexes] and for each element in xnext / see where that one is present (or matches a wild card). / Form goodset and badset for each element of xnext. / A safe partition P has the property that for each element e of nextset, / either goodset(e) is a subset of P / or goodset(e) is disjoint from P. / P is live for e (there is progress) if there is some e such that / goodset(e) is a subset of P and badset(e) is disjoint from P. / Return a safe partition that is live for at least one element. / (A partition that is live for at least one element is called lively.) / If we return a safe partition Q, then we also return all elements live / for Q. / Algorithm: Start with all species. / If safe and lively, then return. / Else remove one by one in breadth first manner and test. / / Return the elements of xnext that / are live and / indexes of the match (indexes into destinyindexes) / These are a pair. findmaximal:{[curadvances; destinyindexes;xnext] myadvances: curadvances[destinyindexes] all: !#destinyindexes / indexes into destinyindexes. They represent / the universe of species potnextones: () i: 0 while[i < #myadvances x: myadvances[i] qq: & 0 < #:'x / at least one element potnextones,: ,x[qq;0] i+: 1 ] goodsets: () wildsets: () i: 0 while[i < #xnext my: xnext[i] goodsets,: ,& my _in/: potnextones mytree: trimtotree[my] wildsets,: ,& mytree _in/: potnextones i+: 1 ] outsets: goodsets,'wildsets / What we want is a set that has all instances / of an advancing specifier (e.g. A_12) and no instance of an / incompatible specifier (e.g. A_11). badsets: differ[all]'outsets alreadytried:: () pair: findsafeandlively[goodsets; badsets; all] if[0 = pair[0]; !-14] / found nothing safe and lively bestset: pair[1] jj1: & subset[;bestset]'goodsets jj2: & hasintersect[;bestset]'badsets kk: differ[jj1; jj2] / Now check: any element whose goodset intersects bestset / should have the property that its goodset is a subset of bestset jj3: & hasintersect[bestset]'goodsets if[0 < (#differ[jj3; jj1]) ; !-15] / result is not safe :(xnext[kk]; bestset) } / findsafeandlively sees whether the candidate is safe and lively / A safe partition P has the property that for each element e of nextset, / either goodset(e) is a subset of P / or goodset(e) is disjoint from P. / P is live for e (there is progress) if there is some e such that / goodset(e) is a subset of P and badset(e) is disjoint from P. / Return a safe partition that is live for at least one element. / (A partition that is live for at least one element is called lively.) findsafeandlively:{[goodsets; badsets; candidate] if[0 = #candidate; :(0;())] alreadytried,: ,candidate kk: & ~ hasintersect[candidate]'badsets / candidates for liveness / get the guys whose badsets are disjoint from candidate if[0 < #kk / there is some liveness jj1: & ~ subset[;candidate]'goodsets / So these sets do not lie entirely within candidate jj2: & hasintersect[candidate]'goodsets / These sets overlap candidate if[0 = hasintersect[jj1; jj2] / so any goodset is either a subset or disjoint / A partial overlap would be in jj1 (because it wouldn't be / a completely contained though possibly improper subset) / and in jj2 because of the overlap. :(1; candidate) ] ] / if reached here then nothing live or not safe. j: 0 while[j < #candidate poss: candidate _di j poss@: < poss if[~ poss _in alreadytried pair: findsafeandlively[goodsets; badsets; candidate _di j] if[1 = pair[0]; :(1; pair[1])] ] j+: 1 ] :(0; ()) } / for families not in xnext, take the current family specifiers / For the rest, take the elements of xnext. mixandmatch:{[current; xnext] currenttrees: trimtotree'current nexttrees: trimtotree'xnext oldtrees: differ[currenttrees; nexttrees] jj: intersectleftindexes[currenttrees; oldtrees] :current[jj], xnext } / form a new advancelist that has the xnexts removed. trimadvance:{[xnext; advancelist] jj: intersectleftindexes[advancelist[;0]; xnext] advancelist[jj]: 1 _' advancelist[jj] :advancelist } / find out where each family must end up in order to match all species / in this particular evolution tree. / Just collect end results from each species in group findendpoints:{[myspecies] i: (species[;0]) ? myspecies : 1 _ species[i] } / UNUSED / is the current leaf in the right direction for this end point / if so, then return current relevant families (not all are necessarily / relevant because endpoint may have fewer families than current one) / and their next steps. isdir:{[curleaf;endpoint] curleaf@: < curleaf curfams: trimtotree'curleaf endfams: trimtotree'endpoint if[(#?curfams) < #curfams; !-11] / each family should be represented only once if[(#?endfams) < #endfams; !-12] / each family should be represented only once ii: intersectleftindexes[curfams; endfams] if[(#ii) < #endfams; :(0;())] / curfams should be a superset of endfams / because we start with all families and / then specialize to particular ones. / If not, then no way this is a prefix. x :&/ isprefix'[curleaf[ii]; endpoint] if[x = 0; :(0;())] / not a prefix y: findnext'[curleaf[ii]; endpoint] if[ (y[;0]) ~ (y[;1]); :(2; ())] / already done :(1; y[;0]; y[;1]) } / if we have tree3_12 and we want tree3_1221 then findnext will / return tree3_122 findnext:{[fam1; fam2] if[fam1 = fam2; :(fam1; fam1)] x1: $fam1 x2: $fam2 : (` $ x1; ` $ x2[!(#x1) + 1]) } / Given a set of treenodes, find the family specifiers associated with / each of them. / Contract a treeid in evolvetree so every node has at least two children / There is still a bug as of March 23 2004: we keep certain nodes near / leaves because of the grandfather clause. The constructdot routine / does this correctly. contract:{[treeid] if[treeid _in contractevolvetree.treeid; :()] / don't do it twice ii: & evolvetree.treeid = treeid myparentid: evolvetree.parentid[ii] mychildrenid: evolvetree.childrenid[ii] jj: & evolvetreelabel.treeid = treeid mylabelnodeid: evolvetreelabel.nodeid[jj] mylabel: shorten'evolvetreelabel.label[jj] myend: evolvetreelabel.endspecies[jj] if[0 &(0 < #interbreedneeded) qq: intersectleftindexes[mylabel; interbreedneeded] needednodes: mylabelnodeid[qq] j: 0 while[j < #needednodes vv: & (needednodes[j]) _in/: mychildrenid needednodes,: myparentid[vv] j+: 1 ] qq: intersectleftindexes[mylabelnodeid; needednodes] interbreedneeded:: mylabel[qq] ] invalid: () j: 0 while[j < #myparentid if[~ myparentid[j] _in invalid enough: 1 < #mychildrenid[j] if[enough / at least two kids contractevolvetree.treeid,: treeid contractevolvetree.parentid,: myparentid[j] contractevolvetree.childrenid,: ,mychildrenid[j] ] if[~ enough mykid: *mychildrenid[j] q: myparentid ? mykid grandchildflag: q < #myparentid q2: mylabelnodeid ? mykid needthis: (mylabel[q2]) _in interbreedneeded if[grandchildflag & (~ needthis) / the kid has its own children and is unneded / so splice out the kid invalid,: mykid mychildrenid[j]: mychildrenid[q] j-: 1 / will redo this ] if[needthis | (~ grandchildflag) / this contractevolvetree.treeid,: treeid contractevolvetree.parentid,: myparentid[j] contractevolvetree.childrenid,: ,mychildrenid[j] ] ] ] / invalid test j+: 1 ] jj: & evolvetreelabel.treeid = treeid mynode: evolvetreelabel.nodeid[jj] mylabel: evolvetreelabel.label[jj] myend: evolvetreelabel.endspecies[jj] kk: &contractevolvetree.treeid = treeid good: ?,/contractevolvetree.parentid[kk],contractevolvetree.childrenid[kk] qq: intersectleftindexes[mynode; good] contractevolvetreelabel.treeid,: (#qq) # treeid contractevolvetreelabel.nodeid,: mynode[qq] contractevolvetreelabel.label,: mylabel[qq] contractevolvetreelabel.endspecies,: myend[qq] } / FIND THE GRAPH OF LIFE / Find the graph of life: / 1. take the species we want (call it t) with all its family specifiers. / 2. for each family specifier in t, find the species in the evolutionary / tree with the maximum prefix of that family specifier. / There could be several. / 3. So we have a hitting set kind of problem. Associated with each / species s in the evolutionary tree, we associate the family specifiers in / t that s covers. We choose the smallest such set of species. / treenodes are the nodes in the evolutionary tree (perhaps contracted) / target is the target species. constructinterbreed:{[evolvetreelabel; inspecies; target] / localtree: contractevolvetreelabel / should not do this as / nodes necessary for interbreeding may thereby disappear localtree: evolvetreelabel / find family specifiers of target i: allspecies[;0] ? target targfam: 1 _ allspecies[i] / find family specifiers of each interior node of tree treename: ` $ spitunder[inspecies] ii: & localtree.treeid = treename x: getfieldscomma'$localtree.label[ii] inspeciesname: ` $ x[;0] inspeciesfamspecs: 1 _' ` $' x allfamspecs: ? ,/inspeciesfamspecs bestmatch: () j: 0 while[j < #targfam myfam: targfam[j] ii: & isprefix[;myfam]'allfamspecs lens: #:' $allfamspecs[ii] maxlen: |/lens jj: & lens = maxlen bestmatch,: ,(myfam; allfamspecs[ii[*jj]]) j+: 1 ] tolookfor: bestmatch[;1] / this is what we need in a species inspeciesmatches: () j: 0 while[j < # inspeciesname inspeciesmatches,: ,intersect[inspeciesfamspecs[j]; tolookfor] j+: 1 ] kk: & 0 < #:'inspeciesmatches candfam: inspeciesmatches[kk] candidates: inspeciesname[kk] / for each one find treenodes having maximum family specifiers / solve a hitting set problem. / output the target and the parent species and what you get x: hittingset[candidates; candfam; tolookfor] interbreedneeded,: x[;0] directfracs: (#:'x[;1]) % (+/#:'x[;1]) fracs: findfrac[localtree]'x[;0] x[;0]: x[;0] ,' fracs yy: ,/fracs part: = yy[;0] uniqs: ?yy[;0] ytmp: +/'yy[part;1] ytmp%: +/ytmp zz:uniqs,' ytmp :(target; targfam; zz; x; directfracs) / target is the target species, / targfam is the set of gene families / zz is the set of pairs / from which species + percentage of species / x has three fields: source node, fractions from end species and / genes that make this up. } / finds the percent contribution of each of findfrac:{[localtree; lab] i: (shorten'localtree.label) ? lab ends: localtree.endspecies[i] f: 1 % (#ends) :ends,\: f } / candidates are candidate species / candfam are the family specifiers in target that the candidate satisfies / tolookfor are all the family specifiers in the target hittingset:{[candidates; candfam; tolookfor] / try greed (could use simulated annealing) / find the biggest set of candfam / then subtract that from the others and take the next biggest set / keep going until you've covered tolookfor / candidate name and what you get from it. out: () currentcandfam: candfam left: tolookfor while[0 < #left counts: #:'currentcandfam maxnum: |/counts i: counts ? maxnum zz: currentcandfam[i] out,: ,(candidates[i]; zz) left: differ[left; zz] currentcandfam: differ[;zz]'currentcandfam ] :out } / END OF APPLICATION groupexpand: 1: "tmpgroups" if[0 = #_i ` 0: ,"k graphlife genefamilylistinphyloformat [list of trees]" ` 0: ,"e.g. k graphlife graphlife ndhF matK trnT" . "\\\\" ] x: phyloparse[_i[0]] x: fillall'?tree.treeid treestats.outline: ($treestats.treeid),\: ("_ ") backtoall[] / copy tree and treelabel to alltree and alltreelabel / selecttotree can take a list of families and put them into the tree / table. if[1 < #_i / pick out trees from command line / one tree (or gene family) per line. x: ` $ 1 _ _i ` 0: ,spitunder ` $ _i xx: selecttotree[x] ] x: laboutline'?tree.treeid species: charspecies[] counts: #:'species species@: & (counts > 2) / x: differ[charspecies[]; species] / if[(0 < #x ) & (~x[0;0] =`hirsA); !-13] / might be a data entry error / catch data entry levels if species don't appear often enough. / ??? So we are ignoring hirsA in Ken's 2003 data by using species alone trimmedtrees: trimtotree'' 1 _' species allspecies: species / find reasons for non-tree evolution / If set to 1 then outbad will give reasons in the loop below findreason: 0 / find maximal tree sets bestmaxspecies: () bestmaxedges: () docount: 3 if[0 < #preferredorder; docount: 1] / Tries several rearrangements of species and then gets a large tree do[docount maxsets: () maxedges: () loopflag: 1 while[ loopflag availspecies: differ[species; (,/maxsets)] availspecies: rearrange[availspecies] newset: ,availspecies[0] newedges: () i: 1 while[i < #availspecies myspecies: newset, ,availspecies[i] sppairs: ,/ myspecies[;0] ,/:\: myspecies[;0] sppairs@: & sppairs[;0] < sppairs[;1] x: processpair'sppairs y:,/x if[0 < #y provenance: y[;2 3] / this just means where the / dependency was discovered. We don't expect / a cycle among the species. / However, for every pair of species associated with / an edge, we expect to see evidence for the edge. / Also, each one of the cycles should involve the / latest species. Otherwise old newset was a problem. y: y[;0 1] ] z: y, expand[y] candedges: () if[0 < #y provenance,: ((#z) - (#y)) # ,(`anc; `desc) foundedges: z candedges: y / without expansion ] allnodes: () mat: () if[0 < #z allnodes: ? (z[;0]) , (z[;1]) mat: findcycles[z] ] goodflag: 1 outbad: () j: 0 while[j < #mat if[mat[j;j] = 1 mynode: ,allnodes[j] potindexes: (& tmpmat[j] = 1) _dv j xx: j _in/: &:' tmpmat[potindexes] / is j an index in the return if[findreason & (1 = |/xx) / two edge cycle / Note that there may be longer ones. mynode,: ,allnodes[potindexes[& xx = 1]] myedges: mynode[0] ,/: mynode[1] / zz: intersectleftindexes[foundedges; myedges] zz: foundedges ?/: myedges myedges2: mynode[1],\: mynode[0] / zz2: intersectleftindexes[foundedges;myedges2] zz2: foundedges ?/: myedges2 zzz: myedges ,' provenance[zz] zzz: zzz ,' myedges2 ,' provenance[zz2] / can get the results in zzz / in a cycle outbad,: zzz ] goodflag: 0 ] j+:1 ] if[goodflag newedges,: candedges newset,: ,availspecies[i] ] / if[goodflag=0; !-2] / look at outbad; may be empty if no short cycles i+: 1 ] maxsets,: ,newset maxedges,: ,newedges loopflag: ~ subset[species; (,/maxsets)] / loopflag&: 0 = #preferredorder / if 0 < #preferredorder, then stop ] bestmaxedges,: ,maxedges maxsetspecies: findfirst'maxsets bestmaxspecies,: ,maxsetspecies ] / bestmaxspecies out: () i: 0 while[i < #bestmaxspecies x: bestmaxspecies[i] allx: () j: 0 while[j < #x y: x[j] y: y[