/ Things to do as of Dec. 2000: look at findglobalcause / we have made subset flexible so that x is a subset of y if / card(x) <= card(intersect(x,y)) + alloweddif which is a parameter. / k bacteria allowedif / WARNING: NO DUPLICATE LABELS IN EITHER MICROORGANISMS or THE ab table. nonperfectflag: 1 / if set then look for nonperfects even if there is a perfect match / Things to do as of Feb. 11, 2000. / Given traits as a set of 1's and zeros with respect to a bunch of / attributes and genomes, translate these into a list of genomes. / Given genes, parse away the genome info and give back the genes. / If there are certain genomes not considered, we should remove those / from our genome list. / Given the results of this analysis, we get a set of indexes into genomes. / We want to translate this to genes. / >Our main goal is to correlate traits with genes. / >The problem there is given facts like / >gene x in genome A is similar to gene y in genome B / >and genome A has certain traits, / >what is a consistent mapping of traits to genes / >that correlates with these facts? / So input data consists of: / (i) a set of traits associated with each genome (variable traits) / (ii) a set of genes associated with each genome (variable genes) / (iii) a set of matches between genes of different genomes (variable matches). / Initially, these matches are 1-1 for every genome pair. / (If not, then should merge genes in a single genome that are close / in sequence.) / Our goal is to correlate traits with some boolean expression on genes. / We also seek a simple correlation between traits and genes. / Algorithm: / 1: convert each gene to a number and give every gene a / globally unique name. / 2. find connected components among / the edges. / Can give a new name to each connected component. / 3. Reduce each connected component to a set of maximal cliques. / The cliques are the real genes. / 4. If a clique's elements are in exactly the same genomes / as some trait, then it is a good guess that the connected / component represents the trait. / 5. If a trait is in a superset of the component, then it could be that / any of a union of several genes cause the trait. / 6. If a trait is in a subset of a component, / then it is possible that the trait would be present / as a result of that component except for a repressing component. / Alternatively, it could be that the intersection of two or more / genes is needed. / Our algorithm starts by computing all possible non-null intersections, then / differences and finally (if necessary) unions. / Note that It could be that there will be many traits that can result / from a gene and many genes that can produce a trait. / Then it will be up to the biologists to untangle the result. / Abbreviated problem using cogs (Feb, 2000): / Inputs: / traits associated with genomes. We should find equivalent classes / of traits if they are in the same genomes. / We use = for multiple genes within a single genome in a single cog. / We use # for multiple genes within a single genome that are all singleton. / Actually, what we might do is just create a single gene for / each genome that is outside any cog. Called Aspecific, etc. / We use # for multiple genes within a single genome that all belong / to cogs having the same genomes. / For example suppose two cogs have genomes X, Y, and Z. / cog1: has gene x1 in X, gene y1 in Y, and gene z1 in Z. / cog2: has gene x2 in X, gene y2 in Y, and gene z2 in Z. / Then there is no way for us to distinguish the effect x1 from x2, / since any trait that correlates with cog1 also correlates with cog2. / Singletons are just a special case of this. / nomes.number and nomes.name translates between genome numbers and names / enes.number and enes.name translates between gene numbers and gene names / (the names may have = and # signs in them) / genecomps takes each gene connected component and translates it to gene / numbers with no two gene numbers in a component appearing twice / genomecomps takes each genome connected component and translates it to / genome numbers. / / We might try a new strategy because computing / all these intersections is taking way too long. / In this new strategy, we take a trait t and consider the genomes genomes(t) / it is in. We find all cogs in a superset of t. / If we find equality, then great, else do a multiintersection / of all of those on top and see if we get the answer. / If not, then try the multiintersection and a difference. / If there are no supersets, then take a union of subsets / We call this findcauseglobal / BASICS /finds intersection of two lists / fastest of all intersect: {[x;y] x,: () y,: () if[(#x) < (#y) i: x ?/: y j: & i < #x :x[?i[j]] ] i: y ?/: x j: & i < #y :y[?i[j]] } /finds intersection of two lists / fastest of all hasintersect: {[x;y] x,: () y,: () i: x ?/: y : (&/i) < #x } / x is a proper subset of y propersubset:{[x;y] x,: () y,: () if[~ (#x) < (#y); :0] / must be smaller :(#x) = (#intersect[x;y]) } alloweddif: 0 if[0 < #_i alloweddif: 0 $ _i[0] ] / subset:{[x;y] (#x) = (#intersect[x;y])} subset:{[x;y] ~ (#x) > ((#intersect[x;y]) + alloweddif)} / could have an approximate subset /finds indexes in x and y that intersect / If x and y are both sets, then the results will be of the same length / fastest of all intersectindexes: {[x;y] i: x ?/: y / where each y hits j: & i < #x / those ys that hit :(i[j];j) } /finds indexes in x that intersect with y intersectleftindexes: {[x;y] i: x ?/: y / where each y hits j: & i < #x / those ys that hit :i[j] } /finds intersection of two lists / and returns index pairs of matches. Assumes no duplicates / in either list intersectbothindexes: {[x;y] x,: () y,: () i: x ?/: y pairs: (i ,' (!#y)) k: & pairs[;0] < #x :pairs[k] } / intersect many lists multiintersect:{[lists] size: #lists if[2 > size; :lists] first: lists[0],() jj: ,/ ?:' first (?/:)/: lists[1+ !(size-1)] / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry kk: & x = size - 1 :first[kk] } / this is a set intersection so we remove duplicates multiintersect:{[lists] size: #lists if[2 > size; :lists] first: lists[0],() jj: first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry kk: & x = size - 1 :first[kk] } / A faster difference differ:{[x;y] x,: () y,: () i: y ?/: x j: & i = #y :?x[j] } / equalset equalset:{[x;y] (x @ < x) ~ (y @ < y)} avg:{(+/ x) % # x} var:{avg[_sqr x] - _sqr avg[x]} std:{_sqrt var[x]} cov:{avg[x * y] - avg[x] * avg[y]} corr:{ (cov[x;y])%((std[x]) * (std[y]))} / delay based search / END OF BASICS / DUMP TABLE dumptable / formstring takes a list and makes a string formstring:{[list] list,: () : ,/ ($list) ,\: (" ") } / Output a table (a variable) to a text file outfile (string) / e.g. output[`guide; guide; "foobar"] dumptable:{[tablename; table; outfile] out: ,("table: "), ($tablename) out,: ,("attributes: "), formstring[!table] first: *!table numofelements: . ("#"), ($tablename), ("."), ($first) i: 0 while[i < numofelements list: table[;i] x: formstring'list out,: , (-1) _ ,/x ,\: ("|") i+: 1 ] outfile 0: out } / APPLICATION SPECIFIC / find the connected components of every node / Algorithm: compute the transitive closure. / Then each node goes in the component of its first 1. / This makes no assumptions about the matches (e.g. it's fine / if there is an edge 17 3.) findconnected:{[nodes; matches] mat: transitive[nodes;matches] c.id: mat ?\: 1 c.node: !#mat part: = c.id :(c.node[part]) } / logical dot product logicdot:{[v1;v2] |/ v1*v2} / find transitive closure of a set of nodes and edges / and return a matrix. transitive:{[nodes; matches] newmatches: matches newmatches,: |:' newmatches newmatches,: nodes,' nodes mat: ((#nodes);(#nodes)) # 0 / establish matrix i: 0 while[i < #newmatches mat[newmatches[i;0]; newmatches[i;1]]: 1 i+: 1 ] / compute transitive closure change: 1 while[change oldmat: mat mat: mat logicdot/:\: +mat change: ~ oldmat ~ mat ] :mat } / returns 1 if the component is a clique testclique:{[matches; comp] if[(#comp) < 3; :1] / surely is a clique if few matches are there allpairs: ,/comp ,/:\: comp i: & allpairs[;0] < allpairs[;1] allpairs@: i :subset[allpairs; matches] } / returns 1 list of missing edges findbad:{[matches; comp] if[(#comp) < 3; :()] / surely is a clique if few matches are there allpairs: ,/comp ,/:\: comp i: & allpairs[;0] < allpairs[;1] allpairs@: i :differ[allpairs; matches] } / computevalence finds the connectivity of various matches / all edges are from lower to upper ??? actually, it's better / if it's complete. computevalence:{[matches] part: = matches[;0] :(?matches[;0]; #:' part) } / inversemap finds the genome holding each gene / The result goes into a global variable "genemap". / It is sorted by geneid / Assumptions: genomes are identified by 0, 1, 2, ... inversemap:{[genes] genemap.geneid:: () genemap.genomeid:: () i: 0 while[i < #genes size: #genes[i] genemap.geneid,: genes[i] genemap.genomeid,: size # i i+: 1 ] i: < genemap.geneid genemap.geneid@: i genemap.genomeid@: i } / inversemaptrait finds the genomes holding each trait / The result goes into a global variable "traitmap". / traits need not start off at zero / traits is a list of traits per genome -- assumed to be in some order. / ???? THIS MAY HAVE TO CHANGE BECAUSE OF THE ASSUMED ORDER. MUST / MAKE IT IN THE SAME ORDER AS nomes.name. traitinversemap:{[traits] traitmap.trait:: () traitmap.genomeids:: () traitmap.genomesymbs:: () i: 0 while[i < #traits size: #traits[i] traitmap.trait,: traits[i] traitmap.genomeids,: size # i / all are in genome i. traitmap.genomesymbs,: genometab.symb[size # i] i+: 1 ] part: = traitmap.trait traitmap.trait:: ? traitmap.trait traitmap.genomeids:: traitmap.genomeids[part] traitmap.genomesymbs:: traitmap.genomesymbs[part] / up to here, we have a given trait and all the genomes of that trait / the genomes are ordered (will this remain the case???) / Now, we unify traits that are in the same set of genomes / under the theory that we can't distinguish them anyway. part2: = traitmap.genomeids foo:{[list] ` $ (-1) _ ,/ ($list) ,\: ("#")} x: foo'traitmap.trait[part2] traitmap.genomeids?: traitmap.genomesymbs?: traitmap.trait:: x } / take a component which is not a clique and reduce it to two / cliques. fixbad:{[matches; connect; comp] if[testclique[matches;comp]; : comp] bads: findbad[matches;comp] / like test clique except / it returns the missing edges (those that a clique would have) badnodes: ?,/bads i: intersectleftindexes[connect.node; badnodes] j: < connect.valence[i] toremove: differ[badnodes; connect.node[i[j]]], connect.node[i[j]] removed: () newcomp: comp flag: 1 while[flag first: *toremove removed,: first toremove: 1 _ toremove newcomp: newcomp _dv first flag: ~ testclique[matches;newcomp] ] :(newcomp; fixbad[matches; connect; removed]) } / findgenome locates the genome on which a particular gene / is found abased on the genemap findgenome:{[onegene] genemap.genomeid[genemap.geneid _bin onegene] } / Given cliques (or whatever notion of component we have) / and the traits, find out which traits are found on / the same genomes as which components. / The cliques come in the form of a bunch of the genome ids / they are derived from. / Those are perfect matches. It may also be that / some genes produce a trait but are repressed by other genes. / So, some traits may correspond to a positive component - a negative / component. / Others may correspond to intersections or unions. / In any case, the result is always in the format: / (`exact;3;`"2_less_5") / meaning there is an exact match between the genomes of trait 3 / and the genomes of genome component 2 less the genomes of / genome component 5. findcause:{[traitmap; genomecomponents; origin] out: () i: 0 while[i < #traitmap.trait tgenomeids: traitmap.genomeids[i] jj1: (subset[tgenomeids]'genomecomponents) jj2: (subset[;tgenomeids]'genomecomponents) jj: & jj1 & jj2 if[0 < #jj out,: (`exact) ,/: traitmap.trait[i] ,/: origin[jj] ] if[ 0 = #jj kk: & jj1 if[0 < #kk / out,: (`approximate) ,/: traitmap.trait[i] ,/: kk ] ] i+: 1 ] :out } / given a targetset and a set of sets called the sourcesets / such that the intersection of the sourcesets equals the targetset, / we want the smallest subset of these sourcesets whose intersection / equals the targetset. / Algorithm: start with smallest set and then keep trying to reduce / difference as fast as possible. findmin:{[targetset; sourcesets] counts: #:'sourcesets out: () ii: < counts current: sourcesets[ii[0]] out,: ii[0] extra: differ[current; targetset] while[0 < #extra extracount: #:'differ[extra]'sourcesets mm: > extracount out,: *mm current: intersect[current; sourcesets[*mm]] extra: differ[current; targetset] ] :out @ extracount out,: *mm current: intersect[current; sourcesets[*mm]] extra: differ[current; targetset] ] :out @ counts current: sourcesets[ii[0]] out,: ii[0] extra: differ[targetset; current] / what more is needed while[0 < #extra extracount: #:'intersect[extra]'sourcesets highcount: |/extracount mm: & extracount = highcount / best sourcesets / reorder to get lowest intersect with current currentcount: #:' intersect[current]'sourcesets[mm] lowcount: &/currentcount qi: currentcount ? lowcount out,: mm[qi] current,: sourcesets[mm[qi]] current?: extra: differ[targetset; current] ] :out @ < out } findmaxrand:{[targetset; sourcesets] counts: #:'sourcesets out: () ii: > counts nn: ii[* 1 _draw #ii] current: sourcesets[nn] out,: nn extra: differ[targetset; current] / what more is needed while[0 < #extra extracount: #:'intersect[extra]'sourcesets highcount: |/extracount mm: & extracount = highcount / best sourcesets / reorder to get lowest intersect with current currentcount: #:' intersect[current]'sourcesets[mm] lowcount: &/currentcount qi: currentcount ? lowcount out,: mm[qi] current,: sourcesets[mm[qi]] current?: extra: differ[targetset; current] ] :out @ < out } / the similarity of two lists is / size of intersect(list1;list2)/union(list1;list2) findsim:{[list1; list2] :(#intersect[list1;list2])%(#?list1,list2) } / Here we look for subsets of the trait and for supersets of the trait. / if we find an exact match, then we report it otherwise we report / approximate answers. / Want answers of the form: all of these cogs together are required / to give us this trait or any of these cogs is enough for this trait. / We try to give all such possible minimal answers. findglobalcause:{[traitmap; genomecomponents; origin] numattempts: 20 maxattempts: 2000 similarmin: 0.79 out: () i: 0 while[i < #traitmap.trait / for each trait, get the genomeids / to get approximation, simply make subset be approximate tgenomeids: traitmap.genomeids[i] / finding similarity zz: findsim[tgenomeids]'genomecomponents zzgoodnum: # & zz > similarmin zzindex: > zz zzindex@: !(zzgoodnum) if[0 < #zzindex out,: ,(`similar) , traitmap.trait[i] , ,zzindex ] / end of finding similarity jj1: (subset[tgenomeids]'genomecomponents) indexjj1: &jj1 jj2: (subset[;tgenomeids]'genomecomponents) indexjj2: & jj2 jj: & jj1 & jj2 if[0 < #jj out,: ,(`exact) , traitmap.trait[i] , ,jj / Some cog exactly matches the result. / Remove the exact ones because already handled indexjj1: differ[indexjj1; jj] indexjj2: differ[indexjj2; jj] ] if[ ((0 = #jj) | nonperfectflag) & (0 < #indexjj1) x: multiintersect[genomecomponents[indexjj1]] multiflag: subset[x;tgenomeids] / We know that x must be at least a superset / since every member of x is a strict superset / of tgenomeids. if[multiflag / the two must be equal / So all of the genomecomponents[indexjj1] / intersect to exactly tgenomeids (the genomes / of this trait). / It would be good to find the smallest set / of genomecomponents[indexjj1] / whose intersection equals tgenomeids. allmins: () imin: findmin[tgenomeids; genomecomponents[indexjj1]] / tries to find the smallest set / whose intersection equals tgenomeids allmins,: ,imin zz: indexjj1[imin] out,: ,(`exactintersect) ,traitmap.trait[i], (,zz) sizemin: 0 kk: 0 step: 1 while[(sizemin < #allmins) & (kk < maxattempts) sizemin: #allmins do[_ numattempts*(step^2) imin2: findminrand[tgenomeids; genomecomponents[indexjj1]] if[(~ imin2 _in allmins) & (~ (#imin2) > (#imin)) allmins,: ,imin2 zz: indexjj1[imin2] out,: ,(`exactintersect),traitmap.trait[i],(,zz) ] kk+: 1 ] / do statement step+: 1 ] / while loop ] if[~ multiflag out,: ,(`bigaretoobig), traitmap.trait[i], (,indexjj1) ] ] if[ ((0 = #jj) | nonperfectflag) & (0 < #indexjj2) x: ?,/genomecomponents[indexjj2] multiflag: subset[tgenomeids; x] if[multiflag / the two must be equal allmaxes: () imax: findmax[tgenomeids; genomecomponents[indexjj2]] allmaxes,: ,imax zz: indexjj2[imax] out,: ,(`exactunion) , traitmap.trait[i], (,zz) sizemax: 0 kk: 0 step: 1 while[(sizemax < #allmaxes) & (kk < maxattempts) sizemax: #allmaxes do[_ numattempts*(step^2) imax2: findmaxrand[tgenomeids; genomecomponents[indexjj2]] if[(~ imax2 _in allmaxes) & (~ (#imax2) > (#imax)) allmaxes,: ,imax2 zz: indexjj2[imax2] out,: ,(`exactunion) , traitmap.trait[i], (,zz) ] kk+: 1 ] step+: 1 ] / no more changes, then declare victory ] if[~ multiflag out,: ,(`smallaretoosmall),traitmap.trait[i],(,indexjj2) ] ] i+: 1 ] :out } / find differences of genome components and then record them. / A difference is interesting only if there is an intersection. / Again, side effect to origin finddifs:{[] p: ,/(!#genomecomponents) ,/:\: (!#genomecomponents) i: & p[;0] < p[;1] p@: i / first member is less than second outid: outcomp: () j: 0 while[j < #p g1: genomecomponents[p[j;0]] g2: genomecomponents[p[j;1]] if[hasintersect[g1;g2] x: differ[g1;g2] if[0 < #x genomecomponents,: ,x origin,: ` $ ($origin[p[j;0]]), ("_less_"), ($origin[p[j;1]]) ] x: differ[g2;g1] if[0 < #x genomecomponents,: ,x origin,: ` $ ($origin[p[j;1]]), ("_less_"), ($origin[p[j;0]]) ] ] j+: 1 ] } / find intersections of genome components and then record them. / An intersection is interesting only if it is non-null. / UNUSED findinters:{[genomecomponents] pairs: ,/(!#genomecomponents) ,/:\: (!#genomecomponents) i: & pairs[;0] < pairs[;1] pairs@: i / first member is less than second outid: outcomp: () j: 0 while[j < #pairs g1: genomecomponents[pairs[j;0]] g2: genomecomponents[pairs[j;1]] if[(hasintersect[g1;g2]) & (~ subset[g1;g2]) & (~ subset[g2;g1]) outcomp,: ,intersect[g1;g2] outid,: ` $ ($pairs[j;0]), ("_inter_"), ($pairs[j;1]) ] j+: 1 ] :(outcomp; outid) } / find intersections of genome components recursively and record them. / An intersection is interesting only if it is non-null. / Side effects to genomecomponents and origin / Only intersections are recursive as these are the most likely we think. recurfindinters:{[] change: 1 base: 0 new: origin / residual extra while[change origsize: #origin change: 0 pairs: ,/(!#new) ,/:\: (!#origin) i: & pairs[;0] < pairs[;1] pairs@: i / first member is less than second outid: outcomp: () j: 0 while[j < #pairs g1: genomecomponents[pairs[j;0]] g2: genomecomponents[pairs[j;1]] if[(hasintersect[g1;g2]) if[(~ subset[g1;g2]) if[(~ subset[g2;g1]) yy: intersect[g1;g2] change: 1 genomecomponents,: ,yy origin,: ` $ ($origin[base+pairs[j;0]]), ("_inter_"), ($origin[pairs[j;1]]) ] ] ] j+: 1 ] if[change base: origsize new: (#origin) - base ] ] } / find unions of genome components and then record them. / A union is always interesting unless there is only one component. findunions:{[] pairs: ,/(!#genomecomponents) ,/:\: (!#genomecomponents) i: & pairs[;0] < pairs[;1] pairs@: i / first member is less than second j: 0 while[j < #pairs g1: genomecomponents[pairs[j;0]] g2: genomecomponents[pairs[j;1]] if[(~ subset[g1;g2]) & (~ subset[g2;g1]) genomecomponents,: ,?g1,g2 origin,: ` $ ($origin[pairs[j;0]]), ("_union_"), ($origin[pairs[j;1]]) ] j+: 1 ] } / translatetogene takes a result tuple whose origin expression / is in terms of component ids / (= genome component ids) and converts it to genes translatetogene:{[res] :(`traitgene; res[1]; converttogene[res[2]]) } / converttogene takes an expression of the form `"3_union_2_less_5" / and looks at the first gene of component 3, and ditto for 2 and 5 and / notes the result. / MUST CHANGE IF WE WANT TO MAP BACK TO GENE NAMES converttogene:{[symb] str: $symb vec: & str _in\: "0123456789" newstr: () size: #str i: 0 numfound: 0 num: () while[i < size isin: i _in vec if[~ isin if[numfound newstr,: $* components[0 $ num] numfound: 0 num: () ] newstr,: str[i] ] if[isin numfound: 1 num,: str[i] ] i+: 1 ] if[numfound; newstr,: $* components[0 $ num]] :` $ newstr } / EXECUTION WITH FIXED DATA ab.symb: () ab.long: () processabbrev:{[line] i: line ? " " ab.symb,: ` $ line[!i] i: i+1 ab.long,: ` $ line[i+!((#line) - i)] } b: ("Afu Archaeoglobus fulgidus" "Hbs Halobacterium sp. NRC-1" "Mac Methanosarcina acetivorans" "Mth Methanothermobacter thermautotrophicus" "Mja Methanococcus jannaschii" "Mka Methanopyrus kandleri AV19" "Tac Thermoplasma acidophilum" "Tvo Thermoplasma volcanium" "Pho Pyrococcus horikoshii" "Pab Pyrococcus abyssi" "Pya Pyrobaculum aerophilum" "Sso Sulfolobus solfataricus" "Ape Aeropyrum pernix" "Sce Saccharomyces cerevisiae" "Spo Schizosaccharomyces pombe" "Ecu Encephalitozoon cuniculi" "Aae Aquifex aeolicus" "Tma Thermotoga maritima" "Ctr Chlamydia trachomatis" "Cpn Chlamydophila pneumoniae CWL029" "Tpa Treponema pallidum" "Bbu Borrelia burgdorferi" "Syn Synechocystis" "Nos Nostoc sp. PCC 7120" "Fnu Fusobacterium nucleatum" "Dra Deinococcus radiodurans" "Cgl Corynebacterium glutamicum" "Mtu Mycobacterium tuberculosis H37Rv" "MtC Mycobacterium tuberculosis CDC1551" "Mle Mycobacterium leprae" "Cac Clostridium acetobutylicum" "Lla Lactococcus lactis" "Spy Streptococcus pyogenes M1 GAS" "Spn Streptococcus pneumoniae TIGR4" "Sau Staphylococcus aureus N315" "Lin Listeria innocua" "Bsu Bacillus subtilis" "Bha Bacillus halodurans" "Uur Ureaplasma urealyticum" "Mpu Mycoplasma pulmonis" "Mpn Mycoplasma pneumoniae" "Mge Mycoplasma genitalium" "Eco Escherichia coli K12" "EcZ Escherichia coli O157:H7 EDL933" "Ecs Escherichia coli O157:H7" "Ype Yersinia pestis" "Sty Salmonella typhimurium LT2" "Buc Buchnera sp. APS" "Vch Vibrio cholerae" "Pae Pseudomonas aeruginosa" "Hin Haemophilus influenzae" "Pmu Pasteurella multocida" "Xfa Xylella fastidiosa 9a5c" "Nme Neisseria meningitidis MC58" "NmA Neisseria meningitidis Z2491" "Rso Ralstonia solanacearum" "Hpy Helicobacter pylori 26695" "jHp Helicobacter pylori J99" "Cje Campylobacter jejuni" "Atu Agrobacterium tumefaciens strain C58 (Cereon)" "Sme Sinorhizobium meliloti" "Bme Brucella melitensis" "Mlo Mesorhizobium loti" "Ccr Caulobacter crescentus CB15" "Rpr Rickettsia prowazekii" "Rco Rickettsia conorii") processabbrev'b localmode: 0 if[localmode = 0 / take data from Mitch's spreadsheet format and put in positive and negative / traits / get rid of blanks at either end of the string delendblanks:{[string] if[0 = #string; :""] if[string ~ ,"" ; :""] i: & ~ string = " " if[(#string) = (#i); :string] if[0 = (#i); :""] string: (- ((#string) - (1 + *|i))) _ string :(*i) _ string } / return the fields parseline:{[line] i: 0, & line = "," cuts: i _ line cuts[0]: (","), cuts[0] : ` $ delendblanks' 1 _' cuts } / figure out which traits there are processline:{[traits; cuts] genometab.name,: ,cuts[0] / cuts: cuts _di 0 23 56 i: & cuts = `"1" stuff: ` $ ("+") ,/: $traits[i] i: & cuts = `"0" stuff,: ` $ ("-") ,/: $traits[i] genometab.traits,: ,stuff } a: 0: "microorganisms" traitsposs: parseline[a[0]] / traitsposs: traitsposs _di 0 23 56 genometab.name: () genometab.traits: () lines: processline[traitsposs]'parseline'1 _ a / side effect to genometab i1: genometab.name ? `"Methanobacterium t." if[i1 < #genometab.name genometab.name[i1]: `"Methanobacterium thermoautotrophicum"] genometab.symb: ab.symb[ab.long ?/: genometab.name] i2: < genometab.symb genometab.name@: i2 genometab.traits@: i2 genometab.symb@: i2 traits: genometab.traits / traits in the order of symbs ] \l cog.k if[localmode / localmode = 1 means testing with random data / CHANGE REQUIRED: traits must be provided in a list. / traits per genome. / INPUT data. traits: (`t1 `t2 `t12 `t1 `t3 `t4 `t5 `t7 `t11 `t10 `t11) / CHANGE REQUIRED: must change gene names to gene numbers. / Same gene name may have different numbers in different genes. / genes per genome / INPUT data; must be numeric starting at 0. / So, if real genes are in some other form, we must translate them to indexes / of genes. genes: ( 0 1 2 8 3 6 7 4 5 9) nodes: ?,/genes / the matches here are for genes in different genomes / This can be a result of any blosum or pam matrix one wants. / INPUT data -- must be numeric again. matches: (2 3 2 6 2 4 3 4 1 2 4 6 1 3 1 6 1 4 2 6 0 8 7 8) ] / end of localmode / EXECUTION if[localmode / localmode = 1 means testing with random data components: findconnected[nodes; matches] / connected components / Might have gene n of genome A connected to / many genes of genome B. pairs: computevalence[matches] connect.node: pairs[0] connect.valence: pairs[1] x: testclique[matches]'components / At this point testclique / works on geneids. If we want to be more "liberal" / we could let it work on component ids. / To do that, we'd translate each component (having / node ids) to a set of component ids and we'd perform / our processing in testclique based on component ids. goodcomps: & x badcomps: & ~ x fixedcomps: fixbad[matches; connect]' components[badcomps] components: (components _di badcomps) , ,/ fixedcomps inversemap[genes] ] / end of local mode if[localmode = -1 / old way / Normally, we will get the traits. ??? numgenomes: 39 numtraits: 10 / number of traits per genome alltraits: ` $ ("t") ,/: $(numgenomes*numtraits) _draw (2*numtraits) / Traits associated with each genome. traits: (numgenomes; numtraits) # alltraits genes: genecomps ] if[localmode = 0 / mitch's data genes: genecomps ] traitinversemap[traits] / produces map from traits to genome sets genomecomponents: :[localmode ?:' findgenome'' components genomecomps] origin: ` $ $ !#genomecomponents / Above, we have pure components. Below, we expand these as global variables. if[localmode / ??? temporarily don't try this for real work recurfindinters[] / only intersection is recursive. The others are not. finddifs[genomecomponents] findunions[genomecomponents] results: findcause[traitmap; genomecomponents; origin] ] if[0 = localmode results: findglobalcause[traitmap; genomecomponents; origin] / each line of results comes out as `howgood `trait `set of genome / component indexes / The result is a set of genomecomponents whose intersection give / the result desired in the case that howgood = `exactintersect / or whose union gives the result desired in the case howgood = `exactunion. / So, we go back to coggenes of those genome components and try to present / the results for each genome concerned. / For `exactintersect, this means: / find those genes having to do with the genomes in of the trait / and produce a table trait|genome|genes |`allneeded / for the union produce trait|genome|genes|`oneneeded / Remember: / We use = for multiple genes within a single genome in a single cog. / We use # for multiple genes within a single genome that all belong / to cogs ranging over the same genomes (i.e. equivalent cogs) conclude.trait: () conclude.genomesymb: () conclude.genes: () conclude.condition: () conclude.genome: () traitcogs.trait: () traitcogs.cogs: () traitcogs.condition: () / traitmap.trait:: () / traitmap.genomeids:: () spitresult:{[res] i: traitmap.trait ? res[1] if[i = #traitmap.trait; !-14] / should never happen mygenomes: traitmap.genomeids[i] size: #mygenomes if[res[0] = `exactintersect conclude.trait,: size # res[1] conclude.genome,: genometab.name[mygenomes] conclude.genomesymb,: genometab.symb[mygenomes] conclude.condition,: size # `allneeded goodgenes: trim[coggenes; mygenomes; genomecomponents]'res[2] / i think i should take the cogs here ??? desiredgenes: +goodgenes stringify:{[list] ` $ (-2) _ ,/ ($?list) ,\: (", ")} conclude.genes,: stringify'desiredgenes traitcogs.trait,: res[1] traitcogs.cogs,: ,cog[res[2]] traitcogs.condition,: `allneeded ] if[res[0] = `exactunion conclude.trait,: size # res[1] conclude.genome,: genometab.name[mygenomes] conclude.genomesymb,: genometab.symb[mygenomes] conclude.condition,: size # `oneneeded goodgenes: expand[coggenes; mygenomes; genomecomponents]'res[2] desiredgenes: +goodgenes unionstringify:{[list] list@: & ~ list = ` x: ` $ (-2) _ ,/ ($?list) ,\: (", ") x} conclude.genes,: unionstringify'desiredgenes traitcogs.trait,: res[1] traitcogs.cogs,: ,cog[res[2]] traitcogs.condition,: `oneneeded ] if[res[0] = `exact conclude.trait,: size # res[1] conclude.genome,: genometab.name[mygenomes] conclude.genomesymb,: genometab.symb[mygenomes] conclude.condition,: size # `perfectmatch goodgenes: expand[coggenes; mygenomes; genomecomponents]'res[2] desiredgenes: +goodgenes unionstringify:{[list] list@: & ~ list = ` x: ` $ (-2) _ ,/ ($?list) ,\: (", ") x} conclude.genes,: unionstringify'desiredgenes traitcogs.trait,: res[1] traitcogs.cogs,: ,cog[res[2]] traitcogs.condition,: `perfectmatch ] if[res[0] = `similar conclude.trait,: size # res[1] conclude.genome,: genometab.name[mygenomes] conclude.genomesymb,: genometab.symb[mygenomes] conclude.condition,: size # `similar goodgenes: expandforsim[coggenes; mygenomes; genomecomponents]'res[2] desiredgenes: +goodgenes unionstringify:{[list] list@: & ~ list = ` x: ` $ (-2) _ ,/ ($?list) ,\: (", ") x} conclude.genes,: unionstringify'desiredgenes traitcogs.trait,: res[1] traitcogs.cogs,: ,cog[res[2]] traitcogs.condition,: `similar ] } / for each genome in targetgenomes in order, produce those genes / that apply. Some genomes in targetgenomes may not be in / genomecomplist[genomecompindex] / in which case empty lists should be produced expand:{[coggenelist; targetgenomes; genomecomplist; genomecompindex] out: (#targetgenomes) # ` i: intersectleftindexes[targetgenomes; genomecomplist[genomecompindex]] x: coggenelist[genomecompindex] cogname: cog[genomecompindex] y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x out[i]: y :out } / when doing similarity, we don't have a 1-1 correspondence between / genomes covered by trait and genomes covered by similar cog. / So, we take only those genes that are on the genomes that are / in the intersection. expandforsim:{[coggenelist; targetgenomes; genomecomplist; genomecompindex] out: (#targetgenomes) # ` i: intersectleftindexes[targetgenomes; ?genomecomplist[genomecompindex]] i2: intersectleftindexes[?genomecomplist[genomecompindex];targetgenomes] x: coggenelist[genomecompindex] x: x[i2] cogname: cog[genomecompindex] y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x out[i]: y :out } / for each genome in targetgenomes in order, produce those genes / that apply. Some genomes in genomecomplist[genomecompindex] may / not be in targetgenomes in which case they should be ignored. trim:{[coggenelist; targetgenomes; genomecomplist; genomecompindex] i: intersectleftindexes[genomecomplist[genomecompindex]; targetgenomes] x:coggenelist[genomecompindex;i] cogname: cog[genomecompindex] y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x :y } x: spitresult'results dumptable[`conclude; conclude; "tempconclude"] dumptable[`traitmap; traitmap; "temptraitmap"] dumptable[`traitcogs; traitcogs; "temptraitcogs"] / dumptable[`genometab; genometab; "tempgenometab"] / "output" 0: spitresult'results / just flattens out all the genes related to a trait but does / not distinguish union from intersection. spitresultold:{[res] x: ($res[0]) , ("|"), ($res[1]), ("|") x2: ,/coggenes[res[2]] / :x ,/: $x2 x3: (-2) _ ,/ ($x2) ,\: (", ") :x, x3 } ] if[localmode / the result it gives has to do with genome components and / is in the form: / (`exact;2;`"3") / (`exact;3;`"2_less_5") / To find the genes that cause it, just use the genomecomponent ids / as an index into components. / e.g. components[2] is a cause and components[5] is a repressor / for trait 3. x: results / each line of results / Now find out which genes are involved / We characterize each connected component by its first gene. y: translatetogene'results ] \\ / END OF APPLICATION SPECIFIC