/ Things to do as of Dec. 2000: look at findglobalcause
/ we have made subset flexible so that x is a subset of y if
/ card(x) <= card(intersect(x,y)) + alloweddif which is a parameter.
/ k bacteria allowedif

/ WARNING: NO DUPLICATE LABELS IN EITHER MICROORGANISMS or THE ab table.

nonperfectflag: 1 
   / if set then look for nonperfects even if there is a perfect match

/ Things to do as of Feb. 11, 2000.
/ Given traits as a set of 1's and zeros with respect to a bunch of
/ attributes and genomes, translate these into a list of genomes.
/ Given genes, parse away the genome info and give back the genes.
/ If there are certain genomes not considered, we should remove those
/ from our genome list.
/ Given the results of this analysis, we get a set of indexes into genomes.
/ We want to translate this to genes.

/ >Our main goal is to correlate traits with genes.
/ >The problem there is given facts like
/ >gene x in genome A is similar to gene y in genome B
/ >and genome A has certain traits,
/ >what is a consistent mapping of traits to genes
/ >that correlates with these facts?    

/ So input data consists of: 
/ (i) a set of traits associated with each genome (variable traits)
/ (ii) a set of genes associated with each genome (variable genes)
/ (iii) a set of matches between genes of different genomes (variable matches).
/ Initially, these matches are 1-1 for every genome pair.
/ (If not, then should merge genes in a single genome that are close
/ in sequence.)
/ Our goal is to correlate traits with some boolean expression on genes.
/ We also seek a simple correlation between traits and genes.

/ Algorithm: 
/ 1: convert each gene to a number and give every gene a 
/ globally unique name.
/ 2. find connected components among 
/ the edges. 
/ Can give a new name to each connected component.
/ 3. Reduce each connected component to a set of maximal cliques.
/ The cliques are the real genes.
/ 4. If a clique's elements are in exactly the same genomes
/ as some trait, then it is a good guess that the connected
/ component represents the trait.
/ 5. If a trait is in a superset of the component, then it could be that
/ any of a union of several genes cause the trait. 
/ 6. If a trait is in a subset of a component, 
/ then  it is possible that the trait would be present
/ as a result of that component except for a repressing component.
/ Alternatively, it could be that the intersection of two or more 
/ genes is needed.
/ Our algorithm starts by computing all possible non-null intersections, then
/ differences and finally (if necessary) unions.

/ Note that It could be that there will be many traits that can result
/ from a gene and many genes that can produce a trait. 
/ Then it will be up to the biologists to untangle the result.

/ Abbreviated problem using cogs (Feb, 2000):
/ Inputs:
/ traits associated with genomes. We should find equivalent classes
/ of traits if they are in the same genomes.
/ We use = for multiple genes within a single genome in a single cog.
/ We use # for multiple genes within a single genome that are all singleton.
/	Actually, what we might do is just create a single gene for
/	each genome that is outside any cog. Called Aspecific, etc.
/ We use # for multiple genes within a single genome that all belong
/  to cogs having the same genomes.
/  For example suppose two cogs have genomes X, Y, and Z.
/  cog1: has gene x1 in X, gene y1 in Y, and gene z1 in Z.
/  cog2: has gene x2 in X, gene y2 in Y, and gene z2 in Z.
/  Then there is no way for us to distinguish the effect x1 from x2,
/  since any trait that correlates with cog1 also correlates with cog2.
/  Singletons are just a special case of this.
/ nomes.number and nomes.name translates between genome numbers and names
/ enes.number and enes.name translates between gene numbers and gene names
/	(the names may have = and # signs in them)
/ genecomps takes each gene connected component and translates it to gene
/   numbers with no two gene numbers in a component appearing twice
/ genomecomps takes each genome connected component and translates it to
/   genome numbers.
/
/ We might try a new strategy because computing 
/ all these intersections is taking way too long.
/ In this new strategy, we take a trait t and consider the genomes genomes(t)
/ it is in. We find all cogs in a superset of t.
/ If we find equality, then great, else do a multiintersection
/ of all of those on top and see if we get the answer.
/ If not, then try the multiintersection and a difference.
/ If there are no supersets, then take a union of subsets
/ We call this findcauseglobal

	
/ BASICS


/finds intersection of two lists
/ fastest of all
intersect: {[x;y]
  x,: ()
  y,: ()
  if[(#x) < (#y)
    i: x ?/: y
    j: & i < #x
    :x[?i[j]]
  ]
  i: y ?/: x
  j: & i < #y
  :y[?i[j]]
}

/finds intersection of two lists
/ fastest of all
hasintersect: {[x;y]
  x,: ()
  y,: ()
  i: x ?/: y
  : (&/i) < #x
}

/ x is a proper subset of y
propersubset:{[x;y]
  x,: ()
  y,: ()
  if[~ (#x) < (#y); :0] / must be smaller
  :(#x) = (#intersect[x;y])
}

alloweddif: 0
if[0 < #_i
   alloweddif: 0 $ _i[0]
]

/ subset:{[x;y] (#x) = (#intersect[x;y])}
subset:{[x;y] ~ (#x) > ((#intersect[x;y]) + alloweddif)}

/ could have an approximate subset

/finds indexes in x and y that intersect
/ If x and y are both sets, then the results will be of the same length
/ fastest of all
intersectindexes: {[x;y]
  i: x ?/: y / where each y hits
  j: & i < #x / those ys that hit
  :(i[j];j)
}

/finds indexes in x that intersect with y
intersectleftindexes: {[x;y]
  i: x ?/: y / where each y hits
  j: & i < #x / those ys that hit
  :i[j]
}

/finds intersection of two lists
/ and returns index pairs of matches. Assumes no duplicates
/ in either list
intersectbothindexes: {[x;y]
  x,: ()
  y,: ()
  i: x ?/: y
  pairs: (i ,' (!#y))
  k: & pairs[;0] < #x
  :pairs[k]
}

/ intersect many lists
multiintersect:{[lists]
 size: #lists
 if[2 > size; :lists]
 first: lists[0],()
 jj: ,/ ?:' first (?/:)/: lists[1+ !(size-1)] / find indexes in first
 x: @[(1+#first) # 0; jj; + ; 1]
 x: (-1) _ x / delete missing entry
 kk: & x = size - 1
 :first[kk]
}

/ this is a set intersection so we remove duplicates
multiintersect:{[lists]
 size: #lists
 if[2 > size; :lists]
 first: lists[0],()
 jj:  first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first
 x: @[(1+#first) # 0; jj; + ; 1]
 x: (-1) _ x / delete missing entry
 kk: & x = size - 1
 :first[kk]
}

/ A faster difference
differ:{[x;y]
  x,: ()
  y,: ()
  i: y ?/: x
  j: & i = #y
  :?x[j]
}

/ equalset
equalset:{[x;y] (x @ < x) ~ (y @ < y)}

avg:{(+/ x) % # x}

var:{avg[_sqr x] - _sqr avg[x]}
std:{_sqrt var[x]}
cov:{avg[x * y] - avg[x] * avg[y]}
corr:{ (cov[x;y])%((std[x]) * (std[y]))}
/ delay based search

/ END OF BASICS


/ DUMP TABLE dumptable


/ formstring takes a list and makes a string
formstring:{[list]
  list,: ()
  : ,/ ($list) ,\: (" ")
}

/ Output a table (a variable) to a text file outfile (string)
/ e.g.   output[`guide; guide; "foobar"]
dumptable:{[tablename; table; outfile]
  out: ,("table: "), ($tablename)
  out,: ,("attributes: "), formstring[!table]
  first: *!table
  numofelements: . ("#"), ($tablename), ("."), ($first)
  i: 0
  while[i < numofelements
	list: table[;i]
	x: formstring'list
	out,: , (-1) _ ,/x ,\: ("|")
	i+: 1
  ]
  outfile 0: out
}


/ APPLICATION SPECIFIC


/ find the connected components of every node
/ Algorithm: compute the transitive closure.
/ Then each node goes in the component of its first 1.
/ This makes no assumptions about the matches (e.g. it's fine
/ if there is an edge 17 3.)
findconnected:{[nodes; matches]
   mat: transitive[nodes;matches]
   c.id: mat ?\: 1
   c.node: !#mat
   part: = c.id
   :(c.node[part])
}

/ logical dot product
logicdot:{[v1;v2] |/ v1*v2}


/ find transitive closure of a set of nodes and edges
/ and return a matrix.
transitive:{[nodes; matches]
  newmatches: matches
  newmatches,: |:' newmatches
  newmatches,: nodes,' nodes
  mat: ((#nodes);(#nodes)) # 0
  / establish matrix
  i: 0
  while[i < #newmatches
	mat[newmatches[i;0]; newmatches[i;1]]: 1
	i+: 1
  ]
  / compute transitive closure
  change: 1
  while[change
	oldmat: mat
	mat: mat logicdot/:\: +mat
	change: ~ oldmat ~ mat
  ]
  :mat
}

/ returns 1 if the component is a clique
testclique:{[matches; comp]
   if[(#comp) < 3; :1] / surely is a clique if few matches are there
   allpairs: ,/comp ,/:\: comp
   i: & allpairs[;0] < allpairs[;1]
   allpairs@: i
   :subset[allpairs; matches]
}


/ returns 1 list of missing edges
findbad:{[matches; comp]
   if[(#comp) < 3; :()] / surely is a clique if few matches are there
   allpairs: ,/comp ,/:\: comp
   i: & allpairs[;0] < allpairs[;1]
   allpairs@: i
   :differ[allpairs; matches]
}

/ computevalence finds the connectivity of various matches
/ all edges are from lower to upper ??? actually, it's better
/ if it's complete.
computevalence:{[matches]
  part: = matches[;0]
  :(?matches[;0]; #:' part)
}

/ inversemap finds the genome holding each gene
/ The result goes into a global variable "genemap".
/ It is sorted by geneid
/ Assumptions: genomes are identified by 0, 1, 2, ...
inversemap:{[genes] 
  genemap.geneid:: ()
  genemap.genomeid:: ()
  i: 0
  while[i < #genes
	size: #genes[i]
	genemap.geneid,: genes[i]
	genemap.genomeid,: size # i
	i+: 1
  ]
  i: < genemap.geneid
  genemap.geneid@: i
  genemap.genomeid@: i
}

/ inversemaptrait finds the genomes holding each trait
/ The result goes into a global variable "traitmap".
/ traits need not start off at zero
/ traits is a list of traits per genome -- assumed to be in some order.
/ ???? THIS MAY HAVE TO CHANGE BECAUSE OF THE ASSUMED ORDER. MUST
/ MAKE IT IN THE SAME ORDER AS nomes.name.
traitinversemap:{[traits] 
  traitmap.trait:: ()
  traitmap.genomeids:: ()
  traitmap.genomesymbs:: ()
  i: 0
  while[i < #traits
	size: #traits[i]
	traitmap.trait,: traits[i]
	traitmap.genomeids,: size # i / all are in genome i.
  	traitmap.genomesymbs,: genometab.symb[size # i]
	i+: 1
  ]
  part: = traitmap.trait
  traitmap.trait:: ? traitmap.trait
  traitmap.genomeids:: traitmap.genomeids[part]
  traitmap.genomesymbs:: traitmap.genomesymbs[part]
  / up to here, we have a given trait and all the genomes of that trait
  / the genomes are ordered (will this remain the case???)
  / Now, we unify traits that are in the same set of genomes
  / under the theory that we can't distinguish them anyway.
  part2: = traitmap.genomeids
  foo:{[list] ` $ (-1) _ ,/ ($list) ,\: ("#")}
  x: foo'traitmap.trait[part2]
  traitmap.genomeids?:
  traitmap.genomesymbs?:
  traitmap.trait:: x
}

/ take a component which is not a clique and reduce it to two
/ cliques.
fixbad:{[matches;  connect; comp]
  if[testclique[matches;comp]; : comp]
  bads: findbad[matches;comp] / like test clique except 
	/ it returns the missing edges (those that a clique would have)
  badnodes: ?,/bads
  i: intersectleftindexes[connect.node; badnodes]
  j: < connect.valence[i]
  toremove: differ[badnodes; connect.node[i[j]]], connect.node[i[j]]
  removed: ()
  newcomp: comp
  flag: 1
  while[flag
 	first: *toremove
	removed,: first
	toremove: 1 _ toremove
	newcomp: newcomp _dv first
	flag: ~ testclique[matches;newcomp]
  ]
  :(newcomp; fixbad[matches; connect; removed])
}

/ findgenome locates the genome on which a particular gene
/ is found abased on the genemap
findgenome:{[onegene] genemap.genomeid[genemap.geneid _bin onegene] }


/ Given cliques (or whatever notion of component we have)
/ and the traits, find out which traits are found on
/ the same genomes as which components.
/ The cliques come in the form of a bunch of the genome ids 
/ they are derived from.
/ Those are perfect matches. It may also be that
/ some genes produce a trait but are repressed by other genes.
/ So, some traits may correspond to a positive component - a negative
/ component.
/ Others may correspond to intersections or unions.
/ In any case, the result is always in the format:
/ (`exact;3;`"2_less_5")
/ meaning there is an exact match between the genomes of trait 3
/ and the genomes of genome component 2 less the genomes of
/ genome component 5.
findcause:{[traitmap; genomecomponents; origin]
  out: ()
  i: 0
  while[i < #traitmap.trait
	tgenomeids: traitmap.genomeids[i]
	jj1: (subset[tgenomeids]'genomecomponents) 
	jj2: (subset[;tgenomeids]'genomecomponents)
	jj: & jj1 & jj2
	if[0 < #jj 
		out,: (`exact) ,/: traitmap.trait[i] ,/: origin[jj]
	]
	if[ 0 = #jj
		kk: & jj1
		if[0 < #kk
			/ out,: (`approximate) ,/: traitmap.trait[i] ,/: kk
		]
	]
	i+: 1
  ]
  :out
}

/ given a targetset and a set of sets called the sourcesets
/ such that the intersection of the sourcesets equals the targetset,
/ we want the smallest subset of these sourcesets whose intersection
/ equals the targetset.
/ Algorithm: start with smallest set and then keep trying to reduce
/ difference as fast as possible.
findmin:{[targetset; sourcesets]
  counts: #:'sourcesets
  out: ()
  ii: < counts
  current: sourcesets[ii[0]]
  out,: ii[0]
  extra: differ[current; targetset]
  while[0 < #extra
	extracount: #:'differ[extra]'sourcesets
	mm: > extracount
	out,: *mm
	current: intersect[current; sourcesets[*mm]]
  	extra: differ[current; targetset]
  ]
  :out @ <out
}

/ same as findmin but with randomizing start point
findminrand:{[targetset; sourcesets]
  counts: #:'sourcesets
  out: ()
  ii: < counts
  nn: ii[* 1 _draw #ii]
  current: sourcesets[nn]
  out,: nn
  extra: differ[current; targetset]
  while[0 < #extra
	extracount: #:'differ[extra]'sourcesets
	mm: > extracount
	out,: *mm
	current: intersect[current; sourcesets[*mm]]
  	extra: differ[current; targetset]
  ]
  :out @ <out
}

/ given a targetset and a set of sets called the sourcesets
/ such that the union of the sourcesets equals the targetset,
/ we want the smallest subset of these sourcesets whose union
/ equals the targetset.
/ Algorithm: start with largest subset and then try to reduce
/ difference as fast as possible.
findmax:{[targetset; sourcesets]
  counts: #:'sourcesets
  out: ()
  ii: > counts
  current: sourcesets[ii[0]]
  out,: ii[0]
  extra: differ[targetset; current] / what more is needed
  while[0 < #extra
	extracount: #:'intersect[extra]'sourcesets
	highcount: |/extracount
	mm: & extracount = highcount / best sourcesets
	/ reorder to get lowest intersect with current
	currentcount: #:' intersect[current]'sourcesets[mm]
	lowcount: &/currentcount
	qi: currentcount ? lowcount
	out,: mm[qi]
	current,: sourcesets[mm[qi]]
	current?:
  	extra: differ[targetset; current]
  ]
  :out @ < out
}

findmaxrand:{[targetset; sourcesets]
  counts: #:'sourcesets
  out: ()
  ii: > counts
  nn: ii[* 1 _draw #ii]
  current: sourcesets[nn]
  out,: nn
  extra: differ[targetset; current] / what more is needed
  while[0 < #extra
	extracount: #:'intersect[extra]'sourcesets
	highcount: |/extracount
	mm: & extracount = highcount / best sourcesets
	/ reorder to get lowest intersect with current
	currentcount: #:' intersect[current]'sourcesets[mm]
	lowcount: &/currentcount
	qi: currentcount ? lowcount
	out,: mm[qi]
	current,: sourcesets[mm[qi]]
	current?:
  	extra: differ[targetset; current]
  ]
  :out @ < out
}

/ the similarity of two lists is 
/ size of intersect(list1;list2)/union(list1;list2)
findsim:{[list1; list2]
   :(#intersect[list1;list2])%(#?list1,list2)
}

/ Here we look for subsets of the trait and for supersets of the trait.
/ if we find an exact match, then we report it otherwise we report
/ approximate answers.
/ Want answers of the form: all of these cogs together are required
/ to give us this trait or any of these cogs is enough for this trait.
/ We try to give all such possible minimal answers. 
findglobalcause:{[traitmap; genomecomponents; origin]
  numattempts: 20
  maxattempts: 2000
  similarmin: 0.79
  out: ()
  i: 0
  while[i < #traitmap.trait / for each trait, get the genomeids  
	/ to get approximation, simply make subset be approximate
	tgenomeids: traitmap.genomeids[i]
	/ finding similarity
	zz: findsim[tgenomeids]'genomecomponents
	zzgoodnum: # & zz > similarmin
	zzindex: > zz
	zzindex@: !(zzgoodnum)
    	if[0 < #zzindex
	  out,: ,(`similar) , traitmap.trait[i] , ,zzindex
	]
	/ end of finding similarity
	jj1: (subset[tgenomeids]'genomecomponents) 
	indexjj1: &jj1
	jj2: (subset[;tgenomeids]'genomecomponents)
	indexjj2: & jj2
	jj: & jj1 & jj2
	if[0 < #jj 
		out,: ,(`exact) , traitmap.trait[i] , ,jj
		/ Some cog exactly matches the result.
		/ Remove the exact ones because already handled 
        	indexjj1: differ[indexjj1; jj]
        	indexjj2: differ[indexjj2; jj]
	]
	if[ ((0 = #jj) | nonperfectflag) & (0 < #indexjj1)
		x: multiintersect[genomecomponents[indexjj1]]
		multiflag: subset[x;tgenomeids]
		/ We know that x must be at least a superset
		/ since every member of x is a strict superset
		/ of tgenomeids.
		if[multiflag
			/ the two must be equal 
			/ So all of the genomecomponents[indexjj1]
			/ intersect to exactly tgenomeids (the genomes
			/ of this trait).
			/ It would be good to find the smallest set
			/ of genomecomponents[indexjj1]
			/ whose intersection equals tgenomeids.
			allmins: ()
			imin: findmin[tgenomeids; genomecomponents[indexjj1]]
				/ tries to find the smallest set
				/ whose intersection equals tgenomeids
			allmins,: ,imin
			zz: indexjj1[imin]
			out,: ,(`exactintersect) ,traitmap.trait[i], (,zz)
			sizemin: 0
			kk: 0
			step: 1
			while[(sizemin < #allmins) & (kk < maxattempts)
			 sizemin: #allmins
			 do[_ numattempts*(step^2)
			  imin2: findminrand[tgenomeids; genomecomponents[indexjj1]]
			  if[(~ imin2 _in allmins) & (~ (#imin2) > (#imin))
				allmins,: ,imin2
				zz: indexjj1[imin2]
				out,: ,(`exactintersect),traitmap.trait[i],(,zz)
			  ]
			  kk+: 1
			 ] / do statement
			 step+: 1
			] / while loop
		]
		if[~ multiflag
			out,: ,(`bigaretoobig), traitmap.trait[i], (,indexjj1)
		]
	]
	if[ ((0 = #jj) | nonperfectflag) & (0 < #indexjj2)
		x: ?,/genomecomponents[indexjj2]
		multiflag: subset[tgenomeids; x]
		if[multiflag
			/ the two must be equal 
			allmaxes: ()
			imax: findmax[tgenomeids; genomecomponents[indexjj2]]
			allmaxes,: ,imax
			zz: indexjj2[imax]
			out,: ,(`exactunion) , traitmap.trait[i], (,zz)
			sizemax: 0
			kk: 0
			step: 1
			while[(sizemax < #allmaxes) & (kk < maxattempts)
			 sizemax: #allmaxes
			 do[_ numattempts*(step^2)
			  imax2: findmaxrand[tgenomeids; genomecomponents[indexjj2]]
			  if[(~ imax2 _in allmaxes) & (~ (#imax2) > (#imax))
				allmaxes,: ,imax2
				zz: indexjj2[imax2]
				out,: ,(`exactunion) , traitmap.trait[i], (,zz)
			   ]
			   kk+: 1
			  ]
			  step+: 1
			] / no more changes, then declare victory
		]
		if[~ multiflag
			out,: ,(`smallaretoosmall),traitmap.trait[i],(,indexjj2)
		]
	]
	i+: 1
  ]
  :out
}

/ find differences of genome components and then record them.
/ A difference is interesting only if there is an intersection.
/ Again, side effect to origin
finddifs:{[]
  p: ,/(!#genomecomponents) ,/:\: (!#genomecomponents)
  i: &  p[;0] < p[;1]
  p@: i / first member is less than second
  outid: outcomp: ()
  j: 0
  while[j < #p
	g1: genomecomponents[p[j;0]]
	g2: genomecomponents[p[j;1]]
	if[hasintersect[g1;g2]
		x: differ[g1;g2]
		if[0 < #x
		  genomecomponents,: ,x
		  origin,: ` $ ($origin[p[j;0]]), ("_less_"), ($origin[p[j;1]])
		]
		x: differ[g2;g1]
		if[0 < #x
		  genomecomponents,: ,x
		  origin,: ` $ ($origin[p[j;1]]), ("_less_"), ($origin[p[j;0]])
		]
	]
	j+: 1
  ]
}

/ find intersections of genome components and then record them.
/ An intersection is interesting only if it is non-null.
/ UNUSED
findinters:{[genomecomponents]
  pairs: ,/(!#genomecomponents) ,/:\: (!#genomecomponents)
  i: &  pairs[;0] < pairs[;1]
  pairs@: i / first member is less than second
  outid: outcomp: ()
  j: 0
  while[j < #pairs
	g1: genomecomponents[pairs[j;0]]
	g2: genomecomponents[pairs[j;1]]
	if[(hasintersect[g1;g2]) & (~ subset[g1;g2]) & (~ subset[g2;g1])
		outcomp,: ,intersect[g1;g2]
		outid,: ` $ ($pairs[j;0]), ("_inter_"), ($pairs[j;1])
	]
	j+: 1
  ]
  :(outcomp; outid)
}

/ find intersections of genome components recursively and record them.
/ An intersection is interesting only if it is non-null.
/ Side effects to genomecomponents and origin
/ Only intersections are recursive as these are the most likely we think.
recurfindinters:{[]
  change: 1
  base: 0
  new: origin / residual extra
  while[change
	origsize: #origin
	change: 0
  	pairs: ,/(!#new) ,/:\: (!#origin)
  	i: &  pairs[;0] < pairs[;1]
  	pairs@: i / first member is less than second
	outid: outcomp: ()
  	j: 0
	while[j < #pairs
		g1: genomecomponents[pairs[j;0]]
		g2: genomecomponents[pairs[j;1]]
		if[(hasintersect[g1;g2]) 
		  if[(~ subset[g1;g2]) 
		    if[(~ subset[g2;g1])
		      yy: intersect[g1;g2]
			change: 1
			genomecomponents,: ,yy
			origin,: ` $ ($origin[base+pairs[j;0]]), ("_inter_"), ($origin[pairs[j;1]])
		    ]
		  ]
		]
		j+: 1
	]
	if[change
		base: origsize
		new: (#origin) - base
	]
  ]
}

/ find unions of genome components and then record them.
/ A union is always interesting unless there is only one component.
findunions:{[]
  pairs: ,/(!#genomecomponents) ,/:\: (!#genomecomponents)
  i: &   pairs[;0] < pairs[;1]
  pairs@: i / first member is less than second
  j: 0
  while[j < #pairs
	g1: genomecomponents[pairs[j;0]]
	g2: genomecomponents[pairs[j;1]]
	if[(~ subset[g1;g2]) & (~ subset[g2;g1])
	  genomecomponents,: ,?g1,g2
	  origin,: ` $ ($origin[pairs[j;0]]), ("_union_"), ($origin[pairs[j;1]])
	]
	j+: 1
  ]
}

/ translatetogene takes a result tuple whose origin expression 
/ is in terms of component ids
/ (= genome component ids) and converts it to genes
translatetogene:{[res]
  :(`traitgene; res[1]; converttogene[res[2]])
}

/ converttogene takes an expression of the form `"3_union_2_less_5"
/ and looks at the first gene of component 3, and ditto for 2 and 5 and
/ notes the result.
/ MUST CHANGE IF WE WANT TO MAP BACK TO GENE NAMES
converttogene:{[symb]
  str: $symb
  vec: & str _in\: "0123456789"
  newstr: ()
  size: #str
  i: 0
  numfound: 0
  num: ()
  while[i < size
	isin: i _in vec
	if[~ isin 
		if[numfound
			newstr,: $* components[0 $ num]
			numfound: 0
			num: ()
		]
		newstr,: str[i]
	]
	if[isin
		numfound: 1
		num,: str[i]
	]
	i+: 1
  ]
  if[numfound; newstr,: $* components[0 $ num]]
  :` $ newstr
}

 
/ EXECUTION WITH FIXED DATA


ab.symb: ()
ab.long: ()

processabbrev:{[line]
  i: line ? " "
  ab.symb,: ` $ line[!i]
  i: i+1
  ab.long,: ` $ line[i+!((#line) - i)]
}

b: ("Afu Archaeoglobus fulgidus"
"Hbs Halobacterium sp. NRC-1"
"Mac Methanosarcina acetivorans"
"Mth Methanothermobacter thermautotrophicus"
"Mja Methanococcus jannaschii"
"Mka Methanopyrus kandleri AV19"
"Tac Thermoplasma acidophilum"
"Tvo Thermoplasma volcanium"
"Pho Pyrococcus horikoshii"
"Pab Pyrococcus abyssi"
"Pya Pyrobaculum aerophilum"
"Sso Sulfolobus solfataricus"
"Ape Aeropyrum pernix"
"Sce Saccharomyces cerevisiae"
"Spo Schizosaccharomyces pombe"
"Ecu Encephalitozoon cuniculi"
"Aae Aquifex aeolicus"
"Tma Thermotoga maritima"
"Ctr Chlamydia trachomatis"
"Cpn Chlamydophila pneumoniae CWL029"
"Tpa Treponema pallidum"
"Bbu Borrelia burgdorferi"
"Syn Synechocystis"
"Nos Nostoc sp. PCC 7120"
"Fnu Fusobacterium nucleatum"
"Dra Deinococcus radiodurans"
"Cgl Corynebacterium glutamicum"
"Mtu Mycobacterium tuberculosis H37Rv"
"MtC Mycobacterium tuberculosis CDC1551"
"Mle Mycobacterium leprae"
"Cac Clostridium acetobutylicum"
"Lla Lactococcus lactis"
"Spy Streptococcus pyogenes M1 GAS"
"Spn Streptococcus pneumoniae TIGR4"
"Sau Staphylococcus aureus N315"
"Lin Listeria innocua"
"Bsu Bacillus subtilis"
"Bha Bacillus halodurans"
"Uur Ureaplasma urealyticum"
"Mpu Mycoplasma pulmonis"
"Mpn Mycoplasma pneumoniae"
"Mge Mycoplasma genitalium"
"Eco Escherichia coli K12"
"EcZ Escherichia coli O157:H7 EDL933"
"Ecs Escherichia coli O157:H7"
"Ype Yersinia pestis"
"Sty Salmonella typhimurium LT2"
"Buc Buchnera sp. APS"
"Vch Vibrio cholerae"
"Pae Pseudomonas aeruginosa"
"Hin Haemophilus influenzae"
"Pmu Pasteurella multocida"
"Xfa Xylella fastidiosa 9a5c"
"Nme Neisseria meningitidis MC58"
"NmA Neisseria meningitidis Z2491"
"Rso Ralstonia solanacearum"
"Hpy Helicobacter pylori 26695"
"jHp Helicobacter pylori J99"
"Cje Campylobacter jejuni"
"Atu Agrobacterium tumefaciens strain C58 (Cereon)"
"Sme Sinorhizobium meliloti"
"Bme Brucella melitensis"
"Mlo Mesorhizobium loti"
"Ccr Caulobacter crescentus CB15"
"Rpr Rickettsia prowazekii"
"Rco Rickettsia conorii")

processabbrev'b

localmode: 0
if[localmode = 0
/ take data from Mitch's spreadsheet format and put in positive and negative
/ traits

/ get rid of blanks at either end of the string
delendblanks:{[string]
  if[0 = #string; :""]
  if[string ~ ,"" ; :""]
  i: & ~ string = " "
  if[(#string) = (#i); :string]
  if[0 = (#i); :""]
  string: (- ((#string)  - (1 + *|i))) _ string
  :(*i) _ string
}

/ return the fields
parseline:{[line]
  i: 0, & line = ","
  cuts: i _ line
  cuts[0]: (","), cuts[0]
  : ` $ delendblanks' 1 _' cuts
}

/ figure out which traits there are
processline:{[traits; cuts]
  genometab.name,: ,cuts[0]
  / cuts: cuts _di 0  23 56
  i: & cuts = `"1"
  stuff: ` $ ("+") ,/: $traits[i]
  i: & cuts = `"0"
  stuff,: ` $ ("-") ,/: $traits[i]
  genometab.traits,: ,stuff
}
   

a: 0: "microorganisms"
traitsposs: parseline[a[0]]
/ traitsposs: traitsposs _di 0 23 56
genometab.name: ()
genometab.traits: ()
lines: processline[traitsposs]'parseline'1 _ a
	/ side effect to genometab
i1: genometab.name ? `"Methanobacterium t."
if[i1 < #genometab.name 
	genometab.name[i1]: `"Methanobacterium thermoautotrophicum"]
genometab.symb: ab.symb[ab.long ?/:  genometab.name]
i2: < genometab.symb
genometab.name@: i2
genometab.traits@: i2
genometab.symb@: i2
traits: genometab.traits / traits in the order of symbs
]
\l cog.k

if[localmode / localmode = 1 means testing with random data

/ CHANGE REQUIRED: traits must be provided in a list.
/ traits per genome.
/ INPUT data.
traits: (`t1 `t2 `t12
`t1 `t3 `t4 `t5
`t7 `t11
`t10 `t11)

/ CHANGE REQUIRED: must change gene names to gene numbers. 
/ Same gene name may have different numbers in different genes.
/ genes per genome
/ INPUT data; must be numeric starting at 0.
/ So, if real genes are in some other form, we must translate them to indexes
/ of genes.
genes: ( 0 1 
2 8
3 6 7
4 5 9)

nodes: ?,/genes

/ the matches here are for genes in different genomes
/ This can be a result of any blosum or pam matrix one wants.
/ INPUT data -- must be numeric again.
matches: (2 3
2 6 
2 4
3 4
1 2
4 6
1 3
1 6 
1 4
2 6
0 8
7 8)
] / end of localmode

/ EXECUTION
  

if[localmode / localmode = 1 means testing with random data
components: findconnected[nodes; matches] / connected components
	/ Might have gene n of genome A connected to
	/ many genes of genome B.
pairs: computevalence[matches]
connect.node: pairs[0]
connect.valence: pairs[1]

x: testclique[matches]'components / At this point testclique
	/ works on geneids. If we want to be more "liberal"
	/ we could let it work on component ids.
	/ To do that, we'd translate each component (having
	/ node ids) to a set of component ids and we'd perform
	/ our processing in testclique based on component ids.
goodcomps: & x
badcomps: & ~ x
fixedcomps: fixbad[matches; connect]' components[badcomps]
components: (components _di badcomps) , ,/ fixedcomps
inversemap[genes]
] / end of local mode
if[localmode = -1 / old way
 / Normally, we will get the traits. ???
 numgenomes: 39
 numtraits: 10 / number of traits per genome
 alltraits: ` $ ("t") ,/: $(numgenomes*numtraits) _draw (2*numtraits)
 / Traits associated with each genome.
 traits: (numgenomes; numtraits) # alltraits
 genes: genecomps
]
if[localmode = 0 / mitch's data
 genes: genecomps
]
traitinversemap[traits] / produces map from traits to genome sets


genomecomponents: :[localmode
	?:' findgenome'' components
	genomecomps]
origin: ` $ $ !#genomecomponents
/ Above, we have pure components. Below, we expand these as global variables.

if[localmode / ??? temporarily don't try this for real work
 recurfindinters[] / only intersection is recursive. The others are not.
 finddifs[genomecomponents]
 findunions[genomecomponents]
 results: findcause[traitmap; genomecomponents; origin]
]
if[0 = localmode
 results: findglobalcause[traitmap; genomecomponents; origin]
 / each line of results comes out as `howgood `trait `set of genome 
 / component indexes
 / The result is a set of genomecomponents whose intersection give
 / the result desired in the case that howgood = `exactintersect
 / or whose union gives the result desired in the case howgood = `exactunion.
 / So, we go back to coggenes of those genome components and try to present
 / the results for each genome concerned.
 / For `exactintersect, this means: 
 / find those genes having to do with the genomes in of the trait
 / and produce a table trait|genome|genes |`allneeded
 / for the union produce trait|genome|genes|`oneneeded
 / Remember: 
 / We use = for multiple genes within a single genome in a single cog.
 / We use # for multiple genes within a single genome that all belong
 / to cogs ranging over the same genomes (i.e. equivalent cogs)
 conclude.trait: ()
 conclude.genomesymb: ()
 conclude.genes: ()
 conclude.condition: ()
 conclude.genome: ()

 traitcogs.trait: ()
 traitcogs.cogs: ()
 traitcogs.condition: ()
 /  traitmap.trait:: ()
 /  traitmap.genomeids:: ()
 spitresult:{[res]
    i: traitmap.trait ? res[1]
    if[i = #traitmap.trait; !-14] / should never happen
    mygenomes: traitmap.genomeids[i]
    size: #mygenomes
    if[res[0] = `exactintersect
	conclude.trait,: size # res[1]
	conclude.genome,: genometab.name[mygenomes]
	conclude.genomesymb,: genometab.symb[mygenomes]
	conclude.condition,: size # `allneeded
	goodgenes: trim[coggenes; mygenomes; genomecomponents]'res[2]
	/ i think i should take the cogs here ???
	desiredgenes: +goodgenes
 	stringify:{[list] ` $ (-2) _ ,/ ($?list) ,\: (", ")}
	conclude.genes,: stringify'desiredgenes
	traitcogs.trait,: res[1]
	traitcogs.cogs,: ,cog[res[2]]
	traitcogs.condition,: `allneeded
    ]
    if[res[0] = `exactunion
	conclude.trait,: size # res[1]
	conclude.genome,: genometab.name[mygenomes]
	conclude.genomesymb,: genometab.symb[mygenomes]
	conclude.condition,: size # `oneneeded
	goodgenes: expand[coggenes; mygenomes; genomecomponents]'res[2]
	desiredgenes: +goodgenes
 	unionstringify:{[list] 
		list@: & ~ list = `
		x: ` $ (-2) _ ,/ ($?list) ,\: (", ")
		x}
	conclude.genes,: unionstringify'desiredgenes
	traitcogs.trait,: res[1]
	traitcogs.cogs,: ,cog[res[2]]
	traitcogs.condition,: `oneneeded
    ]
    if[res[0] = `exact
	conclude.trait,: size # res[1]
	conclude.genome,: genometab.name[mygenomes]
	conclude.genomesymb,: genometab.symb[mygenomes]
	conclude.condition,: size # `perfectmatch
	goodgenes: expand[coggenes; mygenomes; genomecomponents]'res[2]
	desiredgenes: +goodgenes
 	unionstringify:{[list] 
		list@: & ~ list = `
		x: ` $ (-2) _ ,/ ($?list) ,\: (", ")
		x}
	conclude.genes,: unionstringify'desiredgenes
	traitcogs.trait,: res[1]
	traitcogs.cogs,: ,cog[res[2]]
	traitcogs.condition,: `perfectmatch
    ]
    if[res[0] = `similar
	conclude.trait,: size # res[1]
	conclude.genome,: genometab.name[mygenomes]
	conclude.genomesymb,: genometab.symb[mygenomes]
	conclude.condition,: size # `similar
	goodgenes: expandforsim[coggenes; mygenomes; genomecomponents]'res[2]
	desiredgenes: +goodgenes
 	unionstringify:{[list] 
		list@: & ~ list = `
		x: ` $ (-2) _ ,/ ($?list) ,\: (", ")
		x}
	conclude.genes,: unionstringify'desiredgenes
	traitcogs.trait,: res[1]
	traitcogs.cogs,: ,cog[res[2]]
	traitcogs.condition,: `similar
    ]
 }

 / for each genome in targetgenomes in order, produce those genes
 / that apply. Some genomes in targetgenomes may not be in
 / genomecomplist[genomecompindex] 
 / in which case empty lists should be produced
 expand:{[coggenelist; targetgenomes; genomecomplist; genomecompindex]
	out: (#targetgenomes) # `
	i: intersectleftindexes[targetgenomes; genomecomplist[genomecompindex]]
	x: coggenelist[genomecompindex]
	cogname: cog[genomecompindex]
        y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x
	out[i]: y
	:out
 }

/ when doing similarity, we don't have a 1-1 correspondence between
/ genomes covered by trait and genomes covered by similar cog.
/ So, we take only those genes that are on the genomes that are
/ in the intersection. 
 expandforsim:{[coggenelist; targetgenomes; genomecomplist; genomecompindex]
	out: (#targetgenomes) # `
	i: intersectleftindexes[targetgenomes; ?genomecomplist[genomecompindex]]
	i2: intersectleftindexes[?genomecomplist[genomecompindex];targetgenomes]
	x: coggenelist[genomecompindex]
	x: x[i2]
	cogname: cog[genomecompindex]
        y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x
	out[i]: y
	:out
 }

 / for each genome in targetgenomes in order, produce those genes
 / that apply. Some genomes in genomecomplist[genomecompindex] may
 / not be in targetgenomes in which case they should be ignored.
 trim:{[coggenelist; targetgenomes; genomecomplist; genomecompindex]
	i: intersectleftindexes[genomecomplist[genomecompindex]; targetgenomes]
	x:coggenelist[genomecompindex;i]
	cogname: cog[genomecompindex]
	y: ` $ ($ ((#i) # cogname)) ,' (": ") ,/: $x
	:y
 }

 x: spitresult'results
 
 dumptable[`conclude; conclude; "tempconclude"]
 dumptable[`traitmap; traitmap; "temptraitmap"]
 dumptable[`traitcogs; traitcogs; "temptraitcogs"]
 / dumptable[`genometab; genometab; "tempgenometab"]
 / "output" 0: spitresult'results
 / just flattens out all the genes related to a trait but does
 / not distinguish union from intersection.
 spitresultold:{[res]
	x: ($res[0]) , ("|"), ($res[1]), ("|")
	x2: ,/coggenes[res[2]]
	/ :x ,/: $x2
	x3: (-2) _ ,/ ($x2) ,\: (", ")
	:x, x3 
 }
]
if[localmode
	/ the result it gives has to do with genome components and
	/ is in the form: 
	/ (`exact;2;`"3")
	/ (`exact;3;`"2_less_5")
	/ To find the genes that cause it, just use the genomecomponent ids
	/ as an index into components.
	/ e.g. components[2] is a cause and components[5] is a repressor
	/ for trait 3.
x: results / each line of results

/ Now find out which genes are involved
/ We characterize each connected component by its first gene.
y: translatetogene'results
]

\\


/ END OF APPLICATION SPECIFIC