/ Code author: Dennis Shasha, 2000, 2001, 2002, 2003.
/ Joint work with Tsong-Li Wang and Kaizhong Zhang.
/ pathfix.k
/  k pathfix +q treeincaps +d treeindb
/  k pathfix +q treeincaps2 +d treeindb
/   k pathfix +q treeincaps3 +d treeindb +dumpmatch

/ To do: Have a hash table of size 998 (mod 997, which is a prime).
/ Take labels and hash them. Then take pairs of labels and hash them.
/ Create two tables: hashval is the index and then we have countlist and 
/ treelist. Call ths file treecount.
/ Then we have tree and group list. Call this file treegroup.
/ So, then we can have a pathfilter program that will take a query
/ and will spit out pathfix queries for each tree.
/ 
/ Second program that creates a single hash them.
/ March 2002: improve hash function
/ September, 2001: eliminate the preorder and postorder filter. 
/ That depends on the position which we don't want.
/ April, 2001: in order to allow multiple children of a node with the same
/ label, just do a post-processing check on the tree that results. Still to do.
/ March 12, 2001: does a dumpmatch even in the case that we have don't cares
/ March 12, 2001: multiple levels of don't cares starting at the root.
/ Algorithm is to observe the first don't care at the root and
/ make that a partition. All other don't cares are eliminated.
/ The non-don't care part of the tree is partitioned into maximally sized
/ subtrees.
/ March 8 to do: dump matches.
/ For * or ? in the root of the query tree, perhaps split the problem
/ into subtrees at the very beginning by allowing any node in the data tree
/ to match the root and then go on.
/ March 8, 2001 solution: treat the root specially as a partition and have
/ it match anything. As of now, we cannot handle a dont care at the root
/ followed by a don't care below. 
/ March 7, 2001: fixed vldc stuff so that it works even if there
/ are multiple vldcs one on top of another.
/ Optimization flag: do some preliminary filtering to reduce the time comparing
/ the suffix tree. February 2001.
optflag: 0 / I don't know whether this helps, please toggle on and off.
/ 
/ 
/ February 2001
/ Variable length don't cares (*) fixed length don't cares 
/ (?)  to support XPATH.
/ For fixed length don't cares, if you have no *, then
/ do a subsititution for the ? in place and verify length.
/ When it finds a match, it reports whether the match is to a root or not.
/ Algorithm: partition the query tree into connected subtrees 
/ having no don't cares (so if a don't care is the parent of several
/ children, each child is the root of its own tree).
/  Match each of those. Then glue them together
/ to see where they match in the data tree. Gluing consists of 
/ of the following: A node x in the data tree is the root of a subtree
/ that matches the query tree if the following holds:
/ The partition of the query tree containing the root of the query tree
/ matches the data tree and the root of the query tree r_all maps to x.
/ (This works if r_all is not a don't care.)
/ Consider the path p from 
/ a non-root subtree root r_sub in the query tree to 
/ r_all. Now r_sub maps to possibly many nodex x_sub_1, x_sub_2 etc.
/ in the data tree. There should be at least one, say x_sub_j such that
/ the path from x_sub_j to x matches (with * and ?) the path from r_sub
/ to r_all.
/ Pictures we need: query tree with vldcs and ?.
/ Partitions of that query tree.
/ Matches of that query tree in the data tree.
/ Paths from root of a partition (r_sub) to the root of the query tree.
/ Corresponding path from root of a data  tree (x_sub_j) to a node x in
/ data tree to which r_all maps.
/ Just a note:
/ Xpath syntax a/b/c -- if anywhere.  /a/b/c if starting at root.
/ a/b//c  means variable length don't care between b and c.
/ a/b[d/e/f]/g  for trees.
/ Output goes to data.out

/ To do: handle multiple query trees at once. Done January 2001.

/ This is a second way to search for trees inside of other trees.
/ The path fix techniques entails an encoding of all suffixes
/ of all paths and uses a suffix array.
/ This one answers queries slower (proportional to the number
/ of data trees, at least partially), but requires only linear storage.
/ To do: Which nodes in the target tree actually match?
/ At this point, each line in the output is a different tree.

/ 9/9/00: superoptimum turns out to be worse than just the clever point
/ that there is no reason to test two strings for a length longer than
/ the longest path length.

/ Another other approach (starting with an idea of Divesh) is to
/ make use of the fact that if we take the postorder and preorder
/ traversal of a query tree and each data tree, then
/ we can determine whether the query tree is "inside" the data tree
/ (order preserving way) 
/ by seeing whether both the postorder and preorder traversals
/ of the query tree are gapped (i.e. not necessarily consecutive) substrings
/ of the data tree.
/ e.g. abc is a gapped substring of axybzwxcdef
/ The trouble is that this measure by itself allows the possibility
/ that a two level tree:
/ a
/  b
/  c
/ having postorder traversal: bca and preorder: abc
/ can match the following for example
/ a
/  v
/   b
/  c
/ having postorder traversal: bvca and preorder: avbc
/ So, it's as if there are arbitrary don't cares in the downward
/ arcs in the data tree.
/ To refine this, we extend this idea to find good candidates
/ for roots of subtrees within the data tree that may 
/ exactly match the query tree.
/ The basic idea is to find those roots that contain all of the
/ data tree nodes as descendants.
/ Then there is a path comparison of the final tree.
/ Actually we have stopped implementation of the Divesh algorithm.

/ Sept 2000: Construct a suffix array of every path in each tree.
/ Take each data tree T,
/ Find all the root to leaf paths in T call them P_T
/ Then concatenate those paths with a # in between (not in any of the labels)
        / yielding C_T
/ Then form of a suffix array from C_T and then search on that.
/ The space is quadratic in the size of T.    
/ The suffix array construction is
/ made more efficient by comparing only up to the maximum length
/ path (since comparisons involving the delimiter character #
/ are meaningless).
/ Other techniques are possible, e.g. using a smaller length and
/ then correcting after the fact.

/ TIME TESTING
/ \l time 


/ NEW STUFF -- small data approach

/ return all the paths for this tree id starting at a root
/ In detail: each path is formed; let us say it is of length k.
/ Then there will be k node ids one for each starting point. 
/ e.g. if there is a path from root to leaf in T consisting of letters
/ a b c and going through nodes 0 3 5, then the result of the path
/ will be (a, b, c), (T_a, T_b, T_c).
/ Then the suffix paths will be `c T_c, `bc T_b, `abc T_a.
producepathsfromnode:{[treeid; root]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j] 
  labellabels: treelabel.label[j] 
  paths: extendpath[root; parent; children]
  labels: converttolabel[labellabels; labelnodeids]'paths
  :labels
}

/ still use producepaths[treeid] and just use the labels which
/ are the first argument

/ END NEW STUFF


/ CREATE POSTORDER

treepost.treeid: ()
treepost.postid: ()
treepost.nodeid: ()
treepost.label: ()

treepre.treeid: ()
treepre.preid: ()
treepre.nodeid: ()
treepre.label: ()

/ for each node give its preorder number and postorder number and level
/ where 0 is the root
treestats.treeid: ()
treestats.nodeid: ()
treestats.preid: ()
treestats.postid: ()
treestats.level: ()
treestats.parent: ()
treestats.label: ()


treestatsdb.treeid: ()
treestatsdb.nodeid: ()
treestatsdb.preid: ()
treestatsdb.postid: ()
treestatsdb.level: ()
treestatsdb.parent: ()
treestatsdb.label: ()


/ fill the treestats table.
/ May not need this because we are going to try a different strategy
/ whereby we look at roots and see whether the paths between them
/ are the right ones.
fillall:{[treeid]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j]
  labellabels: treelabel.label[j]
  treestats.treeid,: (#j) # treeid
  z: treelabel.nodeid[j]
  treestats.label,: labellabels[<z]
  z@: <z
  treestats.nodeid,: z
  myroot: * differ[parent;,/children]
  y: findpost[myroot; parent; children] 
	/ order of nodes according to postorder
  treestats.postid,: y ?/: z
	/ e.g. suppose we have nodes with labels 0, 10, 20, 40
	/ and in postorder numbering 40 is first then 20 then 10 then 0.
	/ So, we would want 40 to have postid 0, 20 postid 1 etc.
	/ Now z are the labels in nodeid order 0 10 20 40
	/ y are the labels in postorder 40 20 10 0
	/ display accoring to postorder
  y: findpre[myroot; parent; children] / order nodes in their preorder
  treestats.preid,: y ?/: z
  pair: findlevel[0; myroot; parent; children]
  x1: pair[0]
  x2: pair[1]
  treestats.level,: x2[< x1]
  treestats.parent,: finddad[parent; children;z] / root is -1
}

/ finds parents of all nodes having parents.
finddad:{[parent; children; nodes]
  counts: #:'children
  dads: ,/counts #' parent
  dads,: (-1) / need this for root
  kids: ,/children
  i: kids ?/: nodes
  :dads[i]
}


/ findpost produces a postorder numbering
/ Does not require that initial nodes are in preorder.
findpost:{[parentid; parent; children]
  j: parent ? parentid
  if[j = #parent; :,parentid]
  kids: children[j]
  out: ()
  i: 0
  while[i < #kids
	out,: findpost[kids[i]; parent; children]
	i+: 1
  ]
  out,: parentid
  :out
}

/ the root
findroot:{[treeid]
  i: & (treestats.treeid = treeid) & (treestats.level = 0)
  if[~ 1 = #i; !-4] / bad tree format; perhaps two trees have the same name.
  :treestats.nodeid[*i]
}

/ given a string like "treex_19", determine whether 19 is a root
/ of treex
determineroot:{[treestring]
	x: |treestring
	i: x ? ("_")
	node: 0 $ | x[!i]
	tree: ` $ | (i+1) _ x
	out: :[findroot[tree] = node
		treestring, (" (root)")
		treestring, (" (nonroot)")]
	:out
}


/ findpre produces a preorder numbering
findpre:{[parentid; parent; children]
  j: parent ? parentid
  if[j = #parent; :,parentid]
  kids: children[j]
  out: ()
  out,: parentid
  i: 0
  while[i < #kids
	out,: findpre[kids[i]; parent; children]
	i+: 1
  ]
  :out
}

/ findlevel produces a level for each node
findlevel:{[level; parentid; parent; children]
  j: parent ? parentid
  if[j = #parent; :(parentid;level)]
  kids: children[j]
  outnode: ()
  outlevel: ()
  outnode,: parentid
  outlevel,: level
  i: 0
  while[i < #kids
	pair:  findlevel[level+1; kids[i]; parent; children]
	outnode,: pair[0]
	outlevel,: pair[1]
	i+: 1
  ]
  :(outnode; outlevel)
}

/ fill the postorder table.
/ unused
fillpost:{[treeid]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j]
  labellabels: treelabel.label[j]
  treepost.treeid,: (#j) # treeid
  treepost.postid,: !(#j)
  myroot: * differ[parent;children]
  y: findpost[myroot; parent; children] / order preorder nodes in their
	/ display accoring to postorder
  treepost.nodeid,: y / original labeling of these nodes
  treepost.label,: labellabels[labelnodeids ?/: y]
}


/ fill the preorder table.
/ unused
fillpre:{[treeid]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j]
  labellabels: treelabel.label[j]
  treepre.treeid,: (#j) # treeid
  treepre.preid,: !(#j)
  myroot: * differ[parent;children]
  y: findpre[myroot; parent; children] / order nodes in their preorder
  treepre.nodeid,: y / original labeling of these nodes
  treepre.label,: labellabels[labelnodeids ?/: y]
}

/ returns 1 if there is a vldc and 0 otherwise.
/ Includes a simple ? as well.
containsvldc:{[treeid]
  ilab: & treelabel.treeid = treeid
  :((`"*") _in treelabel.label[ilab]) | ((`"?") _in treelabel.label[ilab])
}

/ END OF PREORDER


/ USE PREORDER AND POSTORDER TO DETERMINE THAT SOME NODE IS BELOW ANOTHER

/ Use this section to determine if all the trees that match come together
/ properly.
/ 1. We are given the roots in query tree and their associated matches
/ in target trees.
/ 2. In a tree having vldcs, a match to a target tree x is good
/   if the matching query tree roots are in the same order
/   as their corresponding tree roots in x, using both postorder
/   and preorder.
/ Breakdown: find preorder of a given sequence of nodes; find postorder
/ numbers of a given sequence of nodes for a tree.
/ Determine whether two sequences are in the same relative order. 
/ Since there may be many matches, we want to avoid working
/ on the cartesian product.

/ Determine whether subtrees combine properly by examining paths.
/ For a given query tree qtree
/ we will have a sequence of roots qroots.
/ Associated with each qroot we will have a list of target trees
/ and root target node ids (a given target tree may
/ appear many times in a given list since for a particular qroot,
/ the same target tree may occur many times)
/ Our job is to determine which target trees match qtree
/ at all qroots and in the relationship.
/ Algorithm: 
/  see which target trees match at all qroots (i don't know how
/	to handle thresholds yet, not implemented)
/  filter based on preorders and postorders (not clear whether so useful)
/  for each such target tree try to see whether labels match on paths
/  between roots.
/ Assume qroots[0] is the root of qtree and qtree has the vldcs.
checkformatch:{[qtree; qroots; targtreeslist; targrootslist]
   out: ()
   numqroots: #qroots
   / Find targ trees for each qroot (list for each one)
   / that appear often enough
   x: ?:' targtreeslist
   y: multiintersect[x] / good ones
   if[0 = #y; :(();())] / no match
   if[1 = #qroots
	/ only one path, so can be quick
	z:(y;*targrootslist)
	if[dumpmatchflag
	   dontcareroots,:,` $ ($*y) ,/: ("_") ,/: ($*targrootslist)
	]
	:z
   ]
   i: 0
   while[i < numqroots
	/ jj: intersectleftindexes_multi[targtreeslist[i]; ,/y]
	jj: intersectleftindexes_multi[targtreeslist[i]; y]
	targtreeslist[i]@: jj
	targrootslist[i]@: jj
	i+: 1
   ]
   qrootspre: converttopre[qtree; qroots]
   qrootspost: converttopost[qtree; qroots]
   / Done with finding target trees for each qroot that appear often enough
   foundroots::()
   x2: & goodmatch[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist]'y 
   :(y[x2]; foundroots[x2])
}

/ determines if some tree matches 
/ Assume that qroots[0] is the root of qtree.
/ First filter out based on level,
/ then (now eliminated) preorder and postorder (eliminated because
/ it depends on sibling order).
/ Then trace each root associated with qroots[i] where i > 0
/ up to qroots[0]. 
goodmatchold:{[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist; targtree]
    roots: ()
    levels: ()
    relroots_pre: ()
    relroots_post: ()
    i: 0
    / for each qroot, find roots from tree targtree and convert to pre and post
    while[i < #qrootspre 
	jj: & targtreeslist[i] = targtree
	x: targrootslist[i][jj]
	roots,: ,x
	levels,: ,converttolevel[targtree; x]
	relroots_pre,: ,converttopre[targtree;x]
	relroots_post,: ,converttopost[targtree;x]
	i+: 1
    ]
    / filter out roots that can't be above others
    / So a root level can survive only if it is above at least one root
    / of each of the other trees.
    rootfilter:{[levels]
	rootlevs: levels[0]
	otherlevs: 1 _ levels
	highestother: &/ |/' otherlevs / take lowest of each subtree
		/ and highest of the lowests
	x:& rootlevs < highestother
	:x
    }
    jj: rootfilter[levels] / get rid of roots that are impossible
	/ trying to cut down cartesian product
    relroots_pre[0]@: jj
    relroots_post[0]@: jj
    roots[0]@: jj 
	/ End of levels part
   if[0 / preorder and postorder stuff is wrong as it depends on sibling order
    / first match preorder of each cartesian product
    cart_pre: constructcartprod[relroots_pre]
    goodones: & sameorder[qrootspre]'cart_pre / this is no good as it
	/ depends on the order of the paths
    if[0 = #goodones; :0]
    / now if necessary match the postorders
    cart_post: constructcartprod[relroots_post]
    jj: & sameorder[qrootspost]'cart_post[goodones]
	/ need only check those that match the preorder
   ] / cut out preorder and postorder stuff
    cart_root: constructcartprod[roots]
    / cart_root@: goodones[jj] / cut out preorder and postorder stuff
    / reorder these so first found start at root of targtree
    troot: findroot[targtree]
    mm: & cart_root[;0] = troot
    if[0 < #mm / roots first
	x: cart_root[mm]
	y: cart_root _di mm
	cart_root: x,y
    ]
    finalroots: ()
    i: 0
    / Now you want to match paths from each root of a subtree in the
    / data tree to some root in the data.
    / match holds a pair: query tree subtree root, target tree subtree root
    / if the path from the former to qroots[0] corresponds to the path
    / from the latter to its zero.
    match.qstart::() 
    match.tstart::()
    match.tend::()
    while[i < #cart_root
	flag: comparepath[qtree; qroots; targtree; cart_root[i]]
	if[flag 
		foundroots,: * cart_root[i],() 
      		if[dumpmatchflag
		   dontcareroots,:,` $ ($targtree) ,/: ("_") ,/: ($cart_root[i])
		]
		:1
	]
	i+: 1
    ]
    foundroots,: -33
    :0
}

/ determines if some tree matches 
/ Assume that qroots[0] is the root of qtree.
/ First filter out based on level,
/ then (now eliminated) preorder and postorder (eliminated because
/ it depends on sibling order).
/ Then trace each root associated with qroots[i] where i > 0
/ up to qroots[0]. 
goodmatch:{[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist; targtree]
    roots: ()
    levels: ()
    i: 0
    / for each qroot, find roots from tree targtree and convert to pre and post
    while[i < #qrootspre 
	jj: & targtreeslist[i] = targtree
	x: targrootslist[i][jj]
	roots,: ,x
	levels,: ,converttolevel[targtree; x]
	i+: 1
    ]
    / filter out roots that can't be above others
    / So a root level can survive only if it is above at least one root
    / of each of the other trees.
    rootfilter:{[levels]
	rootlevs: levels[0]
	otherlevs: 1 _ levels
	highestother: &/ |/' otherlevs / take lowest of each subtree
		/ and highest of the lowests
	x:& rootlevs < highestother
	:x
    }
    jj: rootfilter[levels] / get rid of roots that are impossible
	/ trying to cut down cartesian product
    roots[0]@: jj 
	/ End of levels part
    cart_root: constructcartprod[roots]
    troot: findroot[targtree]
    mm: & cart_root[;0] = troot
    if[0 < #mm / roots first
	x: cart_root[mm]
	y: cart_root _di mm
	cart_root: x,y
    ]
    finalroots: ()
    i: 0
    / Now you want to match paths from each root of a subtree in the
    / data tree to some root in the data.
    / match holds a pair: query tree subtree root, target tree subtree root
    / if the path from the former to qroots[0] corresponds to the path
    / from the latter to its zero.
    match.qstart::() 
    match.tstart::()
    match.tend::()
    while[i < #cart_root
	flag: comparepath[qtree; qroots; targtree; cart_root[i]]
	if[flag 
		foundroots,: * cart_root[i],() 
      		if[dumpmatchflag
		   dontcareroots,:,` $ ($targtree) ,/: ("_") ,/: ($cart_root[i])
		]
		:1
	]
	i+: 1
    ]
    foundroots,: -33
    :0
}

/ Do paths to root match in the two trees? 
/ If so, then return 1 else 0.
/ Side effect to match structure.
comparepath:{[qtree; qroots; targtree; troots]
  numroots: #qroots
  i: 1
  while[i < numroots
	jj: & (match.qstart = qroots[i]) & (match.tstart = troots[i]) & (match.tend = troots[0])
	if[0 = #jj / not yet found
	 x:pathmatch[qtree;qroots[i]; qroots[0]; targtree; troots[i]; troots[0]]
	 if[x / save match
		match.qstart,: qroots[i]
		match.tstart,: troots[i]
		match.tend,: troots[0]
	 ]
	 if[~x; :0] / no match
	]
	i+: 1
  ]
  :1
}

miniformold:{[x]   y:  ,/ ($x) ,\: ("_");  :y} 

miniform:{[x]   
  if[~ (`"*") _in x
    :  ,/ ($x) ,\: ("_")  
  ]
  i: & x = `"*"
  if[0 _in i
	i: i _di 0
	x[1]: ` $ ("*"),($x[1])
  ]
  x[i-1]: ` $ ($x[i-1]) ,\: ("*")
  x: x _dv `"*"
  x: x _dv `"**"
  y:  ,/ ($x) ,\: ("_")  
  :y
} 

/ get all subsequences of list of length count
formlists:{[count; list] 
   out: ()
   i: 0
   while[(i < 1 + (#list) - count) 
     :[(count > 1)
	out,: ,list[i],/: formlists[count-1; (i+1) _ list]
	out,: ,list[i] ]
	i+: 1
   ]
   :out
}

/ The query may have variable length don't cares
/ Match takes this into account.
pathmatch:{[qtree;qsubroot; qallroot; targtree; tsubroot; tallroot]
  qp: findpath[qtree;qsubroot; qallroot]
  tp: findpath[targtree; tsubroot; tallroot]
  tps: miniform[tp]
  if[(`"?" _in qp) & (`"*" _in qp)
	/ fill the ?s in wherever they are found
	/ but do so in a way that allows all possible substitutions to *
	/ So get all subsequences of tp of length number of ?s
	/ and try each one as the substitution of ?.
	/ Not optimal but not too bad either.
	/ Have to do this because ? is replacing a symbol not a single letter.
	questpos: & qp = `"?"
	count: #questpos
	x: ,/formlists[count; tp] / get all subsequences of tp of 
		/ length count
	jj: 0
	/ Because of conversion to strings, we may allow some extra matches.
	while[jj < #x
		qptemp: qp
		qptemp[questpos]: x[jj]
		qptemps: miniform[qptemp]
		if[tps _sm qptemps; :1]
		jj+: 1
	]
	:0
  ]
  if[(`"?" _in qp) & (~ `"*" _in qp)
	if[~ (#qp) = (#tp); :0]
	/ fill the ?s in wherever they are found at their positions.
	questpos: & qp = `"?"
	qptemp: qp
	qptemp[questpos]: tp[questpos] / one to one substitutsion
	:qptemp ~ tp
  ]
  if[~ `"?" _in qp
    :tps _sm miniform[qp]
  ]
}

/ find a path in a tree from a subnodeid to a supernodeid
/ These start out as symbols but end up as strings.
/ Express as a string.
findpath:{[treeid; subnodeid; supernodeid]
   i: & treestats.treeid = treeid
   nodes: treestats.nodeid[i]
   pars: treestats.parent[i]
   labs: treestats.label[i]
   j: nodes ? subnodeid
   if[j = #nodes; !-3]
   out: labs[j]
   curnode: subnodeid
   curpar: pars[j]
   while[~ curnode = supernodeid
	j: nodes ? curpar
	out,: labs[j]
	curnode: curpar
	curpar: pars[j]
   ]
   :out
}

/ construct the cartesian product of a list of lists
constructcartprod:{[listoflists]
    numlists: #listoflists
    start: listoflists[0]
    i: 1
    while[i < numlists
	start: ,/start ,/:\: listoflists[i]
	i+: 1
    ]
    :start
}


/ given two lists of node numbers (say both using preorder numbering)
/ return 1 iff the two lists are in the same order
/ e.g. if preorder numbering is 2 3 1 6 in one case and 3 7 5 8 in the other
sameorder:{[list1; list2] (< list1) ~ (<list2)} 

/ given a list of node ids of a tree, convert those ids to preorder
/ will do this for each list associated with a query root
converttopre:{[mytree; list]
  i: & treestats.treeid = mytree
  j: treestats.nodeid[i] ?/: list
  :treestats.preid[i][j]
}

/ given a list of node ids of a tree, convert those ids to preorder
/ will do this for each list associated with a query root
converttolevel:{[mytree; list]
  i: & treestats.treeid = mytree
  j: treestats.nodeid[i] ?/: list
  :treestats.level[i][j]
}

converttopost:{[mytree; list]
  i: & treestats.treeid = mytree
  j: treestats.nodeid[i] ?/: list
  :treestats.postid[i][j]
}


/ For a given query tree qtree
/ we will have a sequence of roots qroots.
/ Associated with each qroot we will have a list of target trees
/ and root target node ids (a given target tree may
/ appear many times in a given list since for a particular qroot,
/ the same target tree may occur many times)
/ Our job is to determine which target trees match qtree
/ at all qroots and in the relationship.
/ Algorithm: 
/  see which target trees match at all qroots (i don't know how
/	to handle thresholds yet, not implemented)
/  for each such target tree try to match preorders and postorders. 
/ UNUSED
checkformatch_prepost:{[qtree; qroots; targtreeslist; targrootslist]
   out: ()
   numqroots: #qroots
   / Find targ trees for each qroot (list for each one)
   / that appear often enough
   x: ?:' targtreeslist
   y: multiintersect[x] / good ones
   if[0 = #y; :()] / no match
   i: 0
   while[i < numqroots
	jj: intersectleftindexes_multi[targtreeslist[i]; y]
	targtreeslist[i]@: jj
	targrootslist[i]@: jj
	i+: 1
   ]
   / Done with finding target trees for each qroot that appear often enough
   qrootspre: converttopre[qtree; qroots]
   qrootspost: converttopost[qtree; qroots]
   numtrees: #y
   x2: & goodmatch'[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist]'y 
   y[x2]
}


/ given a list of matching trees with their node ids, split these
/ so we get node ids and tree ids separately
/ initial: ,`treea_6 `treea_0 `treeb_0 `treeb_6 `treec_6 `treec_0
findtreeandroots:{[list]
  x: ,/list
  converttopair:{[lab] 
	i: lab ? "_"
	:(` $ lab[!i]; 0 $ (i+1) _ lab)
  }
  pair: converttopair'$x
  if[0 = #pair; :(();())]
  :(pair[;0]; pair[;1])
}

/ SET STUFF


/finds intersection of two lists
/ fastest of all
intersect: {[x;y]
  x,: ()
  y,: ()
  if[(#x) < (#y)
    i: x ?/: y
    j: & i < #x
    :x[?i[j]]
  ]
  i: y ?/: x
  j: & i < #y
  :y[?i[j]]
}

/finds intersection of two lists
/ fastest of all
hasintersect: {[x;y]
  x,: ()
  y,: ()
  i: x ?/: y
  : (&/i) < #x
}

/ x is a proper subset of y
propersubset:{[x;y]
  x,: ()
  y,: ()
  if[~ (#x) < (#y); :0] / must be smaller
  :(#x) = (#intersect[x;y])
}

subset:{[x;y] (#x) = (#intersect[x;y])}

subset:{[x;y] 
  j: y ?/: x
   / if any js are as big as #y, then there is a member of x not in y
  :0 = |/ j = #y
}

/ proportion of x that y covers
approxsubset:{[x;y]
  j: y ?/: x
  / any equal to #y are in x but not in y
  :(# & j < (#y)) % (#x)
}

/ finds indexes in x and y that intersect
/ If x and y are both sets, then the results will be of the same length
/ fastest of all
intersectindexes: {[x;y]
  i: x ?/: y / where each y hits
  j: & i < #x / those ys that hit
  :(i[j];j)
}

/finds indexes in x that intersect with y
intersectleftindexes: {[x;y]
  i: x ?/: y / where each y hits
  j: & i < #x / those ys that hit
  :i[j]
}

/ finds indexes in x that intersect with y
/ even if x is a multiset
intersectleftindexes_multi: {[x;y]
  k: y ?/: x
  : & k < #y
}

/finds intersection of two lists
/ and returns index pairs of matches. Assumes no duplicates
/ in either list
intersectbothindexes: {[x;y]
  x,: ()
  y,: ()
  i: x ?/: y
  pairs: (i ,' (!#y))
  k: & pairs[;0] < #x
  :pairs[k]
}


/ this is a set intersection so we remove duplicates
multiintersect:{[lists]
 size: #lists
 if[0 = #lists; :()]
 / if[2 > size; :lists]
 if[2 > size; :*lists]
 first: lists[0],()
 jj:  first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first
 x: @[(1+#first) # 0; jj; + ; 1]
 x: (-1) _ x / delete missing entry
 kk: & x = size - 1
 :first[kk]
}

/ this is a set intersection so we remove duplicates
/ just like multiintersect except that the maxdist allows
/ a result if it is present in (#lists) - maxdist
/ of the input lists
threshintersect:{[lists; maxdist]
 size: #lists
 if[2 > size; :lists]
 / first: lists[0],() / no good because might be empty
 first: (?,/lists) ,() / union
 / jj:  first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first
 jj:  first ?/: (,/ ?:' lists) / find indexes in first
 x: @[(1+#first) # 0; jj; + ; 1]
 x: (-1) _ x / delete missing entry
 / kk: & x > size - (2 + maxdist)
 kk: & x > size - (1 + maxdist)
 :first[kk]
}

/ A faster difference
differ:{[x;y]
  x,: ()
  y,: ()
  i: y ?/: x
  j: & i = #y
  :?x[j]
}

/ Count the difference
differcount:{[x;y]
  i: y ?/: x
  :# & i = #y
}

/ END OF SET STUFF
/ READ TREE BY CONVERSION FROM A FILE OF THE FORM
/ # tree|treeid|parentid|childrenid
/ treea|0|1 2 5 6 8
/ treea|2|3 4
/ treea|6|7
/ 
/ # treelabel|treeid|nodeid|label
/ treea|0|x
/ treea|1|y
/ treea|2|z
/ treea|3|w
/ treea|4|v
/ treea|5|q
/ treea|6|r
/ treea|7|p
/ treea|8|r


inputfile: "treein"
if[0 < #_i
	inputfile: _i[0]
]
/ Null values should be represented by having only blanks in a field.
/ Tables are separated by a single blank line

/ a list is numeric if each member is either empty, a period
/ or a digit
isnumeric:{[list]
  s: ,/list
  nums: "1234567890"
  x: nums ?/: s
  y: (#nums) > |/x / if greater then everything in s met its match
  if[1 = y; :2] / 2 means integer
  nums: "1234567890."
  x: nums ?/: s
  :(#nums) > |/x / if greater then everything in s met its match
	/ 1 means float and 0 means not a number
}

/ parses a field based on vertical bars
getfields:{[line]
  i: line = "|"
  j1: &i
  j2: &~i
  line @:j2
  size: #j1
  :(0,(j1 - !size)) _ line
}

/ get rid of blanks at either end of the string
delendblanks:{[string]
  if[0 = #string; :""]
  if[string ~ ,"" ; :""]
  i: & ~ string = " "
  if[(#string) = (#i); :string]
  if[0 = (#i); :""]
  string: (- ((#string)  - (1 + *|i))) _ string
  :(*i) _ string
}
  

/ Handles one line of input at a time according to table schema above.
/ A little more space efficient than the previous version.
processline:{[line]
 emptyorblank:{[line] (0 = #delendblanks[line])}
 emptyflag: emptyorblank[line]
 if[~emptyflag
  justempty:: 0
  if[line[0] = "#"
    newline: delendblanks'getfields[1 _ line]
    if[(1 < #newline) 
	globaltable:: *newline
	alltables,: ` $ globaltable
	globalatt:: 1 _ newline
	globalvals:: () / initialize
    ] / newline
  ] / table declaration
  if[~ line[0] = "#"
	newline: delendblanks'getfields[line]
	if[(1 < #newline) | (0 < #newline[0]) 
		globalvals,: ,newline
	]
  ] / end of test on data line
 ] / end of test on line is non-empty
 if[emptyflag & (0 = justempty)
	justempty:: 1
	/ if[ ((#globalatt) > 1) | (~ (#globalatt) = (^globalvals)[0])
		/ globalvals:: + globalvals / take transpose
	/ ]
	if[~ (#globalatt) = (#globalvals[0]); !-1]
	numericflag:()
	hh: 0
	while[hh < #globalatt
		numericflag,: isnumeric[globalvals[;hh]] 
		hh+: 1
	]
	inindex:: 0
	while[inindex < #globalatt
	  / table.att[2] :: ` $ vals[3]
		string1: globaltable, (".")
		string1,: globalatt[inindex]
		teststring: ("0 = #"), string1
		/ For big sizes, uncomment following two statements
		/ User will get an error message, but it will work.
		/ x: @[.: ; teststring; :]
		/ currentlyempty: *x / error means empty
		/ For big sizes, comment following
		currentlyempty: 1
		/ if empty, then assign, else append
		string2: :[currentlyempty; (":: "); (",: ")] 
		string3: :[0 = numericflag[inindex]
			"` $ "
			:[1 = numericflag[inindex]; "0.0 $ "; "0 $ "]]
		string3,: "globalvals[;inindex]"
		. string1, string2, string3
		if[1 = numericflag[inindex]
			globalindexes:: & 0N = . string1
			string3: string1, ("[globalindexes] :: `")
			. string3
		]
		inindex+: 1
	]
 ]
}


inputfromfile:{[filename]
  a: 0: filename
  globalvals:: ()
  oldglobalvals:: () / should not be necessary
  justempty:: 1 / as if we've just seen an empty line. should process table
  a,: ,,""
  processline'a
  makeintlist:{[line] (. $line), ()}
  tree.childrenid:: makeintlist'tree.childrenid
}


/ DISPLAY.


spit:{[list] 
  numtodisplay: 6
  x: ($list) ,\: " "
  out: ()
  while[0 < #x
	if[~ numtodisplay < #x
		out,: , (-1) _ ,/x
		x: ()
	]
	if[numtodisplay < #x
		out,: , (-1) _ ,/x[!numtodisplay]
		x: 6 _ x
	]
  ]
  :out
}

/ spit out a tree in indented notation.
dumptree:{[treeid]
  out: ,("Tree: "), ($treeid)
  indent: 0
  ikids: & tree.treeid = treeid
  parent: tree.parentid[ikids]
  children: tree.childrenid[ikids]
  ilab: & treelabel.treeid = treeid
  node: treelabel.nodeid[ilab]
  label: treelabel.label[ilab]
  if[0 < #parent
    out,: dump[parent; children; node; label; 0; 0]
  ]
  if[0 = #parent
    out,: ,($label)
  ]
  :out
}

/ this is a recursive procedure
/ that does the actual printing where the level of indent is proportional 
/ to the level.
dump:{[parent; children; node; label; parentindex; indent]
  i: parentindex
  out: ,(indent # " "), ($label[node ? parent[i]]),(" ("),($parent[i]),(")")
  indent+: 2
  j: 0
  while[j < #children[i]
	kid: children[i;j]
	ii: parent ? kid
	if[ii < #parent
		out,: dump[parent; children; node; label; ii; indent]
	]
	if[ii = #parent
		out,: ,(indent # " "), ($label[node ? kid]),(" ("),($kid),(")")
  	]
	j+: 1
  ]
  :out
}

/ END OF DISPLAY.

/ When filename is missing, then input data comes from tempin
/ When num is missing then an all-pairs test is done. 
/ Otherwise the first num trees from the input data are compared against
/ all the others.
/ k pathfix 
/ (+b <filenamefordb> | +q <queryfilename> +d <dbfilename> | +f <filename> )
/ [+m <maxdist> (0)] 
/ Program pathfix with the +b <filename> will build a database
/ from that file and will call it <filename>db with an extension
/ (either .l or .K depending on the operating system but you need not care).
/ Program pathfix with the +q queryfilename and +d dbfilename
/ will take the query file and the database file already produced (but
/ you need not specify the extension) and find out where the query is
/ in the database.
/ Program pathfix with the +f <filename> will take a file using
/ the normal tree format and will compare the first tree against all others.
/ +m maxdist is the maximum number of differences to allow.
/ +dumpquery says to dump the query tree in indented form 
/ to the file dumpedquery 
/ +dumpdb says to dump the database trees in indented form to the file dumpeddb
/ In this case, a difference is a path that is different.
opcode: `"randtree" / only for testing but that's the default
output: "data.out"
maxdist: 0
filenamefordb: "treein"
filename: "treein"
dbfilename: "dbfile" / should have a .l or .K extension, but take this away
queryfilename: "queryfile"
dumpqueryflag: 0
dumpdbflag: 0
dumpmatchflag: 0
specifictreeflag: 0 / if 1 then we are looking for specific trees
/ and they will be placed in specifictreefile
specifictreeindexes: () / to be used in findinsuf
specifictreefile: "ljlkjl"
specifictrees: () / list of trees to look at
okflag: 1
processargs:{[args]
  i: 0
  numinputs: 0
  while[(i < #args) & okflag
	okflag:: 0
	addextra: 0
	if[args[i] ~ "+n"
		opcode:: `"no-op"
		okflag:: 1
	]
	if[args[i] ~ "+b"
	    if[(i+1) < #args
		filenamefordb:: args[i+1]
		addextra: 1
		numinputs+: 1
		opcode:: `"+b"
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+q"
	    if[(i+1) < #args
		queryfilename:: args[i+1]
		addextra: 1
		numinputs+: 0.5
		opcode:: `"+d"
		okflag:: 1
	    ]
	]
	if[args[i] ~ "+d"
	    if[(i+1) < #args
		dbfilename:: args[i+1]
		kk: dbfilename ? "."
		if[kk < #dbfilename; dbfilename@: !kk]
		addextra: 1
		numinputs+: 0.5
		opcode:: `"+d"
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+f"
	    if[(i+1) < #args
		filename:: args[i+1]
		addextra: 1
		numinputs+: 1
		opcode:: `"+f"
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+o"
	    if[(i+1) < #args
		output:: args[i+1]
		addextra: 1
		numinputs+: 1
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+s"
	    if[(i+1) < #args
		specifictreefile:: args[i+1]
		specifictrees:: 1: specifictreefile
		specifictreeflag:: 1
		addextra: 1
		numinputs+: 1
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+m"
	    if[(i+1) < #args
		maxdist:: 0 $ args[i+1]
		addextra: 1
		okflag:: 1
	    ]
	] 
	if[args[i] ~ "+dumpquery"
		dumpqueryflag:: 1
		okflag:: 1
  	]
	if[args[i] ~ "+dumpdb"
		dumpdbflag:: 1
		okflag:: 1
  	]
	if[args[i] ~ "+dumpmatch"
		/ the matching tree
		dumpmatchflag:: 1
		okflag:: 1
  	]
	i+: 1 + addextra
	if[0 = okflag
		x: "format is k pathfix \n "
		x,: " (+b <filenamefordb> | +q <queryfilename> +d <dbfilename> | +f <filename> ) \n "
		x,: " [+m <maxdist> (0)] [+dumpquery] [+dumpdb] [+dumpmatch]\n "
		x,: " [+o <outputfile>] \n "
		x,: " [+n ] \n "
		x,: " e.g. k pathfix +b treein \n "
		x,: "   will form treeindb.l or treeindb.K \n" 
		x,: " k pathfix +n \n"
		x,: "   means load the functions by themselves.\n"
		x,: " k pathfix +q treeinquery +d treeindb \n"
		x,: "   will find the query tree in treeindb.\n"
		x,: " k pathfix +q treeinquery +d treeindb +o fooout\n"
		x,: "   will find the query tree in treeindb and put output in fooout.\n"
		x,: " k pathfix +q treeinquery +d treeindb +dumpquery +dumpmatch +m 2\n"
		x,: "   will find the query tree in treeindb within distance 2, put the\n"
		x,: "   query tree in indented form in the file dumpedquery,\n"
		x,: "   and the matching trees in the file dumpedmatch.\n"
		x,: " k pathfix +f treein +dumpquery +dumpdb +m 2\n"
		x,: "   will make the first tree in treein be the query tree;\n"
		x,: "   the remaining trees in treein be the database trees; \n"
		x,: "   will look for the query tree among the database trees\n"
		x,: "   within distance 2; will put the query tree in\n"
		x,: "   indented form in the file dumpedquery, and the\n"
		x,: "   database trees in indented form in the file dumpedb.\n"
		x,: "   NB. When using +q and +d, the database trees are not available to be dumped.\n"
		x,: "   NB. Similarly when using +b, there is no query tree.\n"
		` 0: x
		. "\\\\"
	]
  ]
}

/ STATISTICS

avg:{(+/ x) % # x}
var:{avg[_sqr x] - _sqr avg[x]}
std:{_sqrt var[x]}
cov:{avg[x * y] - avg[x] * avg[y]}
corr:{ (cov[x;y])%((std[x]) * (std[y]))}
/ END OF STATISTICS
/ SUFFIX TREE PART

/ Construct a suffix array of a string and a tree_node id of the
/ beginning of the suffix.
/ This consists of forming all the suffixes, then then sort them.
/ Create an array that holds only the pointers to the array.

/ form suffixes and then spit out the array as symbols
/ with associated node positions.
makesuffixold:{[string; treevec]
  cuts: !(#string) 
  out: cuts _\: string
  j: <out
  out@: j
  :(out; treevec[j])
}

/ creates a suffix array
/ Greg Heil move.
makesuffixold2:{[string]
  l: #string
  : < (!l) _\: string
}

/ creates a suffix array
/ Greg Heil move, but with an improvemnt to take advantage that
/ for trees, we need not compare more than the max path length.
/ This one attempts to do the further optimization of making rs even smaller
/ than maxlen
/ Doesn't work that well.
makesuffixopt:{[maxlen; string]
  if[0 / set to 1 for testing
	string: "abcabdabbabaz"
	maxlen: 4
  ]
  s: #string
  rs: _ maxlen % 2
  if[s < rs
    : < (!s) _\: string
  ]
  x: string[(!(s-rs)) +\: !rs],string[(s-rs)+(!rs)+'!:'(rs-!rs)]
	/ all strings of length rs (or less towards the end)
  firstshot: < x / indexes in order
	/ some may be equal however and those should be extended.
  y: x[firstshot]
  equals: 0, ~': y / put a 1 in an index location i 
		/ if x[firstshot[i]] = x[firstshot[i-1]]
  firstshot: betterorder[firstshot; equals; string; maxlen]
  :firstshot
}

/ equals has a 1 in location i 
/ if x[firstshot[i]] = x[firstshot[i-1]]
/ (see calling routine)
/ we have to break the ties.
betterorder:{[firstshot; equals; string; maxlen]
  ii: & equals / indexes in firstshot such that for each index i in ii
	/ x[firstshot[i]] = x[firstshot[i-1]]
  j: 0
  while[j < #ii
	equivclass: (firstshot[ii[j]-1], firstshot[ii[j]])
	equivclassinds: (ii[j]-1; ii[j])
		/ where in firstshot this is
	flag: 1
	while[flag
	  flag: 0 / don't repeat unless there is good reason to
	  if[(j+1) < #ii
		if[ii[j+1] = ii[j]+1
			equivclass,: firstshot[ii[j+1]]
			equivclassinds,: ii[j+1]
			flag: 1
			j+: 1	
		]
	  ]
	]
	firstshot[equivclassinds]: reorder[equivclassinds; equivclass
		string; maxlen; firstshot]
	  / now have to break the ties ??? unfinished, too involved
    	j+: 1
  ]
  :firstshot
}

/ look at string at locations equivclass up to maxlen
/ return the sort order and rearrange equivclassinds that way
reorder:{[equivclassinds; equivclass; string; maxlen; firstshot]
  tocompare: ()
  slen: #string
  i: 0
  while[i < #equivclass
	ind: equivclass[i]
	len: (slen - ind) & maxlen
	tocompare,: ,string[ind+!len]
	i+: 1
  ]
  jj: < tocompare
  :firstshot[equivclassinds[jj]]
}


/ creates a suffix array
/ Greg Heil move, but with an improvement to take advantage that
/ for trees, we need not compare more than the max path length.
makesuffix:{[maxlen; string]
  s: #string
  rs: maxlen
  if[s < rs
    : < (!s) _\: string / form suffixes of whole string
  ]
  x: string[(!(s-rs)) +\: !rs],string[(s-rs)+(!rs)+'!:'(rs-!rs)]
	/ all strings of length rs (or less towards the end)
  :< x / indexes in order
}


/ here we cut off each suffix at the next delimiter character
makesuffixopt2:{[maxlen; string]
  delimiter: "#"
  s: #string
  i: 0
  out: ()
  while[i < s
	jj: string ? delimiter
	if[0 < jj; out,: ,string[!jj]]
	string: 1 _ string
	i+: 1
  ]
  :< out / indexes in order
}

/ here we cut off each suffix at the next delimiter character
/ still not better
makesuffixopt3:{[maxlen; string]
  delimiter: `"#"
  ii: & string = delimiter
  s: #string
  i: 0
  out: ()
  while[i < s-1 / because last one is #
	if[i = ii[0] 
		ii: 1 _ ii
	]
	if[~ i = ii[0] 
		out,: ,string[i+!(ii[0] - i)]
	]
	i+: 1
  ]
  :< out / indexes in order
}

/ returns 0 if s1 = s2, 1 if s1 > s2 and -1 if s1 < s2
/ assumes we have strings
strcmp:{[s1; s2]
  if[s1 ~ s2; :0]
  if[((s1 < s2) ? 1) < ((s1 > s2) ? 1); :-1]
  :1
}

strcmp:{[s1; s2]
  f1: (s1 < s2) ? 1
  f2: (s2 < s1) ? 1
  / if[f1 < f2; :-1]
  / if[f2 < f1; :1]
  / :0
  : :[f1 < f2; -1; f2 < f1; 1; 0] 
}

/ are there count indexes between any two dividing points
/ assume indexes are sorted.
isenough:{[indexes; divides; count; query; labelarray]
   while[0 < #divides
	ii: #& indexes < divides[0]
	if[~ count > ii; :1] / there are enough
	divides: 1 _ divides
	indexes: ii _ indexes
   ]
   :0
}

foundit:{[indexes; divides; count; query; labelarray]
   out: ()
   while[0 < #divides
	ii: & indexes < divides[0]
	mycount: #ii
	if[~ count > mycount
		if[(query ~ labelarray[indexes[ii]]) & (1 = |/ -': indexes[ii])
			out,: *indexes[ii]
		]
	]
	divides: 1 _ divides
	indexes: mycount _ indexes
   ]
   :out
}

/ Find out which roots have the query string
/ The query is in string notation.
/ suffixarray is a triple (allstring; sufarray; rootsufarray)
/ consisting of a string and indexes into that string
/ where nodes begin.
/ roots are the places in the tree where a path begins.
/ The trouble is that we need a sufarray and a rootsubarray because
/ the indexes for the roots are different from in the string since each root
/ represents only the beginning of a string.
/ DOESN'T QUITE WORK
searchintreenew:{[query; suffixarray; rootsintree]
  flag: 1
  if[optflag / try to filter out this part of suffixarray
	if[(1 = #query)
		ind: & suffixarray[0] = query[0]
		:rootsintree[ind]
		/ that's all you need
	]
	if[(1 < #query)
		/ linear time on suffixarray[0]
		indexes: intersectleftindexes_multi[suffixarray[0]; query]
		indexes@: <indexes
		divides: & suffixarray[0] = `"#"
		/ divides@: <divides / already sorted
		/ flag:isenough[indexes; divides; #query; query; suffixarray[0]]
		ind: foundit[indexes; divides; #query; query; suffixarray[0]]
    		:rootsintree[ind]
	]
  ]
  if[flag
    ind: searchstring[query; suffixarray]
    :rootsintree[ind]
  ]
}

searchintree:{[query; suffixarray; rootsintree]
  flag: 1
  if[optflag / try to filter out this part of suffixarray
	if[(1 = #query)
		flag: query[0] _in suffixarray[0] 
		/ that's all you need
	]
	if[(1 < #query)
		/ linear time on suffixarray[0]
		indexes: intersectleftindexes_multi[suffixarray[0]; query]
		indexes@: <indexes
		divides: & suffixarray[0] = `"#"
		/ divides@: <divides / already sorted
		flag:isenough[indexes; divides; #query; query; suffixarray[0]]
	]
  ]
  if[~ flag
	:()
  ]
  if[flag
    ind: searchstring[query; suffixarray]
    :rootsintree[ind]
  ]
}

/ given a query and roots, find paths that match that query
/ and start in those roots
locateintreeold:{[query; rootnodes; suffixarray; rootsintree]
   ind: searchstring[query; suffixarray] / indexes where query paths begin
   starting: rootsintree[ind] / the root nodes of each matching path 
   ind@: intersectleftindexes[starting ; rootnodes]
	/ only want to keep indexes where the rootnodes match
	/ starting locations where the whole structure matches
   sizeof:{[x] i: & query = "_"; 1 + #i}
   x: rootsintree[ind +\: !(sizeof[query])]
   :x 
}

locateintreeold:{[queries; rootnode; suffixarray; rootsintree]
   ind: searchstring[; suffixarray]'queries / indexes where query paths begin
   / lengths: 1 + #:' &:' (queries =\: "_") / lengths of how long to match
   lengths: 1 + |/#:' queries / how long to match
   findgood:{[rootnode;rootsintree;list]
	i: & rootsintree[list] = rootnode
	:list[*i]
   }
   ind: findgood[rootnode; rootsintree]'ind
	/ indexes consistent with roots
   x: findnodes[ rootsintree]'[ind;lengths]
   out:,?,/x
   :out
}

locateintree:{[queries; rootnode; suffixarray; rootsintree]
   ind: searchstring[; suffixarray]'queries / indexes where query paths begin
   / lengths: 1 + #:' &:' (queries =\: "_") / lengths of how long to match
   lengths: 1 + |/#:' queries / how long to match
   findgood:{[rootnode;rootsintree;list]
	i: & rootsintree[list] = rootnode
	:list[i]
   }
   ind: ,/findgood[rootnode; rootsintree]'ind
	/ indexes consistent with roots
   x: findnodes[ rootsintree]'[ind;lengths]
   out:,?,/x
   :out
}

/ for each index
findnodes:{[rootsintree; index; length] 
	:rootsintree[index+!(length & ((#rootsintree) - index))]
}

/ search for query in suffixarray
/ and find the positions of that query in suffixarray
/ Note that suffixarray is a pair (sequence of symbols; indexes in order)
searchstring:{[query; suffixarray]
  rootindexes: suffixarray[1] / should be same as indexes
  indexes: suffixarray[1]
  dbstring: suffixarray[0]
  qlen: #query
  dlen: #dbstring
  out: ()
  while[0 < #indexes
	half: _ (#indexes) % 2
	testind: indexes[half]
	roottestind: rootindexes[half]
	if[~ (testind + qlen) > dlen 
		/ e.g. if testind = 8, qlen = 1, and dlen = 9, then ok
		x: strcmp[query; dbstring[testind+!qlen]]
		if[x = 1 / query is too big
			indexes: (1 + half) _ indexes
			rootindexes: (1 + half) _ rootindexes
		]
		if[x = -1
			indexes@: !half
			rootindexes@: !half
		]
		if[x = 0  / match, may be one of several places
			/ all are consecutive
			out,: roottestind
			jj: half+1
			flag: 1
			while[(jj < #indexes) & (flag)
				testind: indexes[jj]
				roottestind: rootindexes[jj]
				if[(testind + qlen) > dlen
					flag: 0 / no way
				]
				if[~ (testind + qlen) > dlen
					x: strcmp[query;dbstring[testind+!qlen]]
					:[x = 0
						out,: roottestind
						flag: 0]
				]
				jj+: 1
			]	
			jj: half-1
			flag: 1
			while[(jj > -1) & (flag)
				testind: indexes[jj]
				roottestind: rootindexes[jj]
				if[(testind + qlen) > dlen
					flag: 0 / no way
				]
				if[~ (testind + qlen) > dlen
					x: strcmp[query;dbstring[testind+!qlen]]
					:[x = 0
						out,: roottestind
						flag: 0]
				]
				jj-: 1
			]	
			indexes: ()
			rootindexes: ()
		]
		
	]
	if[(testind + qlen) > dlen
		endlen: dlen - testind / distance in dbstring
		x: strcmp[query[!endlen]; dbstring[testind+!endlen]]
		if[x > (-1)
			/ even if 0, this is not a good position
			/ since we had to shorten query
			/ so real match should be greater
			indexes: (1 + half) _ indexes
			rootindexes: (1 + half) _ rootindexes
		]
		if[~ x > (-1)
			indexes@: !half
			rootindexes@: !half
		]
	]
  ]
  :out
}


/ END SUFFIX TREE PART


/ PATHFIX PROCESSING


/ take a path of numbers and extend it through all its children
extendpath:{[numberpath; parent; children]
  i: parent ? *|numberpath
  if[i = #parent; :,numberpath]
  if[1 = #children[i]
	y:  ,/extendpath[;parent; children]',numberpath,children[i]
	:y
  ]
  allpaths: numberpath,/:children[i] 
  x:  ,/extendpath[; parent;children]'allpaths
  :x
}
  

/ list of labels
formlab:{[x] :x}

/ assuming a symbol list is going to be extended to strings with _ in between,
/ here are the beginnings of the strings
findbeginnings:{[symblist]
  slist: $symblist
  counts: 1 + #:'slist
  :0, +\ (-1) _ counts
}

/ single label that is one long label
/ formlabnew:{[x]  y: (-1) _ ,/ ($x) ,\: ("_");  :y}
formlabnew:{[x]  y:  ("_") , (-1) _ ,/ ($x) ,\: ("_");  :y} 
	/ do want initial _ but not trailing one

/ take a path of node ids and find the corresponding labels
converttolabel: {[labellabels; labelnodeids; path]
  i: intersectleftindexes[labelnodeids; path,()]
  :formlab[labellabels[i]]
}

/ take a path of node ids and find the corresponding labels
addtreeid:{[treeid;path]
  : ` $ ($treeid) ,/: ("_") ,/: ($path)
}

/ return all the paths for this tree id
/ and then convert this to a suffix tree
/ In detail: each path is formed; let us say it is of length k.
/ Then there will be k node ids one for each starting point. 
/ e.g. if there is a path from root to leaf in T consisting of letters
/ a b c and going through nodes 0 3 5, then the result of the pah
/ will be (a, b, c), (T_a, T_b, T_c).
/ Then the suffix paths will be `c T_c, `bc T_b, `abc T_a.
producepaths:{[treeid]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j] 
  labellabels: treelabel.label[j] 
  root: * differ[parent; ,/children]
  paths: extendpath[root; parent; children]
	/ For Cole method, must extend from every node along the
	/ spine of the query tree.
  labels: converttolabel[labellabels; labelnodeids]'paths
  newpaths: :[suffixarray; addtreeid[treeid]'paths; paths]
	/ append numbers to trees or not
  / x: makesuffix'[labels; newpaths]
  / :x
  :(labels;newpaths)
}

/ In case where we have variable length don't cares in query tree,
/ form a partial tree given a bunch of nodes. Label this tree by its root.
/ Otherwise, algorithm is similar to producepaths.
producepaths_partial:{[treeid; root; nodes]
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  mm: intersectleftindexes_multi[parent;nodes]
  parent@: mm
  children@: mm
  children: intersect[nodes]'children / don't go beyond certain children
  counts: #:' children
  mm2: & 0 < counts
  parent@: mm2
  children@: mm2
  j: & treelabel.treeid = treeid
  labelnodeids: treelabel.nodeid[j] 
  labellabels: treelabel.label[j] 
  paths: extendpath[root; parent; children]
  labels: converttolabel[labellabels; labelnodeids]'paths
  newpaths: :[suffixarray 
	addtreeid[` $ ($treeid), ("+"), ($root)]'paths 
	paths]
	/ append numbers to trees or not
  / x: makesuffix'[labels; newpaths]
  / :x
  :(labels;newpaths)
}

/ cut up a tree having vldcs and question marks.
findpartitions:{[treeid]
  partroot:: () / have a root for each partition
  partnodes:: () / and a bunch of nodes
  i: & tree.treeid = treeid
  parent: tree.parentid[i]
  children: tree.childrenid[i]
  i: & treelabel.treeid = treeid
  nodes: treelabel.nodeid[i]
  labels: treelabel.label[i]
  root: * differ[parent; ,/children]
  partroot,: root / this can be a *???
  partnodes,: ,root
  rootlab: $ labels[nodes ? root]
  dontcareflag:: ("*") = rootlab[0] 
  dontcareflag|: ("?") = rootlab[0] 
  if[~dontcareflag
    recurfindpartitions[root; parent; children; nodes; labels; root]
  ]
  if[dontcareflag / root is a don't care
     j: parent ? root
     kids: children[j]
     kids: ,//finddontcare[parent;children;nodes;labels]'kids
     partroot,: kids
     partnodes,: ,:'kids
	/ assumes no dontcares among immediate children of root
	/ (march8)
	/ If we wanted that, then we could recursively look for kids
	/ that are don't cares and make them further separate partitions
	/ by putting them in partroot and partnodes. Then we would
	/ also need to change other places where we handle dontcareflag.
     recurfindpartitions[; parent; children; nodes; labels; ]'[kids;kids]
  ]
  :(partroot; partnodes)
}

/ find the non-don't cares underneath mynode 
finddontcare:{[parent;children;nodes;labels; mynode]
  out: ()
  j: nodes ? mynode
  mylab: $labels[j]
  flag: ("*") ~ mylab[0]
  flag|: ("?") ~ mylab[0]
  if[~ flag
	:mynode
  ]
  if[flag
	jj: parent ? mynode
	if[jj < #parent
		mykids: children[jj]
		out,:,/finddontcare[parent;children;nodes;labels]'mykids
	]
  ]
  :out
}

/ for a tree that has vldcs, divide up the tree
/ by identifying root and partitions
/ This is starting a new one at root.
recurfindpartitions:{[currentroot; parent; children; nodes; labels; next]
  j: parent ? next
  if[j = #parent; :,(root;root)] / root is by itself
  mm: partroot ? currentroot
  kids: children[j]
  i: 0
  labelind: nodes ?/: kids
  while[i < #kids
	kidlab: $ labels[labelind[i]]
	flag: ("*") = kidlab[0] / we don't care about vldcs having counts
	flag|: ("?") = kidlab[0] / we don't care about vldcs having counts
		/ if we want to match *[3] meaning three characters,
		/ then put in three question marks.
	if[flag
		jj: parent ? kids[i] / want children of this
		if[jj < #parent
			grandkids: children[jj]
			kk: 0
			/ Each grandkid starts its own tree which
			/ is what we want.
			/ If a grandkid is a don't care then skip to next gen
			while[kk < #grandkids
			   x: nodes ? grandkids[kk]
			   kidlab: $ labels[x]
			   flag2: ("*") = kidlab[0] 
			   flag2|: ("?") = kidlab[0] 
			   if[~flag2
				partroot,: grandkids[kk]
				partnodes,: ,grandkids[kk]
				recurfindpartitions[grandkids[kk]; parent; children; nodes; labels; grandkids[kk]]
			   ]
			   if[flag2
				/ skip to next generation
				jj1: parent ? grandkids[kk]
				if[jj1 < #parent
					grandkids,: children[jj1]
				]
			   ]
			   kk+: 1
			]
		]
	]
	if[~ flag
  		partnodes[mm],: kids[i]	
		recurfindpartitions[currentroot; parent; children; nodes; labels; kids[i]]
	]
	i+: 1
  ]
  :out
}

/ for a tree that has vldcs, divide up the tree
/ by identifying root and partitions
/ This is starting a new one at root.
recurfindpartitionsold:{[currentroot; parent; children; nodes; labels; next]
  j: parent ? next
  if[j = #parent; :,(root;root)] / root is by itself
  mm: partroot ? currentroot
  kids: children[j]
  i: 0
  labelind: nodes ?/: kids
  while[i < #kids
	kidlab: $ labels[labelind[i]]
	flag: ("*") = kidlab[0] / we don't care about vldcs having counts
	flag|: ("?") = kidlab[0] / we don't care about vldcs having counts
		/ if we want to match *[3] meaning three characters,
		/ then put in three question marks.
	if[flag
		jj: parent ? kids[i] / want children of this
		if[jj < #parent
			grandkids: children[jj]
			kk: 0
			while[kk < #grandkids
				partroot,: grandkids[kk]
				partnodes,: ,grandkids[kk]
				recurfindpartitions[grandkids[kk]; parent; children; nodes; labels; grandkids[kk]]
				kk+: 1
			]
		]
	]
	if[~ flag
  		partnodes[mm],: kids[i]	
		recurfindpartitions[currentroot; parent; children; nodes; labels; kids[i]]
	]
	i+: 1
  ]
  :out
}

/ Sept 2000, perhaps don't include the rtree prefix it takes too much space.
/ What this produces is a pair consisting
/ of a string of all the paths in the tree,
/ a list of start locations within that string such that the list
/ is lexicographically ordered.
/ Following the pair are the locations according to preorder where
/ the path begins. 
suffixproducepaths:{[treeid]
  pair: producepaths[treeid]
  if[0
   pair[0]: (`a `b `c
	`a `d
	 `a `e `f)
   pair[1]: (`"0" `"1" `"2"
	`"0" `"3"
	`"0" `"4" `"5")
  ]
  / :makesuffixold'[pair[0]; pair[1]]
  / sufarrays: makesuffix'pair[0]
  / roots: pair[1] @' sufarrays
  maxlen: |/#:' pair[0] / maximum length of any reasonable suffix
		/ that is any suffix not stopping at a #
  allsymbs: ,/ pair[0] ,\: (`"#")
  allstring: formlabnew[allsymbs] / string with _ in between
  stringind: findbeginnings[allsymbs]
  rootsufarray: makesuffix[maxlen; allsymbs] / done at the level of symbols
	/ gives an array in lexicographic order
  x: ,/ pair[1] ,\: `"#" / pair[1] are locations in the tree of nodes
		/ either in preorder or someorder
  :((allsymbs; rootsufarray); x) / early return
  / roots: x @ rootsufarray / roots according to lexicographic order of paths
  roots: x  / don't rearrange roots, since you will get a pointer.
  sufarray: stringind[rootsufarray]
  numstops: # pair[0] 
  / all of the following cuts assume that the delimiter comes first!!
  / possible bug in the future ???
  sufarray: numstops _ sufarray
  rootsufarray: numstops _ rootsufarray
  / note, don't take away initial roots (the ones that match the delimiter)
	/ to ensure that rootsufarray still works.
  :( (allstring; sufarray; rootsufarray); roots)
}

/ extend a suffix tree given paths and start nodes
buildsuffixtree:{[treeid; paths; startlists] 
  buildone[treeid]'[,/paths;,/startlists]
}

/ actually build the suffix tree
/ Start at the anchor and put the startlist there and then descend.
/ suffixtree.nextchars: () / an array of suffix tree chars
/ suffixtree.nextnodes: () / an array of suffix tree nodes
/ suffixtree.startnodes: () / nodes in the original trees
/ suffixtree.starttrees: () / original tree name
/ the branch from the root of the suffix tree
/ anchorsuffixtree.nextchar: () / first char of a path
/ anchorsuffixtree.nextnode: () / next place to look
buildone:{[treeid; path; startlist]
  p: path
  first: *p
  numstarts: #startlist
  ii: anchorsuffixtree.nextchar  ? first
  if[ii = #anchorsuffixtree.nextchar
	/ not present
	anchorsuffixtree.nextchar,: first
	globalnextnode+: 1
	anchorsuffixtree.nextnode,: globalnextnode
		/ initialize point in suffix tree
	suffixtree.nextchars,: ,()
	suffixtree.nextnodes,: ,()
	suffixtree.startnodes,: , ()
	suffixtree.starttrees,: , ()
  ]
  nextindex: anchorsuffixtree.nextnode[ii]
  p: 1 _ p
  while[0 < #p
	first: *p
	chars: suffixtree.nextchars[nextindex]
	kk: chars ? first
	if[kk = #chars
		suffixtree.nextchars[nextindex],: first
		globalnextnode+: 1
		suffixtree.nextnodes[nextindex],: globalnextnode
		/ initialize point in suffix tree
		suffixtree.nextchars,: ,()
		suffixtree.nextnodes,: ,()
		suffixtree.startnodes,: , ()
		suffixtree.starttrees,: , ()
		newindex: globalnextnode
	]
	if[kk < #chars
		newindex: suffixtree.nextnodes[nextindex; kk]
  	]
	nextindex: newindex
	p: 1 _ p
  ]
  suffixtree.startnodes[nextindex],: startlist
  suffixtree.starttrees[nextindex],: numstarts # treeid
}

/ suffixtree.nextchars: () / an array of suffix tree chars
/ suffixtree.nextnodes: () / an array of suffix tree nodes
/ the branch from the root of the suffix tree
/ anchorsuffixtree.nextchar: () / first char of a path
/ anchorsuffixtree.nextnode: () / next place to look
/ anchorsuffixtree.startnodes: () / nodes in the original trees
/ anchorsuffixtree.starttrees: () / original tree name
reorganizesuffixtree:{[]
  i: < anchorsuffixtree.nextchar
  anchorsuffixtree.nextnode@: i
  anchorsuffixtree.nextchar@: i
  jj: 0
  while[jj < #suffixtree.nextchars
	if[1 < #suffixtree.nextchars[jj]
		x: < suffixtree.nextchars[jj]
		suffixtree.nextchars[jj]@: x
		suffixtree.nextnodes[jj]@: x
		/ don't need to change other suffix tree terms
	]
	jj+: 1
  ]
}

/ is s1 a prefix of s2
isprefix:{[s1; s2]
  if[(#s1) > #s2; :0]
  :s1 ~ s2[!#s1]
}

/ find indexes of local superstrings
findsuperstrings:{[paths]
  jj: 0
  supers: ()
  while[jj < (#paths) - 1
	if[~ prefix[paths[jj]; paths[jj+1]]
		supers,: jj
	]
	jj+: 1
  ]
  supers,: jj / last one
}

/ creates structure of form (path; endpositions in path; starts
/ for those end positions)
reducetolongest:{[paths ; starts]
  outpath: ()
  outpositions: ()
  outstarts: ()
  superstrings: findsuperstrings[paths]
  groups: superstrings _ !#paths
  jj: 0
  while[jj < #groups
	outpath,: ,paths[*|groups[jj]]
	
	
	jj+: 1
  ]
  :(outpath; outpositions; outstarts)
}

/ END OF PATHFIX tree
/ SUFFIX ARRAY SEARCH
/ Suffix array is in suf

/ At this point we handle wild cards separately.
/ So we won't see any here.
/ JULY 2001: 1. If we have filter information, then look only in
/ parts of the suffix tree where the relevant tree is present.
/ 2. If we know the threshold and a particular path is not
/ present, then mark that tree as of distance at least 1, so
/ perhaps need not be looked at.
/ Sort the query paths by length to take full advantage of this.
/ suf.starts has the trees
findinsuf:{[labelpath]
  if[doneflag; :()] / we know there are no matches
  if[specifictreeflag
	if[0 = #specifictreeindexes
	  specifictreeindexes:: intersectleftindexes[suf.treeids; specifictrees]
        ]
	ii: specifictreeindexes
	x: searchintree[labelpath]'[suf.paths[ii]; suf.starts[ii]]
  	counts: #:' x
  	jj: & counts = 0 / no matches
  	suf.diffs[ii[jj]]+: 1
  	/ kk: & suf.diffs > maxdist
  	/ specifictrees:: differ[specifictrees; suf.treeids[kk]]
  	kk: & suf.diffs[ii[jj]] > maxdist / only these can change
	if[0 < #kk
  	  / specifictrees:: differ[specifictrees; suf.treeids[ii[jj[kk]]]]
	  xx: differ[ii; ii[jj[kk]]]
	  specifictreeindexes:: xx
	  if[0 = #xx; doneflag:: 1]
	]
   ]
  if[~ specifictreeflag
  	x: searchintree[labelpath]'[suf.paths; suf.starts]
  ]
  :?,/x
}

/ given that we know where there a root that matches, find the
/ specific nodes that match
locateinsufold:{[rootnodes; labelpath]
   x: locateintreeold[labelpath; rootnodes]'[suf.paths; suf.starts]
   :x
}

treeid:{[x] y: $x; i: y ? "_"; :y[!i]}

locateinsuf:{[rootnode; labelpaths]
   rootnode: * rootnode,() / make sure it's not a list
   treeid:{[x] y: $x; i: y ? "_"; :y[!i]}
   i: ((treeid[rootnode]) ~/: (treeid'*:'suf.starts)) ? 1
   x: locateintree[labelpaths; rootnode;suf.paths[i]; suf.starts[i]]
   :x
}


/ END OF SUFFIX ARRAY SEARCH

/ FILTER FUNCTIONS

/ Update the filter tree counts and tree grups
/ currently based on labels and label pairs.
updatetreecount:{[mytree; group]
    / Now put in groups
    mmm: treegroup.tree ? mytree
    flag: mmm < #treegroup.tree
    if[flag
	treegroup.groups[mmm],: group
    ]
    if[~ flag
	treegroup.groups,: ,group
	treegroup.tree,: mytree
    ]
    / Now do treecount.
    mmm: & mytree _in/: treecount.trees
    mmm1: treecount.trees[mmm] ?\: mytree / positions in each of those
    if[0 < #mmm
     qq:0
     while[qq < #mmm
       treecount.counts[mmm[qq];mmm1[qq]]:: 0 / reset counts to 0.
       qq+: 1
     ]
    ]
    j: & treelabel.treeid = mytree
    labelnodeids: treelabel.nodeid[j] 
    labellabels: treelabel.label[j] 
    i: & tree.treeid = mytree
    parent: tree.parentid[i]
    parentlab: labellabels[labelnodeids ?/: parent]
    children: tree.childrenid[i]
    counts: #:' children
    newpars: ,/counts #' parentlab
    newchil: labellabels[labelnodeids ?/: ,/children]
    parchil: ($newpars) ,' ("_") ,/: ($newchil)
    x: parchil, $labellabels
    part: = x
    counts: #:' part
    uniqs: ?x
    hashinsert[mytree]'[uniqs; counts]
}


/ BIG CHANGE
/ upsert count of mytree in the data structure
hashinsert:{[mytree; string; count]
    x:  _ic'string
    internalstate: 0
    while[4 < #x
        internalstate+: */x[!4]
        internalstate: internalstate ! modval
        x: 4 _ x
    ]
    internalstate+: */x
    i: internalstate ! modval
    xxx: treecount.trees[i]
    jj: xxx ? mytree
    flag: jj < #xxx
    if[flag
	treecount.counts[i;jj]+: count
    ]
    if[~ flag
	treecount.trees[i],: mytree
	treecount.counts[i],: count
    ]
}
    
    
hashinsert:{[mytree; string; count]
    x:  31 _sv _ic string
    i: x ! modval
    xxx: treecount.trees[i]
    jj: xxx ? mytree
    flag: jj < #xxx
    if[flag
	treecount.counts[i;jj]+: count
    ]
    if[~ flag
	treecount.trees[i],: mytree
	treecount.counts[i],: count
    ]
}
    
    
/ END OF FILTER FUNCTIONS

/ SUFFIX TREE SEARCH

/ find a path in the suffix tree
/ suffixtree.nextchars: () / an array of suffix tree chars
/ suffixtree.nextnodes: () / an array of suffix tree nodes
/ suffixtree.startnodes: () / nodes in the original trees
/ suffixtree.starttrees: () / original tree name
/ the branch from the root of the suffix tree
/ anchorsuffixtree.nextchar: () / first char of a path
/ anchorsuffixtree.nextnode: () / next place to look
findinsuffixtree:{[labelpath]
  first: *labelpath
  outnode: () / nodes that match
  outtree: () / trees that match
  i: anchorsuffixtree.nextchar _bin first
  if[i = #anchorsuffixtree.nextchar; :(outnode; outtree)]
  if[~ first = anchorsuffixtree.nextchar[i]; :(outnode; outtree)]
  nextindex: anchorsuffixtree.nextnode[i]
  p: 1 _ labelpath
  while[0 < #p
	first: *p
	i: suffixtree.nextchars[nextindex] _bin first
  	if[i = # suffixtree.nextchars[nextindex]; :(();())]
  	if[~ first = suffixtree.nextchars[nextindex;i]; :(();())]
	nextindex: suffixtree.nextnodes[nextindex;i]
	p: 1 _ p
  ]
  allnodes: ?,/finddesc[nextindex] / find all descendant indexes
  x:(,/suffixtree.starttrees[allnodes]; ,/suffixtree.startnodes[allnodes])
  :x
}

/ find descendant nodes from node id
finddesc:{[nodeid]
  out: nodeid
  out,: ,/ finddesc'suffixtree.nextnodes[nodeid]
  :out
}

/ END OF SUFFIX TREE SEARCH

/ COMPRESSION

/ Go along the suffix tree and find a prefix that gives
/ max compression: at least a minimum length minlength,
/ First node is a huge array with all strings under minlength
/ and their corresponding stop points.
/ If a string in the first node is too short, then there
/ is a forward pointer to the next node.

/ GENERATE DATA

/ generate kids and fill in data structures
genkids:{[treeid; labels; nodeids; rootlabel; root]
  treelabel.treeid,: treeid 
  treelabel.nodeid,: root 
  treelabel.label,: rootlabel 
  numlabels: #labels
  choices: (numlabels) & (#nodeids)
  if[0 = choices; :()]
  kids: nodeids[choices _draw -#nodeids]
  x: *1 _draw #kids / take only some of those
  kids@: !x
  if[0 = #kids; :nodeids]
  kidlabels: labels[(#kids) _draw -numlabels]
  tree.treeid,: treeid
  tree.parentid,: root
  tree.childrenid,: ,kids
  remaining: differ[nodeids; kids] / not available to kids
  i: 0
  while[i < #kids
	remaining: genkids[treeid; labels; remaining; kidlabels[i]; kids[i]]
	i+: 1
  ]
  :remaining
}

/ generate a tree by putting data into
/ tree(`childrenid `treeid `parentid)
/ and treelabel(`treeid `nodeid `label).
gentree:{[size; treeid]
  labels: `a1 `b1 `c1 `d1 `e1
  numlabels: #labels
  nodeids: !size
  root: nodeids[*1 _draw #nodeids]
  rootlabel: labels[* 1 _draw #labels]
  nodeids: nodeids _dv root
  genkids[treeid; labels; nodeids; rootlabel; root]
}

/ generate kids of a subtree
genkidssubtree:{[treeid; labels; nodes; rootlabel; root; parents; children]
  treelabel.treeid,: treeid 
  treelabel.nodeid,: root 
  treelabel.label,: rootlabel 
  i: parents ? root
  if[i = #parents; :()]
  kids: children[i]
  x: 1 + *1 _draw #kids / take only some of those, but at least 1
  kids@: !(x)
  if[0 = #kids; :()]
  ii: intersectleftindexes[nodes; kids]
  kidlabels: labels[ii]
  tree.treeid,: treeid
  tree.parentid,: root
  tree.childrenid,: ,kids
  i: 0
  while[i < #kids
	genkidssubtree[treeid; labels; nodes; kidlabels[i]; kids[i]; parents; children]
	i+: 1
  ]
  :()
}

/ gen an interior subtree of a tree
/ tree(`childrenid `treeid `parentid)
/ and treelabel(`treeid `nodeid `label).
gensubtree:{[treeid]
  j: & tree.treeid = treeid
  if[0 = #j; :()]
  parents: tree.parentid[j]
  children: tree.childrenid[j]
  k: & treelabel.treeid = treeid
  labelnodes: treelabel.nodeid[k]
  labellabels: treelabel.label[k]
  j1: * 1 _draw #parents
  n: parents[j1]
  j: labelnodes ? n
  root: labelnodes[j]
  rootlabel: labellabels[j]
  newtreeid: ` $ ($treeid), ("sub")
  genkidssubtree[newtreeid; labellabels; labelnodes; rootlabel; root; parents; children]
}


/ DATA

treelabel.treeid: ()
treelabel.nodeid: ()
treelabel.label: ()

tree.treeid: ()
tree.parentid: ()
tree.childrenid: ()


/ in this new version of the suffix tree, we have  a
/ node in the suffix tree, an array of nextlet
/ an array of nextnode
/ and an array of start nodes from the original tree
suffixtree.nextchars: () / an array of suffix tree chars
suffixtree.nextnodes: () / an array of suffix tree nodes
suffixtree.startnodes: () / nodes in the original trees
suffixtree.starttrees: () / original tree name

/ the branch from the root of the suffix tree
anchorsuffixtree.nextchar: () / first char of a path
anchorsuffixtree.nextnode: () / next place to look

globalnextnode: -1 / global index of the next node.


/ EXECUTION

/ TIME TESTING HARNESS (REMOVE / if you want this)
/ .time.set`.k
/ start: _t
/ END OF TIME TESTING HARNESS (REMOVE / if you want this)   

text:()
starttime: _t
processargs[_i]
if[(opcode = `randtree) 
	numtrees: 10
	size: 50
	treeids: ` $ ("rtree") ,/: ($!numtrees)
	gentree[size]' treeids
	/ gensubtree'treeids
	text,: ,"Input from random tree generation. "
]
if[opcode = `"+b" 
	inputfromfile[filenamefordb]
	text,: ,("Input from build file "), filenamefordb
]
if[(opcode = `"+d" )
   text,: ,("Input from database file "), dbfilename
   text,: ,("   and from query file "), queryfilename
]
if[opcode = `"+f" 
   inputfromfile[filename]
   text,: ,("Input from file "), filename, (" whose first tree is the query")
]
  if[dumpdbflag
    if[opcode = `"+b"
     "dumpeddb" 0: ,/dumptree'?treelabel.treeid
    ]
    if[opcode = `"+f"
     "dumpeddb" 0: ,/dumptree'(1 _ ?treelabel.treeid)
    ]
  ]

suffixarray: 1
if[(1 = suffixarray) & (opcode _in `"+b" `"+f" `randtree)
/ form suffix array
	if[opcode _in `"+b" `randtree
	  uniqtrees: ? tree.treeid
	  / this part is for the +b case
    	  modval: 997 
    	  treecount.trees: (modval) # ,() / list of trees at a hash loc
    	  treecount.counts: (modval) # ,() / list of counts at a hash lo
    	  treegroup.tree: ()
    	  treegroup.groups: () / this is a list
    	  xx: ` $ filenamefordb, ("db")
    	  / x1: @[.:; "xtreecount: 1: \"treecount\""; :]
    	  x1: @[.:; "1: \"treecount\""; :]
	  if[0 = x1[0]; pair1: . x1[1]]
    	  / x2: @[.:; "xtreegroup: 1: \"treegroup\""; :]
    	  x2: @[.:; "1: \"treegroup\""; :]
	  if[0 = x2[0]; pair2: . x2[1]]
	  if[0 = x1[0] / there was a treecount
		treecount.trees: pair1[0][1]
		treecount.counts: pair1[1][1]
		treegroup.tree: pair2[0][1]
		treegroup.groups: pair2[1][1]
	  ]
	]
	if[opcode = `"+f"
	  uniqtrees: 1 _ ? tree.treeid
	]
  allpaths: ()
  allstarts: ()
  jj: 0
  while[jj < #uniqtrees
    mytree: uniqtrees[jj]
    pair: suffixproducepaths[mytree] / for each tree produce suffixes
    allpaths,: ,pair[0] / path in symbols
    allstarts,: ,pair[1] / path in tree ids
    if[opcode = `"+b"
      / Add elements from mytree to the treecount table.
      / You will process these with +q. 
      updatetreecount[mytree; xx]
    ]
    jj+: 1
  ]
  / part: = allpaths / unneeded
  suf.starts: allstarts
  allstarts: () / cleanup
  suf.paths: allpaths
  if[opcode = `"+b" 
	xx: filenamefordb, ("db")
	xx 1: suf
	zx: fillall'?tree.treeid
	xx: xx, ("stats")
	xx 1: treestats
	"treecount" 1: treecount / filtering stuff
	"treegroup" 1: treegroup
  ]
  / triples: reducetolongest[suf.paths ; suf.starts]
	  / creates structure of form (path; endpositions in path; starts
	  / for those end positions)
  / compressedsuf.path: triples[;0]
  / compressedsuf.positions: triples[;1]
  / compressedsuf.starts: triples[;2]
  / suffix array formed
]
/ suffix tree doesn't follow the options probably since i think it's too slow
if[(0 = suffixarray) & (opcode _in `"+b" `"+f")
	if[opcode = `"+b"
	  treelist: ? tree.treeid
	]
	if[opcode = `"+f"
	  treelist: 1 _ ? tree.treeid
	]
  while[0 < #treelist
	mytree: *treelist
	pair: suffixproducepaths[mytree]
	paths: pair[;0]
	starts: pair[;1]
	part: = paths
	newstarts: ,/' starts[part]
	newpaths: ? paths
	i: < newpaths
	buildsuffixtree[mytree; newpaths[i]; newstarts[i]] 
		/ put data in suffixtree
	treelist: 1 _ treelist
  ]
  reorganizesuffixtree[] / sort everything in the database
]

/ query part
doquery:{[]
  matched: ()
  if[ opcode = `"+d"
    inputfromfile[queryfilename]
  ] 
  if[opcode = `randtree
	/ here are input patterns only used for random testing
	treelabelsize:: #treelabel.treeid
	treesize:: #tree.treeid
	gensubtree[*treelabel.treeid]
	fillall'?tree.treeid
	treelabel.treeid:: treelabelsize _ treelabel.treeid
	treelabel.nodeid:: treelabelsize _ treelabel.nodeid
	treelabel.label:: treelabelsize _ treelabel.label
	tree.treeid:: treesize _ tree.treeid
	tree.parentid:: treesize _ tree.parentid
	tree.childrenid:: treesize _ tree.childrenid
  ]
  dbbuild: _t
  if[opcode = `"+f"
	/ query part
	fillall'?tree.treeid
	querytrees: ,*? tree.treeid
  ]
  if[opcode _in `"+d" `"no-op" `randtree / must make sure you get all the database
	/ in fact we always get an even number
	fillall'?tree.treeid / fill up pre and postorder and level
		/ not clear it's all needed
	querytrees: ? tree.treeid
  ]
 mm: 0
 outdumpmatch: ()
 suf.treeids:: ` $ treeid'*:'suf.starts 
 while[mm < (#querytrees) 
  qtree: querytrees[mm]
  suf.diffs:: (#suf.starts) # 0
  flag: containsvldc[qtree]
  if[~flag
    pair: producepaths[qtree]
    / pair[0], are labels
    / pair[1] etc are node ids
    / labelpaths: formlabnew'pair[0]
    labelpaths: pair[0]
	/ JULY 2001: may want to sort these by length.
	/ ??? I don't think I need to reorder pair[1]
    xcount: #:' labelpaths
    xii: > xcount
    labelpaths@: xii
    doneflag:: 0 / some trees still to look for
    x: findinsuf'labelpaths / find all the start nodes for labels
	/ In the original methods, we are looking for start nodes in
	/ the data tree and we are trying to see whether enough are 
	/ present in the data tree.
	/ For Cole method, the label paths should lead to members
	/ that correspond to a spine of the query tree, so the threshintersect
	/ should check for that.
    / out: multiintersect[x]
    out: ? threshintersect[x; maxdist]
    if[dumpmatchflag
      xx: ,/ locateinsuf[;labelpaths]'out
      jjj: & 0 < #:'xx
      outdumpmatch,: (,/'xx[jjj]) _dv\: `"#"
    ]
    if[~ dumpmatchflag
	out: determineroot'$,/out
    ]
  ]
  if[flag / contains vldc
	dontcareroots:: () / to be used for dumpmatchflag
	/ If root is a * then see whether subtrees match and if so,
	/ then you have a match. If root is a ?, then start with children
	/ and ask each child to give all matches (not just ones to root)
	/ and then see if any combination are all below a single question
	/ mark.
	/ Can partition those all by level.
	/ This means however that we can't have question mark over question
	/ mark or if so, we have to adjust the levels.
	/ But then what happens if there is a ? over a *.
	/ Another strategy: if you see a vldc or ?, 
	/ just make that be a root partition.
	/ Now it matches anything.
	/ Then just proceed as usual.
      pair1: findpartitions[qtree]
      qroot: pair1[0]
      qnodes: pair1[1]
      tempout: ()
      i: 0
      if[dontcareflag = 1 
		i: 1 / means that root is either ? or *, so start at next
		/ allow dontcares under this root node.
		/ If we had them, then could handle however by putting
		/ these same nodes elsewhere
		x: ,? (,//suf.starts) _dv `"#"
		tempout,: ,x
		/ maybe just find where the trees below match and 
		/ then go.
      ]
      alllabelpaths: () / set of labelpaths from each root of
	/ each non-don't care subtree
      while[i < #qroot
	pair: producepaths_partial[qtree; qroot[i]; qnodes[i],()]
        labelpaths: pair[0]
	/ JULY 2001: may want to sort these by length.
	/ ??? I don't think I need to reorder pair[1]
        xcount: #:' labelpaths
        xii: > xcount
        labelpaths@: xii
	alllabelpaths,: ,labelpaths
    	doneflag:: 0 / some trees still to look for
        x: findinsuf'labelpaths / find all the start nodes for labels
    	tempout,: ,? threshintersect[x; maxdist]
	i+: 1
      ]
      pair2: findtreeandroots'tempout
      xx: checkformatch[qtree; qroot; pair2[;0]; pair2[;1]]
      out: xx[0]
      outroots: xx[1]
      out: ($out) ,' ("_") ,/: ($outroots)
      nakedout: ,:' ` $ out
      out: determineroot'out
      if[dumpmatchflag 
	dontcarelocateinsuf:{[roots; alllabelpaths]
          xx: ,/ locateinsuf'[roots;alllabelpaths]
          jjj: & 0 < #:'xx
          out: ,/'xx[jjj]
	  :out
	}
	/ dontcareroots are all the roots matching portions of the
	/ dont care query tree. But these are ones that count.
	/ Each dontcareroots record is a sequence of roots of subtrees
	/ from the same tree.
	if[dontcareflag  / root of query is a dont care
          xx: ,/ dontcarelocateinsuf[;alllabelpaths]'1 _' dontcareroots
	  xx,: nakedout
	]
	if[~ dontcareflag  / root of query is not a dont care
          xx: ,/ dontcarelocateinsuf[;alllabelpaths]'dontcareroots
	]
	outdumpmatch,: xx _dv\: `"#"
      ]
  ]
  if[dumpqueryflag
     "dumpedquery" 0: dumptree[*?treelabel.treeid]
  ]
  / alldumps[0]
  / x
  if[&/(opcode = `"no-op"; 0 < #out)
	matched,: mm
  ]
  if[~ (opcode = `"no-op")
    text,: ,("Total time is "), ( $ _t - starttime), (" seconds.")
    if[opcode _in `"+b" `"+f" `randtree
      text,: ,("Database construction time is "), ( $ dbbuild - starttime), (" seconds.")
    ]
    if[opcode = `randtree
      text,: ,("Number of trees is "), ($numtrees)
      text,: ,("Size of trees is "), ($size)
    ]
    if[0 = #out
  	  text,: , ("There are no matches for the query tree: "), ($qtree)
    ]
    if[0 < #out
  	  text,: , ("Results for "), ($qtree), (" are: ")
	  / text,: spit[out @ < out]
    	  text,: :[dumpmatchflag 
		  ,/spit'[outdumpmatch @ < outdumpmatch] 
		  spit[out @ < out]]
    ]
  ]
  mm+: 1 / to skip to next tree 
 ]
 if[opcode = `"no-op"; :matched]
}

text: ()
if[opcode _in `randtree `"+f"
  doquery[]
]
output 0: text
/ If running as a web server, comment out the following line:
text
/ Joao, comment from here down.
if[~ opcode = `"+d"; . "\\\\"]

 / now take care of +d case
 aax: 1: dbfilename
 suf.paths: aax.paths
 suf.starts: aax.starts
 xx: dbfilename, ("stats")
 aay: 1: xx
 treestats.treeid: aay.treeid
 treestats.nodeid: aay.nodeid
 treestats.preid: aay.preid
 treestats.postid: aay.postid
 treestats.level: aay.level
 treestats.parent: aay.parent
 treestats.label: aay.label

 doquery[]
output 0: text

/ TIME TESTING HARNESS (REMOVE / if you want this)
/ .time.sum[]
/ xf: .TIME.f
/ xt: .TIME.t
/ xx: xf,'xt
/ zz: xx @ <xx[;1]
/ zz             

/ If running as a web server, comment out the following line:
/ text
\\