/ Code author: Dennis Shasha, 2000, 2001, 2002, 2003. / Joint work with Tsong-Li Wang and Kaizhong Zhang. / pathfix.k / k pathfix +q treeincaps +d treeindb / k pathfix +q treeincaps2 +d treeindb / k pathfix +q treeincaps3 +d treeindb +dumpmatch / To do: Have a hash table of size 998 (mod 997, which is a prime). / Take labels and hash them. Then take pairs of labels and hash them. / Create two tables: hashval is the index and then we have countlist and / treelist. Call ths file treecount. / Then we have tree and group list. Call this file treegroup. / So, then we can have a pathfilter program that will take a query / and will spit out pathfix queries for each tree. / / Second program that creates a single hash them. / March 2002: improve hash function / September, 2001: eliminate the preorder and postorder filter. / That depends on the position which we don't want. / April, 2001: in order to allow multiple children of a node with the same / label, just do a post-processing check on the tree that results. Still to do. / March 12, 2001: does a dumpmatch even in the case that we have don't cares / March 12, 2001: multiple levels of don't cares starting at the root. / Algorithm is to observe the first don't care at the root and / make that a partition. All other don't cares are eliminated. / The non-don't care part of the tree is partitioned into maximally sized / subtrees. / March 8 to do: dump matches. / For * or ? in the root of the query tree, perhaps split the problem / into subtrees at the very beginning by allowing any node in the data tree / to match the root and then go on. / March 8, 2001 solution: treat the root specially as a partition and have / it match anything. As of now, we cannot handle a dont care at the root / followed by a don't care below. / March 7, 2001: fixed vldc stuff so that it works even if there / are multiple vldcs one on top of another. / Optimization flag: do some preliminary filtering to reduce the time comparing / the suffix tree. February 2001. optflag: 0 / I don't know whether this helps, please toggle on and off. / / / February 2001 / Variable length don't cares (*) fixed length don't cares / (?) to support XPATH. / For fixed length don't cares, if you have no *, then / do a subsititution for the ? in place and verify length. / When it finds a match, it reports whether the match is to a root or not. / Algorithm: partition the query tree into connected subtrees / having no don't cares (so if a don't care is the parent of several / children, each child is the root of its own tree). / Match each of those. Then glue them together / to see where they match in the data tree. Gluing consists of / of the following: A node x in the data tree is the root of a subtree / that matches the query tree if the following holds: / The partition of the query tree containing the root of the query tree / matches the data tree and the root of the query tree r_all maps to x. / (This works if r_all is not a don't care.) / Consider the path p from / a non-root subtree root r_sub in the query tree to / r_all. Now r_sub maps to possibly many nodex x_sub_1, x_sub_2 etc. / in the data tree. There should be at least one, say x_sub_j such that / the path from x_sub_j to x matches (with * and ?) the path from r_sub / to r_all. / Pictures we need: query tree with vldcs and ?. / Partitions of that query tree. / Matches of that query tree in the data tree. / Paths from root of a partition (r_sub) to the root of the query tree. / Corresponding path from root of a data tree (x_sub_j) to a node x in / data tree to which r_all maps. / Just a note: / Xpath syntax a/b/c -- if anywhere. /a/b/c if starting at root. / a/b//c means variable length don't care between b and c. / a/b[d/e/f]/g for trees. / Output goes to data.out / To do: handle multiple query trees at once. Done January 2001. / This is a second way to search for trees inside of other trees. / The path fix techniques entails an encoding of all suffixes / of all paths and uses a suffix array. / This one answers queries slower (proportional to the number / of data trees, at least partially), but requires only linear storage. / To do: Which nodes in the target tree actually match? / At this point, each line in the output is a different tree. / 9/9/00: superoptimum turns out to be worse than just the clever point / that there is no reason to test two strings for a length longer than / the longest path length. / Another other approach (starting with an idea of Divesh) is to / make use of the fact that if we take the postorder and preorder / traversal of a query tree and each data tree, then / we can determine whether the query tree is "inside" the data tree / (order preserving way) / by seeing whether both the postorder and preorder traversals / of the query tree are gapped (i.e. not necessarily consecutive) substrings / of the data tree. / e.g. abc is a gapped substring of axybzwxcdef / The trouble is that this measure by itself allows the possibility / that a two level tree: / a / b / c / having postorder traversal: bca and preorder: abc / can match the following for example / a / v / b / c / having postorder traversal: bvca and preorder: avbc / So, it's as if there are arbitrary don't cares in the downward / arcs in the data tree. / To refine this, we extend this idea to find good candidates / for roots of subtrees within the data tree that may / exactly match the query tree. / The basic idea is to find those roots that contain all of the / data tree nodes as descendants. / Then there is a path comparison of the final tree. / Actually we have stopped implementation of the Divesh algorithm. / Sept 2000: Construct a suffix array of every path in each tree. / Take each data tree T, / Find all the root to leaf paths in T call them P_T / Then concatenate those paths with a # in between (not in any of the labels) / yielding C_T / Then form of a suffix array from C_T and then search on that. / The space is quadratic in the size of T. / The suffix array construction is / made more efficient by comparing only up to the maximum length / path (since comparisons involving the delimiter character # / are meaningless). / Other techniques are possible, e.g. using a smaller length and / then correcting after the fact. / TIME TESTING / \l time / NEW STUFF -- small data approach / return all the paths for this tree id starting at a root / In detail: each path is formed; let us say it is of length k. / Then there will be k node ids one for each starting point. / e.g. if there is a path from root to leaf in T consisting of letters / a b c and going through nodes 0 3 5, then the result of the path / will be (a, b, c), (T_a, T_b, T_c). / Then the suffix paths will be `c T_c, `bc T_b, `abc T_a. producepathsfromnode:{[treeid; root] i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] j: & treelabel.treeid = treeid labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] paths: extendpath[root; parent; children] labels: converttolabel[labellabels; labelnodeids]'paths :labels } / still use producepaths[treeid] and just use the labels which / are the first argument / END NEW STUFF / CREATE POSTORDER treepost.treeid: () treepost.postid: () treepost.nodeid: () treepost.label: () treepre.treeid: () treepre.preid: () treepre.nodeid: () treepre.label: () / for each node give its preorder number and postorder number and level / where 0 is the root treestats.treeid: () treestats.nodeid: () treestats.preid: () treestats.postid: () treestats.level: () treestats.parent: () treestats.label: () treestatsdb.treeid: () treestatsdb.nodeid: () treestatsdb.preid: () treestatsdb.postid: () treestatsdb.level: () treestatsdb.parent: () treestatsdb.label: () / fill the treestats table. / May not need this because we are going to try a different strategy / whereby we look at roots and see whether the paths between them / are the right ones. fillall:{[treeid] i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] j: & treelabel.treeid = treeid labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] treestats.treeid,: (#j) # treeid z: treelabel.nodeid[j] treestats.label,: labellabels[ 0 / up to qroots[0]. goodmatchold:{[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist; targtree] roots: () levels: () relroots_pre: () relroots_post: () i: 0 / for each qroot, find roots from tree targtree and convert to pre and post while[i < #qrootspre jj: & targtreeslist[i] = targtree x: targrootslist[i][jj] roots,: ,x levels,: ,converttolevel[targtree; x] relroots_pre,: ,converttopre[targtree;x] relroots_post,: ,converttopost[targtree;x] i+: 1 ] / filter out roots that can't be above others / So a root level can survive only if it is above at least one root / of each of the other trees. rootfilter:{[levels] rootlevs: levels[0] otherlevs: 1 _ levels highestother: &/ |/' otherlevs / take lowest of each subtree / and highest of the lowests x:& rootlevs < highestother :x } jj: rootfilter[levels] / get rid of roots that are impossible / trying to cut down cartesian product relroots_pre[0]@: jj relroots_post[0]@: jj roots[0]@: jj / End of levels part if[0 / preorder and postorder stuff is wrong as it depends on sibling order / first match preorder of each cartesian product cart_pre: constructcartprod[relroots_pre] goodones: & sameorder[qrootspre]'cart_pre / this is no good as it / depends on the order of the paths if[0 = #goodones; :0] / now if necessary match the postorders cart_post: constructcartprod[relroots_post] jj: & sameorder[qrootspost]'cart_post[goodones] / need only check those that match the preorder ] / cut out preorder and postorder stuff cart_root: constructcartprod[roots] / cart_root@: goodones[jj] / cut out preorder and postorder stuff / reorder these so first found start at root of targtree troot: findroot[targtree] mm: & cart_root[;0] = troot if[0 < #mm / roots first x: cart_root[mm] y: cart_root _di mm cart_root: x,y ] finalroots: () i: 0 / Now you want to match paths from each root of a subtree in the / data tree to some root in the data. / match holds a pair: query tree subtree root, target tree subtree root / if the path from the former to qroots[0] corresponds to the path / from the latter to its zero. match.qstart::() match.tstart::() match.tend::() while[i < #cart_root flag: comparepath[qtree; qroots; targtree; cart_root[i]] if[flag foundroots,: * cart_root[i],() if[dumpmatchflag dontcareroots,:,` $ ($targtree) ,/: ("_") ,/: ($cart_root[i]) ] :1 ] i+: 1 ] foundroots,: -33 :0 } / determines if some tree matches / Assume that qroots[0] is the root of qtree. / First filter out based on level, / then (now eliminated) preorder and postorder (eliminated because / it depends on sibling order). / Then trace each root associated with qroots[i] where i > 0 / up to qroots[0]. goodmatch:{[qtree; qroots; qrootspre; qrootspost; targtreeslist; targrootslist; targtree] roots: () levels: () i: 0 / for each qroot, find roots from tree targtree and convert to pre and post while[i < #qrootspre jj: & targtreeslist[i] = targtree x: targrootslist[i][jj] roots,: ,x levels,: ,converttolevel[targtree; x] i+: 1 ] / filter out roots that can't be above others / So a root level can survive only if it is above at least one root / of each of the other trees. rootfilter:{[levels] rootlevs: levels[0] otherlevs: 1 _ levels highestother: &/ |/' otherlevs / take lowest of each subtree / and highest of the lowests x:& rootlevs < highestother :x } jj: rootfilter[levels] / get rid of roots that are impossible / trying to cut down cartesian product roots[0]@: jj / End of levels part cart_root: constructcartprod[roots] troot: findroot[targtree] mm: & cart_root[;0] = troot if[0 < #mm / roots first x: cart_root[mm] y: cart_root _di mm cart_root: x,y ] finalroots: () i: 0 / Now you want to match paths from each root of a subtree in the / data tree to some root in the data. / match holds a pair: query tree subtree root, target tree subtree root / if the path from the former to qroots[0] corresponds to the path / from the latter to its zero. match.qstart::() match.tstart::() match.tend::() while[i < #cart_root flag: comparepath[qtree; qroots; targtree; cart_root[i]] if[flag foundroots,: * cart_root[i],() if[dumpmatchflag dontcareroots,:,` $ ($targtree) ,/: ("_") ,/: ($cart_root[i]) ] :1 ] i+: 1 ] foundroots,: -33 :0 } / Do paths to root match in the two trees? / If so, then return 1 else 0. / Side effect to match structure. comparepath:{[qtree; qroots; targtree; troots] numroots: #qroots i: 1 while[i < numroots jj: & (match.qstart = qroots[i]) & (match.tstart = troots[i]) & (match.tend = troots[0]) if[0 = #jj / not yet found x:pathmatch[qtree;qroots[i]; qroots[0]; targtree; troots[i]; troots[0]] if[x / save match match.qstart,: qroots[i] match.tstart,: troots[i] match.tend,: troots[0] ] if[~x; :0] / no match ] i+: 1 ] :1 } miniformold:{[x] y: ,/ ($x) ,\: ("_"); :y} miniform:{[x] if[~ (`"*") _in x : ,/ ($x) ,\: ("_") ] i: & x = `"*" if[0 _in i i: i _di 0 x[1]: ` $ ("*"),($x[1]) ] x[i-1]: ` $ ($x[i-1]) ,\: ("*") x: x _dv `"*" x: x _dv `"**" y: ,/ ($x) ,\: ("_") :y } / get all subsequences of list of length count formlists:{[count; list] out: () i: 0 while[(i < 1 + (#list) - count) :[(count > 1) out,: ,list[i],/: formlists[count-1; (i+1) _ list] out,: ,list[i] ] i+: 1 ] :out } / The query may have variable length don't cares / Match takes this into account. pathmatch:{[qtree;qsubroot; qallroot; targtree; tsubroot; tallroot] qp: findpath[qtree;qsubroot; qallroot] tp: findpath[targtree; tsubroot; tallroot] tps: miniform[tp] if[(`"?" _in qp) & (`"*" _in qp) / fill the ?s in wherever they are found / but do so in a way that allows all possible substitutions to * / So get all subsequences of tp of length number of ?s / and try each one as the substitution of ?. / Not optimal but not too bad either. / Have to do this because ? is replacing a symbol not a single letter. questpos: & qp = `"?" count: #questpos x: ,/formlists[count; tp] / get all subsequences of tp of / length count jj: 0 / Because of conversion to strings, we may allow some extra matches. while[jj < #x qptemp: qp qptemp[questpos]: x[jj] qptemps: miniform[qptemp] if[tps _sm qptemps; :1] jj+: 1 ] :0 ] if[(`"?" _in qp) & (~ `"*" _in qp) if[~ (#qp) = (#tp); :0] / fill the ?s in wherever they are found at their positions. questpos: & qp = `"?" qptemp: qp qptemp[questpos]: tp[questpos] / one to one substitutsion :qptemp ~ tp ] if[~ `"?" _in qp :tps _sm miniform[qp] ] } / find a path in a tree from a subnodeid to a supernodeid / These start out as symbols but end up as strings. / Express as a string. findpath:{[treeid; subnodeid; supernodeid] i: & treestats.treeid = treeid nodes: treestats.nodeid[i] pars: treestats.parent[i] labs: treestats.label[i] j: nodes ? subnodeid if[j = #nodes; !-3] out: labs[j] curnode: subnodeid curpar: pars[j] while[~ curnode = supernodeid j: nodes ? curpar out,: labs[j] curnode: curpar curpar: pars[j] ] :out } / construct the cartesian product of a list of lists constructcartprod:{[listoflists] numlists: #listoflists start: listoflists[0] i: 1 while[i < numlists start: ,/start ,/:\: listoflists[i] i+: 1 ] :start } / given two lists of node numbers (say both using preorder numbering) / return 1 iff the two lists are in the same order / e.g. if preorder numbering is 2 3 1 6 in one case and 3 7 5 8 in the other sameorder:{[list1; list2] (< list1) ~ ( size; :lists] if[2 > size; :*lists] first: lists[0],() jj: first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry kk: & x = size - 1 :first[kk] } / this is a set intersection so we remove duplicates / just like multiintersect except that the maxdist allows / a result if it is present in (#lists) - maxdist / of the input lists threshintersect:{[lists; maxdist] size: #lists if[2 > size; :lists] / first: lists[0],() / no good because might be empty first: (?,/lists) ,() / union / jj: first ?/: (,/ ?:' lists[1+ !(size-1)]) / find indexes in first jj: first ?/: (,/ ?:' lists) / find indexes in first x: @[(1+#first) # 0; jj; + ; 1] x: (-1) _ x / delete missing entry / kk: & x > size - (2 + maxdist) kk: & x > size - (1 + maxdist) :first[kk] } / A faster difference differ:{[x;y] x,: () y,: () i: y ?/: x j: & i = #y :?x[j] } / Count the difference differcount:{[x;y] i: y ?/: x :# & i = #y } / END OF SET STUFF / READ TREE BY CONVERSION FROM A FILE OF THE FORM / # tree|treeid|parentid|childrenid / treea|0|1 2 5 6 8 / treea|2|3 4 / treea|6|7 / / # treelabel|treeid|nodeid|label / treea|0|x / treea|1|y / treea|2|z / treea|3|w / treea|4|v / treea|5|q / treea|6|r / treea|7|p / treea|8|r inputfile: "treein" if[0 < #_i inputfile: _i[0] ] / Null values should be represented by having only blanks in a field. / Tables are separated by a single blank line / a list is numeric if each member is either empty, a period / or a digit isnumeric:{[list] s: ,/list nums: "1234567890" x: nums ?/: s y: (#nums) > |/x / if greater then everything in s met its match if[1 = y; :2] / 2 means integer nums: "1234567890." x: nums ?/: s :(#nums) > |/x / if greater then everything in s met its match / 1 means float and 0 means not a number } / parses a field based on vertical bars getfields:{[line] i: line = "|" j1: &i j2: &~i line @:j2 size: #j1 :(0,(j1 - !size)) _ line } / get rid of blanks at either end of the string delendblanks:{[string] if[0 = #string; :""] if[string ~ ,"" ; :""] i: & ~ string = " " if[(#string) = (#i); :string] if[0 = (#i); :""] string: (- ((#string) - (1 + *|i))) _ string :(*i) _ string } / Handles one line of input at a time according to table schema above. / A little more space efficient than the previous version. processline:{[line] emptyorblank:{[line] (0 = #delendblanks[line])} emptyflag: emptyorblank[line] if[~emptyflag justempty:: 0 if[line[0] = "#" newline: delendblanks'getfields[1 _ line] if[(1 < #newline) globaltable:: *newline alltables,: ` $ globaltable globalatt:: 1 _ newline globalvals:: () / initialize ] / newline ] / table declaration if[~ line[0] = "#" newline: delendblanks'getfields[line] if[(1 < #newline) | (0 < #newline[0]) globalvals,: ,newline ] ] / end of test on data line ] / end of test on line is non-empty if[emptyflag & (0 = justempty) justempty:: 1 / if[ ((#globalatt) > 1) | (~ (#globalatt) = (^globalvals)[0]) / globalvals:: + globalvals / take transpose / ] if[~ (#globalatt) = (#globalvals[0]); !-1] numericflag:() hh: 0 while[hh < #globalatt numericflag,: isnumeric[globalvals[;hh]] hh+: 1 ] inindex:: 0 while[inindex < #globalatt / table.att[2] :: ` $ vals[3] string1: globaltable, (".") string1,: globalatt[inindex] teststring: ("0 = #"), string1 / For big sizes, uncomment following two statements / User will get an error message, but it will work. / x: @[.: ; teststring; :] / currentlyempty: *x / error means empty / For big sizes, comment following currentlyempty: 1 / if empty, then assign, else append string2: :[currentlyempty; (":: "); (",: ")] string3: :[0 = numericflag[inindex] "` $ " :[1 = numericflag[inindex]; "0.0 $ "; "0 $ "]] string3,: "globalvals[;inindex]" . string1, string2, string3 if[1 = numericflag[inindex] globalindexes:: & 0N = . string1 string3: string1, ("[globalindexes] :: `") . string3 ] inindex+: 1 ] ] } inputfromfile:{[filename] a: 0: filename globalvals:: () oldglobalvals:: () / should not be necessary justempty:: 1 / as if we've just seen an empty line. should process table a,: ,,"" processline'a makeintlist:{[line] (. $line), ()} tree.childrenid:: makeintlist'tree.childrenid } / DISPLAY. spit:{[list] numtodisplay: 6 x: ($list) ,\: " " out: () while[0 < #x if[~ numtodisplay < #x out,: , (-1) _ ,/x x: () ] if[numtodisplay < #x out,: , (-1) _ ,/x[!numtodisplay] x: 6 _ x ] ] :out } / spit out a tree in indented notation. dumptree:{[treeid] out: ,("Tree: "), ($treeid) indent: 0 ikids: & tree.treeid = treeid parent: tree.parentid[ikids] children: tree.childrenid[ikids] ilab: & treelabel.treeid = treeid node: treelabel.nodeid[ilab] label: treelabel.label[ilab] if[0 < #parent out,: dump[parent; children; node; label; 0; 0] ] if[0 = #parent out,: ,($label) ] :out } / this is a recursive procedure / that does the actual printing where the level of indent is proportional / to the level. dump:{[parent; children; node; label; parentindex; indent] i: parentindex out: ,(indent # " "), ($label[node ? parent[i]]),(" ("),($parent[i]),(")") indent+: 2 j: 0 while[j < #children[i] kid: children[i;j] ii: parent ? kid if[ii < #parent out,: dump[parent; children; node; label; ii; indent] ] if[ii = #parent out,: ,(indent # " "), ($label[node ? kid]),(" ("),($kid),(")") ] j+: 1 ] :out } / END OF DISPLAY. / When filename is missing, then input data comes from tempin / When num is missing then an all-pairs test is done. / Otherwise the first num trees from the input data are compared against / all the others. / k pathfix / (+b | +q +d | +f ) / [+m (0)] / Program pathfix with the +b will build a database / from that file and will call it db with an extension / (either .l or .K depending on the operating system but you need not care). / Program pathfix with the +q queryfilename and +d dbfilename / will take the query file and the database file already produced (but / you need not specify the extension) and find out where the query is / in the database. / Program pathfix with the +f will take a file using / the normal tree format and will compare the first tree against all others. / +m maxdist is the maximum number of differences to allow. / +dumpquery says to dump the query tree in indented form / to the file dumpedquery / +dumpdb says to dump the database trees in indented form to the file dumpeddb / In this case, a difference is a path that is different. opcode: `"randtree" / only for testing but that's the default output: "data.out" maxdist: 0 filenamefordb: "treein" filename: "treein" dbfilename: "dbfile" / should have a .l or .K extension, but take this away queryfilename: "queryfile" dumpqueryflag: 0 dumpdbflag: 0 dumpmatchflag: 0 specifictreeflag: 0 / if 1 then we are looking for specific trees / and they will be placed in specifictreefile specifictreeindexes: () / to be used in findinsuf specifictreefile: "ljlkjl" specifictrees: () / list of trees to look at okflag: 1 processargs:{[args] i: 0 numinputs: 0 while[(i < #args) & okflag okflag:: 0 addextra: 0 if[args[i] ~ "+n" opcode:: `"no-op" okflag:: 1 ] if[args[i] ~ "+b" if[(i+1) < #args filenamefordb:: args[i+1] addextra: 1 numinputs+: 1 opcode:: `"+b" okflag:: 1 ] ] if[args[i] ~ "+q" if[(i+1) < #args queryfilename:: args[i+1] addextra: 1 numinputs+: 0.5 opcode:: `"+d" okflag:: 1 ] ] if[args[i] ~ "+d" if[(i+1) < #args dbfilename:: args[i+1] kk: dbfilename ? "." if[kk < #dbfilename; dbfilename@: !kk] addextra: 1 numinputs+: 0.5 opcode:: `"+d" okflag:: 1 ] ] if[args[i] ~ "+f" if[(i+1) < #args filename:: args[i+1] addextra: 1 numinputs+: 1 opcode:: `"+f" okflag:: 1 ] ] if[args[i] ~ "+o" if[(i+1) < #args output:: args[i+1] addextra: 1 numinputs+: 1 okflag:: 1 ] ] if[args[i] ~ "+s" if[(i+1) < #args specifictreefile:: args[i+1] specifictrees:: 1: specifictreefile specifictreeflag:: 1 addextra: 1 numinputs+: 1 okflag:: 1 ] ] if[args[i] ~ "+m" if[(i+1) < #args maxdist:: 0 $ args[i+1] addextra: 1 okflag:: 1 ] ] if[args[i] ~ "+dumpquery" dumpqueryflag:: 1 okflag:: 1 ] if[args[i] ~ "+dumpdb" dumpdbflag:: 1 okflag:: 1 ] if[args[i] ~ "+dumpmatch" / the matching tree dumpmatchflag:: 1 okflag:: 1 ] i+: 1 + addextra if[0 = okflag x: "format is k pathfix \n " x,: " (+b | +q +d | +f ) \n " x,: " [+m (0)] [+dumpquery] [+dumpdb] [+dumpmatch]\n " x,: " [+o ] \n " x,: " [+n ] \n " x,: " e.g. k pathfix +b treein \n " x,: " will form treeindb.l or treeindb.K \n" x,: " k pathfix +n \n" x,: " means load the functions by themselves.\n" x,: " k pathfix +q treeinquery +d treeindb \n" x,: " will find the query tree in treeindb.\n" x,: " k pathfix +q treeinquery +d treeindb +o fooout\n" x,: " will find the query tree in treeindb and put output in fooout.\n" x,: " k pathfix +q treeinquery +d treeindb +dumpquery +dumpmatch +m 2\n" x,: " will find the query tree in treeindb within distance 2, put the\n" x,: " query tree in indented form in the file dumpedquery,\n" x,: " and the matching trees in the file dumpedmatch.\n" x,: " k pathfix +f treein +dumpquery +dumpdb +m 2\n" x,: " will make the first tree in treein be the query tree;\n" x,: " the remaining trees in treein be the database trees; \n" x,: " will look for the query tree among the database trees\n" x,: " within distance 2; will put the query tree in\n" x,: " indented form in the file dumpedquery, and the\n" x,: " database trees in indented form in the file dumpedb.\n" x,: " NB. When using +q and +d, the database trees are not available to be dumped.\n" x,: " NB. Similarly when using +b, there is no query tree.\n" ` 0: x . "\\\\" ] ] } / STATISTICS avg:{(+/ x) % # x} var:{avg[_sqr x] - _sqr avg[x]} std:{_sqrt var[x]} cov:{avg[x * y] - avg[x] * avg[y]} corr:{ (cov[x;y])%((std[x]) * (std[y]))} / END OF STATISTICS / SUFFIX TREE PART / Construct a suffix array of a string and a tree_node id of the / beginning of the suffix. / This consists of forming all the suffixes, then then sort them. / Create an array that holds only the pointers to the array. / form suffixes and then spit out the array as symbols / with associated node positions. makesuffixold:{[string; treevec] cuts: !(#string) out: cuts _\: string j: s2 and -1 if s1 < s2 / assumes we have strings strcmp:{[s1; s2] if[s1 ~ s2; :0] if[((s1 < s2) ? 1) < ((s1 > s2) ? 1); :-1] :1 } strcmp:{[s1; s2] f1: (s1 < s2) ? 1 f2: (s2 < s1) ? 1 / if[f1 < f2; :-1] / if[f2 < f1; :1] / :0 : :[f1 < f2; -1; f2 < f1; 1; 0] } / are there count indexes between any two dividing points / assume indexes are sorted. isenough:{[indexes; divides; count; query; labelarray] while[0 < #divides ii: #& indexes < divides[0] if[~ count > ii; :1] / there are enough divides: 1 _ divides indexes: ii _ indexes ] :0 } foundit:{[indexes; divides; count; query; labelarray] out: () while[0 < #divides ii: & indexes < divides[0] mycount: #ii if[~ count > mycount if[(query ~ labelarray[indexes[ii]]) & (1 = |/ -': indexes[ii]) out,: *indexes[ii] ] ] divides: 1 _ divides indexes: mycount _ indexes ] :out } / Find out which roots have the query string / The query is in string notation. / suffixarray is a triple (allstring; sufarray; rootsufarray) / consisting of a string and indexes into that string / where nodes begin. / roots are the places in the tree where a path begins. / The trouble is that we need a sufarray and a rootsubarray because / the indexes for the roots are different from in the string since each root / represents only the beginning of a string. / DOESN'T QUITE WORK searchintreenew:{[query; suffixarray; rootsintree] flag: 1 if[optflag / try to filter out this part of suffixarray if[(1 = #query) ind: & suffixarray[0] = query[0] :rootsintree[ind] / that's all you need ] if[(1 < #query) / linear time on suffixarray[0] indexes: intersectleftindexes_multi[suffixarray[0]; query] indexes@: dlen / e.g. if testind = 8, qlen = 1, and dlen = 9, then ok x: strcmp[query; dbstring[testind+!qlen]] if[x = 1 / query is too big indexes: (1 + half) _ indexes rootindexes: (1 + half) _ rootindexes ] if[x = -1 indexes@: !half rootindexes@: !half ] if[x = 0 / match, may be one of several places / all are consecutive out,: roottestind jj: half+1 flag: 1 while[(jj < #indexes) & (flag) testind: indexes[jj] roottestind: rootindexes[jj] if[(testind + qlen) > dlen flag: 0 / no way ] if[~ (testind + qlen) > dlen x: strcmp[query;dbstring[testind+!qlen]] :[x = 0 out,: roottestind flag: 0] ] jj+: 1 ] jj: half-1 flag: 1 while[(jj > -1) & (flag) testind: indexes[jj] roottestind: rootindexes[jj] if[(testind + qlen) > dlen flag: 0 / no way ] if[~ (testind + qlen) > dlen x: strcmp[query;dbstring[testind+!qlen]] :[x = 0 out,: roottestind flag: 0] ] jj-: 1 ] indexes: () rootindexes: () ] ] if[(testind + qlen) > dlen endlen: dlen - testind / distance in dbstring x: strcmp[query[!endlen]; dbstring[testind+!endlen]] if[x > (-1) / even if 0, this is not a good position / since we had to shorten query / so real match should be greater indexes: (1 + half) _ indexes rootindexes: (1 + half) _ rootindexes ] if[~ x > (-1) indexes@: !half rootindexes@: !half ] ] ] :out } / END SUFFIX TREE PART / PATHFIX PROCESSING / take a path of numbers and extend it through all its children extendpath:{[numberpath; parent; children] i: parent ? *|numberpath if[i = #parent; :,numberpath] if[1 = #children[i] y: ,/extendpath[;parent; children]',numberpath,children[i] :y ] allpaths: numberpath,/:children[i] x: ,/extendpath[; parent;children]'allpaths :x } / list of labels formlab:{[x] :x} / assuming a symbol list is going to be extended to strings with _ in between, / here are the beginnings of the strings findbeginnings:{[symblist] slist: $symblist counts: 1 + #:'slist :0, +\ (-1) _ counts } / single label that is one long label / formlabnew:{[x] y: (-1) _ ,/ ($x) ,\: ("_"); :y} formlabnew:{[x] y: ("_") , (-1) _ ,/ ($x) ,\: ("_"); :y} / do want initial _ but not trailing one / take a path of node ids and find the corresponding labels converttolabel: {[labellabels; labelnodeids; path] i: intersectleftindexes[labelnodeids; path,()] :formlab[labellabels[i]] } / take a path of node ids and find the corresponding labels addtreeid:{[treeid;path] : ` $ ($treeid) ,/: ("_") ,/: ($path) } / return all the paths for this tree id / and then convert this to a suffix tree / In detail: each path is formed; let us say it is of length k. / Then there will be k node ids one for each starting point. / e.g. if there is a path from root to leaf in T consisting of letters / a b c and going through nodes 0 3 5, then the result of the pah / will be (a, b, c), (T_a, T_b, T_c). / Then the suffix paths will be `c T_c, `bc T_b, `abc T_a. producepaths:{[treeid] i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] j: & treelabel.treeid = treeid labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] root: * differ[parent; ,/children] paths: extendpath[root; parent; children] / For Cole method, must extend from every node along the / spine of the query tree. labels: converttolabel[labellabels; labelnodeids]'paths newpaths: :[suffixarray; addtreeid[treeid]'paths; paths] / append numbers to trees or not / x: makesuffix'[labels; newpaths] / :x :(labels;newpaths) } / In case where we have variable length don't cares in query tree, / form a partial tree given a bunch of nodes. Label this tree by its root. / Otherwise, algorithm is similar to producepaths. producepaths_partial:{[treeid; root; nodes] i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] mm: intersectleftindexes_multi[parent;nodes] parent@: mm children@: mm children: intersect[nodes]'children / don't go beyond certain children counts: #:' children mm2: & 0 < counts parent@: mm2 children@: mm2 j: & treelabel.treeid = treeid labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] paths: extendpath[root; parent; children] labels: converttolabel[labellabels; labelnodeids]'paths newpaths: :[suffixarray addtreeid[` $ ($treeid), ("+"), ($root)]'paths paths] / append numbers to trees or not / x: makesuffix'[labels; newpaths] / :x :(labels;newpaths) } / cut up a tree having vldcs and question marks. findpartitions:{[treeid] partroot:: () / have a root for each partition partnodes:: () / and a bunch of nodes i: & tree.treeid = treeid parent: tree.parentid[i] children: tree.childrenid[i] i: & treelabel.treeid = treeid nodes: treelabel.nodeid[i] labels: treelabel.label[i] root: * differ[parent; ,/children] partroot,: root / this can be a *??? partnodes,: ,root rootlab: $ labels[nodes ? root] dontcareflag:: ("*") = rootlab[0] dontcareflag|: ("?") = rootlab[0] if[~dontcareflag recurfindpartitions[root; parent; children; nodes; labels; root] ] if[dontcareflag / root is a don't care j: parent ? root kids: children[j] kids: ,//finddontcare[parent;children;nodes;labels]'kids partroot,: kids partnodes,: ,:'kids / assumes no dontcares among immediate children of root / (march8) / If we wanted that, then we could recursively look for kids / that are don't cares and make them further separate partitions / by putting them in partroot and partnodes. Then we would / also need to change other places where we handle dontcareflag. recurfindpartitions[; parent; children; nodes; labels; ]'[kids;kids] ] :(partroot; partnodes) } / find the non-don't cares underneath mynode finddontcare:{[parent;children;nodes;labels; mynode] out: () j: nodes ? mynode mylab: $labels[j] flag: ("*") ~ mylab[0] flag|: ("?") ~ mylab[0] if[~ flag :mynode ] if[flag jj: parent ? mynode if[jj < #parent mykids: children[jj] out,:,/finddontcare[parent;children;nodes;labels]'mykids ] ] :out } / for a tree that has vldcs, divide up the tree / by identifying root and partitions / This is starting a new one at root. recurfindpartitions:{[currentroot; parent; children; nodes; labels; next] j: parent ? next if[j = #parent; :,(root;root)] / root is by itself mm: partroot ? currentroot kids: children[j] i: 0 labelind: nodes ?/: kids while[i < #kids kidlab: $ labels[labelind[i]] flag: ("*") = kidlab[0] / we don't care about vldcs having counts flag|: ("?") = kidlab[0] / we don't care about vldcs having counts / if we want to match *[3] meaning three characters, / then put in three question marks. if[flag jj: parent ? kids[i] / want children of this if[jj < #parent grandkids: children[jj] kk: 0 / Each grandkid starts its own tree which / is what we want. / If a grandkid is a don't care then skip to next gen while[kk < #grandkids x: nodes ? grandkids[kk] kidlab: $ labels[x] flag2: ("*") = kidlab[0] flag2|: ("?") = kidlab[0] if[~flag2 partroot,: grandkids[kk] partnodes,: ,grandkids[kk] recurfindpartitions[grandkids[kk]; parent; children; nodes; labels; grandkids[kk]] ] if[flag2 / skip to next generation jj1: parent ? grandkids[kk] if[jj1 < #parent grandkids,: children[jj1] ] ] kk+: 1 ] ] ] if[~ flag partnodes[mm],: kids[i] recurfindpartitions[currentroot; parent; children; nodes; labels; kids[i]] ] i+: 1 ] :out } / for a tree that has vldcs, divide up the tree / by identifying root and partitions / This is starting a new one at root. recurfindpartitionsold:{[currentroot; parent; children; nodes; labels; next] j: parent ? next if[j = #parent; :,(root;root)] / root is by itself mm: partroot ? currentroot kids: children[j] i: 0 labelind: nodes ?/: kids while[i < #kids kidlab: $ labels[labelind[i]] flag: ("*") = kidlab[0] / we don't care about vldcs having counts flag|: ("?") = kidlab[0] / we don't care about vldcs having counts / if we want to match *[3] meaning three characters, / then put in three question marks. if[flag jj: parent ? kids[i] / want children of this if[jj < #parent grandkids: children[jj] kk: 0 while[kk < #grandkids partroot,: grandkids[kk] partnodes,: ,grandkids[kk] recurfindpartitions[grandkids[kk]; parent; children; nodes; labels; grandkids[kk]] kk+: 1 ] ] ] if[~ flag partnodes[mm],: kids[i] recurfindpartitions[currentroot; parent; children; nodes; labels; kids[i]] ] i+: 1 ] :out } / Sept 2000, perhaps don't include the rtree prefix it takes too much space. / What this produces is a pair consisting / of a string of all the paths in the tree, / a list of start locations within that string such that the list / is lexicographically ordered. / Following the pair are the locations according to preorder where / the path begins. suffixproducepaths:{[treeid] pair: producepaths[treeid] if[0 pair[0]: (`a `b `c `a `d `a `e `f) pair[1]: (`"0" `"1" `"2" `"0" `"3" `"0" `"4" `"5") ] / :makesuffixold'[pair[0]; pair[1]] / sufarrays: makesuffix'pair[0] / roots: pair[1] @' sufarrays maxlen: |/#:' pair[0] / maximum length of any reasonable suffix / that is any suffix not stopping at a # allsymbs: ,/ pair[0] ,\: (`"#") allstring: formlabnew[allsymbs] / string with _ in between stringind: findbeginnings[allsymbs] rootsufarray: makesuffix[maxlen; allsymbs] / done at the level of symbols / gives an array in lexicographic order x: ,/ pair[1] ,\: `"#" / pair[1] are locations in the tree of nodes / either in preorder or someorder :((allsymbs; rootsufarray); x) / early return / roots: x @ rootsufarray / roots according to lexicographic order of paths roots: x / don't rearrange roots, since you will get a pointer. sufarray: stringind[rootsufarray] numstops: # pair[0] / all of the following cuts assume that the delimiter comes first!! / possible bug in the future ??? sufarray: numstops _ sufarray rootsufarray: numstops _ rootsufarray / note, don't take away initial roots (the ones that match the delimiter) / to ensure that rootsufarray still works. :( (allstring; sufarray; rootsufarray); roots) } / extend a suffix tree given paths and start nodes buildsuffixtree:{[treeid; paths; startlists] buildone[treeid]'[,/paths;,/startlists] } / actually build the suffix tree / Start at the anchor and put the startlist there and then descend. / suffixtree.nextchars: () / an array of suffix tree chars / suffixtree.nextnodes: () / an array of suffix tree nodes / suffixtree.startnodes: () / nodes in the original trees / suffixtree.starttrees: () / original tree name / the branch from the root of the suffix tree / anchorsuffixtree.nextchar: () / first char of a path / anchorsuffixtree.nextnode: () / next place to look buildone:{[treeid; path; startlist] p: path first: *p numstarts: #startlist ii: anchorsuffixtree.nextchar ? first if[ii = #anchorsuffixtree.nextchar / not present anchorsuffixtree.nextchar,: first globalnextnode+: 1 anchorsuffixtree.nextnode,: globalnextnode / initialize point in suffix tree suffixtree.nextchars,: ,() suffixtree.nextnodes,: ,() suffixtree.startnodes,: , () suffixtree.starttrees,: , () ] nextindex: anchorsuffixtree.nextnode[ii] p: 1 _ p while[0 < #p first: *p chars: suffixtree.nextchars[nextindex] kk: chars ? first if[kk = #chars suffixtree.nextchars[nextindex],: first globalnextnode+: 1 suffixtree.nextnodes[nextindex],: globalnextnode / initialize point in suffix tree suffixtree.nextchars,: ,() suffixtree.nextnodes,: ,() suffixtree.startnodes,: , () suffixtree.starttrees,: , () newindex: globalnextnode ] if[kk < #chars newindex: suffixtree.nextnodes[nextindex; kk] ] nextindex: newindex p: 1 _ p ] suffixtree.startnodes[nextindex],: startlist suffixtree.starttrees[nextindex],: numstarts # treeid } / suffixtree.nextchars: () / an array of suffix tree chars / suffixtree.nextnodes: () / an array of suffix tree nodes / the branch from the root of the suffix tree / anchorsuffixtree.nextchar: () / first char of a path / anchorsuffixtree.nextnode: () / next place to look / anchorsuffixtree.startnodes: () / nodes in the original trees / anchorsuffixtree.starttrees: () / original tree name reorganizesuffixtree:{[] i: < anchorsuffixtree.nextchar anchorsuffixtree.nextnode@: i anchorsuffixtree.nextchar@: i jj: 0 while[jj < #suffixtree.nextchars if[1 < #suffixtree.nextchars[jj] x: < suffixtree.nextchars[jj] suffixtree.nextchars[jj]@: x suffixtree.nextnodes[jj]@: x / don't need to change other suffix tree terms ] jj+: 1 ] } / is s1 a prefix of s2 isprefix:{[s1; s2] if[(#s1) > #s2; :0] :s1 ~ s2[!#s1] } / find indexes of local superstrings findsuperstrings:{[paths] jj: 0 supers: () while[jj < (#paths) - 1 if[~ prefix[paths[jj]; paths[jj+1]] supers,: jj ] jj+: 1 ] supers,: jj / last one } / creates structure of form (path; endpositions in path; starts / for those end positions) reducetolongest:{[paths ; starts] outpath: () outpositions: () outstarts: () superstrings: findsuperstrings[paths] groups: superstrings _ !#paths jj: 0 while[jj < #groups outpath,: ,paths[*|groups[jj]] jj+: 1 ] :(outpath; outpositions; outstarts) } / END OF PATHFIX tree / SUFFIX ARRAY SEARCH / Suffix array is in suf / At this point we handle wild cards separately. / So we won't see any here. / JULY 2001: 1. If we have filter information, then look only in / parts of the suffix tree where the relevant tree is present. / 2. If we know the threshold and a particular path is not / present, then mark that tree as of distance at least 1, so / perhaps need not be looked at. / Sort the query paths by length to take full advantage of this. / suf.starts has the trees findinsuf:{[labelpath] if[doneflag; :()] / we know there are no matches if[specifictreeflag if[0 = #specifictreeindexes specifictreeindexes:: intersectleftindexes[suf.treeids; specifictrees] ] ii: specifictreeindexes x: searchintree[labelpath]'[suf.paths[ii]; suf.starts[ii]] counts: #:' x jj: & counts = 0 / no matches suf.diffs[ii[jj]]+: 1 / kk: & suf.diffs > maxdist / specifictrees:: differ[specifictrees; suf.treeids[kk]] kk: & suf.diffs[ii[jj]] > maxdist / only these can change if[0 < #kk / specifictrees:: differ[specifictrees; suf.treeids[ii[jj[kk]]]] xx: differ[ii; ii[jj[kk]]] specifictreeindexes:: xx if[0 = #xx; doneflag:: 1] ] ] if[~ specifictreeflag x: searchintree[labelpath]'[suf.paths; suf.starts] ] :?,/x } / given that we know where there a root that matches, find the / specific nodes that match locateinsufold:{[rootnodes; labelpath] x: locateintreeold[labelpath; rootnodes]'[suf.paths; suf.starts] :x } treeid:{[x] y: $x; i: y ? "_"; :y[!i]} locateinsuf:{[rootnode; labelpaths] rootnode: * rootnode,() / make sure it's not a list treeid:{[x] y: $x; i: y ? "_"; :y[!i]} i: ((treeid[rootnode]) ~/: (treeid'*:'suf.starts)) ? 1 x: locateintree[labelpaths; rootnode;suf.paths[i]; suf.starts[i]] :x } / END OF SUFFIX ARRAY SEARCH / FILTER FUNCTIONS / Update the filter tree counts and tree grups / currently based on labels and label pairs. updatetreecount:{[mytree; group] / Now put in groups mmm: treegroup.tree ? mytree flag: mmm < #treegroup.tree if[flag treegroup.groups[mmm],: group ] if[~ flag treegroup.groups,: ,group treegroup.tree,: mytree ] / Now do treecount. mmm: & mytree _in/: treecount.trees mmm1: treecount.trees[mmm] ?\: mytree / positions in each of those if[0 < #mmm qq:0 while[qq < #mmm treecount.counts[mmm[qq];mmm1[qq]]:: 0 / reset counts to 0. qq+: 1 ] ] j: & treelabel.treeid = mytree labelnodeids: treelabel.nodeid[j] labellabels: treelabel.label[j] i: & tree.treeid = mytree parent: tree.parentid[i] parentlab: labellabels[labelnodeids ?/: parent] children: tree.childrenid[i] counts: #:' children newpars: ,/counts #' parentlab newchil: labellabels[labelnodeids ?/: ,/children] parchil: ($newpars) ,' ("_") ,/: ($newchil) x: parchil, $labellabels part: = x counts: #:' part uniqs: ?x hashinsert[mytree]'[uniqs; counts] } / BIG CHANGE / upsert count of mytree in the data structure hashinsert:{[mytree; string; count] x: _ic'string internalstate: 0 while[4 < #x internalstate+: */x[!4] internalstate: internalstate ! modval x: 4 _ x ] internalstate+: */x i: internalstate ! modval xxx: treecount.trees[i] jj: xxx ? mytree flag: jj < #xxx if[flag treecount.counts[i;jj]+: count ] if[~ flag treecount.trees[i],: mytree treecount.counts[i],: count ] } hashinsert:{[mytree; string; count] x: 31 _sv _ic string i: x ! modval xxx: treecount.trees[i] jj: xxx ? mytree flag: jj < #xxx if[flag treecount.counts[i;jj]+: count ] if[~ flag treecount.trees[i],: mytree treecount.counts[i],: count ] } / END OF FILTER FUNCTIONS / SUFFIX TREE SEARCH / find a path in the suffix tree / suffixtree.nextchars: () / an array of suffix tree chars / suffixtree.nextnodes: () / an array of suffix tree nodes / suffixtree.startnodes: () / nodes in the original trees / suffixtree.starttrees: () / original tree name / the branch from the root of the suffix tree / anchorsuffixtree.nextchar: () / first char of a path / anchorsuffixtree.nextnode: () / next place to look findinsuffixtree:{[labelpath] first: *labelpath outnode: () / nodes that match outtree: () / trees that match i: anchorsuffixtree.nextchar _bin first if[i = #anchorsuffixtree.nextchar; :(outnode; outtree)] if[~ first = anchorsuffixtree.nextchar[i]; :(outnode; outtree)] nextindex: anchorsuffixtree.nextnode[i] p: 1 _ labelpath while[0 < #p first: *p i: suffixtree.nextchars[nextindex] _bin first if[i = # suffixtree.nextchars[nextindex]; :(();())] if[~ first = suffixtree.nextchars[nextindex;i]; :(();())] nextindex: suffixtree.nextnodes[nextindex;i] p: 1 _ p ] allnodes: ?,/finddesc[nextindex] / find all descendant indexes x:(,/suffixtree.starttrees[allnodes]; ,/suffixtree.startnodes[allnodes]) :x } / find descendant nodes from node id finddesc:{[nodeid] out: nodeid out,: ,/ finddesc'suffixtree.nextnodes[nodeid] :out } / END OF SUFFIX TREE SEARCH / COMPRESSION / Go along the suffix tree and find a prefix that gives / max compression: at least a minimum length minlength, / First node is a huge array with all strings under minlength / and their corresponding stop points. / If a string in the first node is too short, then there / is a forward pointer to the next node. / GENERATE DATA / generate kids and fill in data structures genkids:{[treeid; labels; nodeids; rootlabel; root] treelabel.treeid,: treeid treelabel.nodeid,: root treelabel.label,: rootlabel numlabels: #labels choices: (numlabels) & (#nodeids) if[0 = choices; :()] kids: nodeids[choices _draw -#nodeids] x: *1 _draw #kids / take only some of those kids@: !x if[0 = #kids; :nodeids] kidlabels: labels[(#kids) _draw -numlabels] tree.treeid,: treeid tree.parentid,: root tree.childrenid,: ,kids remaining: differ[nodeids; kids] / not available to kids i: 0 while[i < #kids remaining: genkids[treeid; labels; remaining; kidlabels[i]; kids[i]] i+: 1 ] :remaining } / generate a tree by putting data into / tree(`childrenid `treeid `parentid) / and treelabel(`treeid `nodeid `label). gentree:{[size; treeid] labels: `a1 `b1 `c1 `d1 `e1 numlabels: #labels nodeids: !size root: nodeids[*1 _draw #nodeids] rootlabel: labels[* 1 _draw #labels] nodeids: nodeids _dv root genkids[treeid; labels; nodeids; rootlabel; root] } / generate kids of a subtree genkidssubtree:{[treeid; labels; nodes; rootlabel; root; parents; children] treelabel.treeid,: treeid treelabel.nodeid,: root treelabel.label,: rootlabel i: parents ? root if[i = #parents; :()] kids: children[i] x: 1 + *1 _draw #kids / take only some of those, but at least 1 kids@: !(x) if[0 = #kids; :()] ii: intersectleftindexes[nodes; kids] kidlabels: labels[ii] tree.treeid,: treeid tree.parentid,: root tree.childrenid,: ,kids i: 0 while[i < #kids genkidssubtree[treeid; labels; nodes; kidlabels[i]; kids[i]; parents; children] i+: 1 ] :() } / gen an interior subtree of a tree / tree(`childrenid `treeid `parentid) / and treelabel(`treeid `nodeid `label). gensubtree:{[treeid] j: & tree.treeid = treeid if[0 = #j; :()] parents: tree.parentid[j] children: tree.childrenid[j] k: & treelabel.treeid = treeid labelnodes: treelabel.nodeid[k] labellabels: treelabel.label[k] j1: * 1 _draw #parents n: parents[j1] j: labelnodes ? n root: labelnodes[j] rootlabel: labellabels[j] newtreeid: ` $ ($treeid), ("sub") genkidssubtree[newtreeid; labellabels; labelnodes; rootlabel; root; parents; children] } / DATA treelabel.treeid: () treelabel.nodeid: () treelabel.label: () tree.treeid: () tree.parentid: () tree.childrenid: () / in this new version of the suffix tree, we have a / node in the suffix tree, an array of nextlet / an array of nextnode / and an array of start nodes from the original tree suffixtree.nextchars: () / an array of suffix tree chars suffixtree.nextnodes: () / an array of suffix tree nodes suffixtree.startnodes: () / nodes in the original trees suffixtree.starttrees: () / original tree name / the branch from the root of the suffix tree anchorsuffixtree.nextchar: () / first char of a path anchorsuffixtree.nextnode: () / next place to look globalnextnode: -1 / global index of the next node. / EXECUTION / TIME TESTING HARNESS (REMOVE / if you want this) / .time.set`.k / start: _t / END OF TIME TESTING HARNESS (REMOVE / if you want this) text:() starttime: _t processargs[_i] if[(opcode = `randtree) numtrees: 10 size: 50 treeids: ` $ ("rtree") ,/: ($!numtrees) gentree[size]' treeids / gensubtree'treeids text,: ,"Input from random tree generation. " ] if[opcode = `"+b" inputfromfile[filenamefordb] text,: ,("Input from build file "), filenamefordb ] if[(opcode = `"+d" ) text,: ,("Input from database file "), dbfilename text,: ,(" and from query file "), queryfilename ] if[opcode = `"+f" inputfromfile[filename] text,: ,("Input from file "), filename, (" whose first tree is the query") ] if[dumpdbflag if[opcode = `"+b" "dumpeddb" 0: ,/dumptree'?treelabel.treeid ] if[opcode = `"+f" "dumpeddb" 0: ,/dumptree'(1 _ ?treelabel.treeid) ] ] suffixarray: 1 if[(1 = suffixarray) & (opcode _in `"+b" `"+f" `randtree) / form suffix array if[opcode _in `"+b" `randtree uniqtrees: ? tree.treeid / this part is for the +b case modval: 997 treecount.trees: (modval) # ,() / list of trees at a hash loc treecount.counts: (modval) # ,() / list of counts at a hash lo treegroup.tree: () treegroup.groups: () / this is a list xx: ` $ filenamefordb, ("db") / x1: @[.:; "xtreecount: 1: \"treecount\""; :] x1: @[.:; "1: \"treecount\""; :] if[0 = x1[0]; pair1: . x1[1]] / x2: @[.:; "xtreegroup: 1: \"treegroup\""; :] x2: @[.:; "1: \"treegroup\""; :] if[0 = x2[0]; pair2: . x2[1]] if[0 = x1[0] / there was a treecount treecount.trees: pair1[0][1] treecount.counts: pair1[1][1] treegroup.tree: pair2[0][1] treegroup.groups: pair2[1][1] ] ] if[opcode = `"+f" uniqtrees: 1 _ ? tree.treeid ] allpaths: () allstarts: () jj: 0 while[jj < #uniqtrees mytree: uniqtrees[jj] pair: suffixproducepaths[mytree] / for each tree produce suffixes allpaths,: ,pair[0] / path in symbols allstarts,: ,pair[1] / path in tree ids if[opcode = `"+b" / Add elements from mytree to the treecount table. / You will process these with +q. updatetreecount[mytree; xx] ] jj+: 1 ] / part: = allpaths / unneeded suf.starts: allstarts allstarts: () / cleanup suf.paths: allpaths if[opcode = `"+b" xx: filenamefordb, ("db") xx 1: suf zx: fillall'?tree.treeid xx: xx, ("stats") xx 1: treestats "treecount" 1: treecount / filtering stuff "treegroup" 1: treegroup ] / triples: reducetolongest[suf.paths ; suf.starts] / creates structure of form (path; endpositions in path; starts / for those end positions) / compressedsuf.path: triples[;0] / compressedsuf.positions: triples[;1] / compressedsuf.starts: triples[;2] / suffix array formed ] / suffix tree doesn't follow the options probably since i think it's too slow if[(0 = suffixarray) & (opcode _in `"+b" `"+f") if[opcode = `"+b" treelist: ? tree.treeid ] if[opcode = `"+f" treelist: 1 _ ? tree.treeid ] while[0 < #treelist mytree: *treelist pair: suffixproducepaths[mytree] paths: pair[;0] starts: pair[;1] part: = paths newstarts: ,/' starts[part] newpaths: ? paths i: < newpaths buildsuffixtree[mytree; newpaths[i]; newstarts[i]] / put data in suffixtree treelist: 1 _ treelist ] reorganizesuffixtree[] / sort everything in the database ] / query part doquery:{[] matched: () if[ opcode = `"+d" inputfromfile[queryfilename] ] if[opcode = `randtree / here are input patterns only used for random testing treelabelsize:: #treelabel.treeid treesize:: #tree.treeid gensubtree[*treelabel.treeid] fillall'?tree.treeid treelabel.treeid:: treelabelsize _ treelabel.treeid treelabel.nodeid:: treelabelsize _ treelabel.nodeid treelabel.label:: treelabelsize _ treelabel.label tree.treeid:: treesize _ tree.treeid tree.parentid:: treesize _ tree.parentid tree.childrenid:: treesize _ tree.childrenid ] dbbuild: _t if[opcode = `"+f" / query part fillall'?tree.treeid querytrees: ,*? tree.treeid ] if[opcode _in `"+d" `"no-op" `randtree / must make sure you get all the database / in fact we always get an even number fillall'?tree.treeid / fill up pre and postorder and level / not clear it's all needed querytrees: ? tree.treeid ] mm: 0 outdumpmatch: () suf.treeids:: ` $ treeid'*:'suf.starts while[mm < (#querytrees) qtree: querytrees[mm] suf.diffs:: (#suf.starts) # 0 flag: containsvldc[qtree] if[~flag pair: producepaths[qtree] / pair[0], are labels / pair[1] etc are node ids / labelpaths: formlabnew'pair[0] labelpaths: pair[0] / JULY 2001: may want to sort these by length. / ??? I don't think I need to reorder pair[1] xcount: #:' labelpaths xii: > xcount labelpaths@: xii doneflag:: 0 / some trees still to look for x: findinsuf'labelpaths / find all the start nodes for labels / In the original methods, we are looking for start nodes in / the data tree and we are trying to see whether enough are / present in the data tree. / For Cole method, the label paths should lead to members / that correspond to a spine of the query tree, so the threshintersect / should check for that. / out: multiintersect[x] out: ? threshintersect[x; maxdist] if[dumpmatchflag xx: ,/ locateinsuf[;labelpaths]'out jjj: & 0 < #:'xx outdumpmatch,: (,/'xx[jjj]) _dv\: `"#" ] if[~ dumpmatchflag out: determineroot'$,/out ] ] if[flag / contains vldc dontcareroots:: () / to be used for dumpmatchflag / If root is a * then see whether subtrees match and if so, / then you have a match. If root is a ?, then start with children / and ask each child to give all matches (not just ones to root) / and then see if any combination are all below a single question / mark. / Can partition those all by level. / This means however that we can't have question mark over question / mark or if so, we have to adjust the levels. / But then what happens if there is a ? over a *. / Another strategy: if you see a vldc or ?, / just make that be a root partition. / Now it matches anything. / Then just proceed as usual. pair1: findpartitions[qtree] qroot: pair1[0] qnodes: pair1[1] tempout: () i: 0 if[dontcareflag = 1 i: 1 / means that root is either ? or *, so start at next / allow dontcares under this root node. / If we had them, then could handle however by putting / these same nodes elsewhere x: ,? (,//suf.starts) _dv `"#" tempout,: ,x / maybe just find where the trees below match and / then go. ] alllabelpaths: () / set of labelpaths from each root of / each non-don't care subtree while[i < #qroot pair: producepaths_partial[qtree; qroot[i]; qnodes[i],()] labelpaths: pair[0] / JULY 2001: may want to sort these by length. / ??? I don't think I need to reorder pair[1] xcount: #:' labelpaths xii: > xcount labelpaths@: xii alllabelpaths,: ,labelpaths doneflag:: 0 / some trees still to look for x: findinsuf'labelpaths / find all the start nodes for labels tempout,: ,? threshintersect[x; maxdist] i+: 1 ] pair2: findtreeandroots'tempout xx: checkformatch[qtree; qroot; pair2[;0]; pair2[;1]] out: xx[0] outroots: xx[1] out: ($out) ,' ("_") ,/: ($outroots) nakedout: ,:' ` $ out out: determineroot'out if[dumpmatchflag dontcarelocateinsuf:{[roots; alllabelpaths] xx: ,/ locateinsuf'[roots;alllabelpaths] jjj: & 0 < #:'xx out: ,/'xx[jjj] :out } / dontcareroots are all the roots matching portions of the / dont care query tree. But these are ones that count. / Each dontcareroots record is a sequence of roots of subtrees / from the same tree. if[dontcareflag / root of query is a dont care xx: ,/ dontcarelocateinsuf[;alllabelpaths]'1 _' dontcareroots xx,: nakedout ] if[~ dontcareflag / root of query is not a dont care xx: ,/ dontcarelocateinsuf[;alllabelpaths]'dontcareroots ] outdumpmatch,: xx _dv\: `"#" ] ] if[dumpqueryflag "dumpedquery" 0: dumptree[*?treelabel.treeid] ] / alldumps[0] / x if[&/(opcode = `"no-op"; 0 < #out) matched,: mm ] if[~ (opcode = `"no-op") text,: ,("Total time is "), ( $ _t - starttime), (" seconds.") if[opcode _in `"+b" `"+f" `randtree text,: ,("Database construction time is "), ( $ dbbuild - starttime), (" seconds.") ] if[opcode = `randtree text,: ,("Number of trees is "), ($numtrees) text,: ,("Size of trees is "), ($size) ] if[0 = #out text,: , ("There are no matches for the query tree: "), ($qtree) ] if[0 < #out text,: , ("Results for "), ($qtree), (" are: ") / text,: spit[out @ < out] text,: :[dumpmatchflag ,/spit'[outdumpmatch @ < outdumpmatch] spit[out @ < out]] ] ] mm+: 1 / to skip to next tree ] if[opcode = `"no-op"; :matched] } text: () if[opcode _in `randtree `"+f" doquery[] ] output 0: text / If running as a web server, comment out the following line: text / Joao, comment from here down. if[~ opcode = `"+d"; . "\\\\"] / now take care of +d case aax: 1: dbfilename suf.paths: aax.paths suf.starts: aax.starts xx: dbfilename, ("stats") aay: 1: xx treestats.treeid: aay.treeid treestats.nodeid: aay.nodeid treestats.preid: aay.preid treestats.postid: aay.postid treestats.level: aay.level treestats.parent: aay.parent treestats.label: aay.label doquery[] output 0: text / TIME TESTING HARNESS (REMOVE / if you want this) / .time.sum[] / xf: .TIME.f / xt: .TIME.t / xx: xf,'xt / zz: xx @