/ Given a tags with their genes from two files, find the ones that are similar / k compare file1 file2 / ABOVE GETS ALL THE DATA / EST individualizer / Author of this code: Dennis Shasha, Mitch Levesque, Philip Benfey / 2002 / BASICS / A faster difference differ:{[x;y] i: y ?/: x j: & i = #y :?x[j] } ceil:{[n]; :[n = (_ n); :(_ n); :1 + (_ n)]} / Convert lower case letter to upper case one / Leave upper case letter alone. myupper:{[let] if[0 ~ #let :let] if[~ (_ic "Z") < (_ic let) :let] val: _ic let val: val - 32 _ci val} myupperword:{[word] :myupper'word } myupperwordscramble:{[word] word[1 + * 1 _draw ((#word)-1)]: "x" :myupper'word } / see whether it's possible that s1 fits within *s2* / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+1 pieces. / At least one of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest:{[s1;s2] y: mindif+1 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :0 < |/#:' s2 _ss/: strings / at least one matches } / see whether it's possible that s1 fits within *s2* / in an alternative way to prelimtest. / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+2 pieces. / At least two of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest2:{[s1;s2] y: mindif+2 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :1 < # & 0 < #:' s2 _ss/: strings / at least two match } / see whether it's possible that s1 fits within *s2* / in an alternative way to prelimtest and prelimtest2 / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+3 pieces. / At least three of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest3:{[s1;s2] y: mindif+3 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :2 < # & 0 < #:' s2 _ss/: strings / at least three match } / see whether there is a dist of mindif or less / for s1 vs. s2, no gaps / We use every trick we can. testdist:{[x;y] / if[mindif < _abs (#x) - (#y); :mindif+1] / if[~prelimtest[x;y]; :-1] / if[(mindif < 2) ; :newdist[x;y]] / if[(mindif < 2) & ((#x) > 8); :newdist[x;y]] res: {y(1+&)\(1_ x)&(-1_ x)-z}\[!1+#y;1+!#x;x=\:y] :res[#x][#y] } / this is for two strings that are the same length or different by 1. / If the strings are equal, we note this. / If they differ by one, we say so. / If they differ by more, we say at least 2. / In order to make this work for greater mindif, / need to generalize sidediffs to minus2zero etc. / Also equation for sides must change because sides that are 1 off / may get contributions from three elements. / Hacks: if sizes are too different, don't look. / if get to a point where diag and two sides are too big, then bail out. newdist:{[s1; s2] diff: (#s1) - (#s2) bigflag: ~ diff = 0 if[(_abs diff) > 1; :2] / beyond 1, so at least _abs diff / note that it could be more. That's why we give a lower bound / for 2 or greater. if[0 < diff x: s2 s2: s1 s1: x ] / so s2 is at least as big as s1 and bigger if bigflag is 1 if[bigflag straights: ~ s1 = s2[!#s1] minus1zero: ~ (s1) = (1 _ s2) / s1[i-1] = s2[i] zerominus1: ~ ((1) _ s1) = ((-2) _ s2) / s1[i] = s2[i-1] sidediffs: +(minus1zero; zerominus1,1) diag: ((#s1) + 1) # 0 sides: (((#s2) + 1);2) # 1 ] if[0 = bigflag straights: ~ s1 = s2 minus1zero: ~ ((-1) _ s1) = (1 _ s2) / s1[i-1] = s2[i] zerominus1: ~ ((1) _ s1) = ((-1) _ s2) / s1[i] = s2[i-1] sidediffs: +(minus1zero; zerominus1) diag: ((#s1) + 1) # 0 sides: (((#s1) + 1);2) # 1 ] j: 1 while[j < (#s1) diag[j]: &/((diag[j-1] + straights[j-1]), 1 + sides[j-1]) / sides[j;0]: &/(diag[j]+1; sides[j-1;0] + minus1zero[j-1]) / sides[j;1]: &/(diag[j]+1; sides[j-1;1] + zerominus1[j-1]) sides[j]: &/(diag[j]+1; sides[j-1] + sidediffs[j-1]) if[1 < &/ (diag[j], sides[j]); :2] j+:1 ] x:&/((diag[j-1] + straights[j-1]), 1 + sides[j-1]) / j = #s1 if[bigflag x: &/(x+1; sides[j-1;0] + minus1zero[j-1]) ] :x } / ABOVE DOES THE DISTANCE CALCULATION findfragments:{[cutpoint; slicelen; string] i: string _ss cutpoint if[0 = #i; :"NO TAG"] i: *| i len: i+slicelen :[len < #string :4 _ string[i + !slicelen] :4 _ string[i + !((#string)-i)]] } / check if distance is ok checkokold:{[mindif; fragments; labels] num: #fragments ii: !num pairs: ,/ (ii ,/:\: ii) pairs@: & pairs[;0] < pairs[;1] out: () i: 0 while[i < #pairs mydist: testdist[fragments[pairs[i;0]];fragments[pairs[i;1]]] if[~ mydist > mindif i0: pairs[i;0] i1: pairs[i;1] out,: ,(labels[i0];fragments[i0]; labels[i1]; fragments[i1]; mydist) ] i+: 1 ] :out } / check if distance is ok checkok:{[mindif; fragments; labels] num: #fragments ii: !num out: () while[1 < #ii first: *ii ii: 1 _ ii pairs: first ,/: ii i: 0 while[i < #pairs mydist: testdist[fragments[pairs[i;0]];fragments[pairs[i;1]]] if[~ mydist > mindif / mydist <= mindif so too close i0: pairs[i;0] i1: pairs[i;1] out,: ,(labels[i0];fragments[i0]; labels[i1]; fragments[i1]; mydist) ] i+: 1 ] ] :out } parse:{[line] if[(line[0] = ">") & (atbeginning = 1) currentlabel:: ` $line currentgene:: () atbeginning:: 0 ] if[(line[0] = ">") & (atbeginning = 0) if[0 < #currentgene tab.label,: currentlabel tab.gene,: ,currentgene ] currentgene:: () currentlabel:: ` $line ] if[~ line[0] = ">" currentgene,: line ] } spit:{[line] (-2) _ ,/ ($line) ,\: (", ")} fasta:{[lines] out: () i: 0 while[i < #lines if[0 < #$lines[i;1] out,: ,$lines[i;0] out,: ,$lines[i;1] ] i+: 1 ] :out } / DATA files: _i data1: 1: files[0] data2: 1: files[1] gene1: data1[;0] tag1: $ data1[;1] gene2: data2[;0] tag2: $ data2[;1] / EXECUTION outbad: () i: 0 while[i < #tag1 j: 0 flag: 1 while[flag & (j < #tag2) mydist: testdist[tag1[i]; tag2[j]] if[mydist < 3 outbad,: ,(gene1[i]; tag1[i]; gene2[j]; tag2[j]; mydist) flag: 0 ] j+: 1 ] i+: 1 ] file: ("compare"),(files[0]),(files[1]) file 0: spit'outbad