/ Given a bunch of ESTs / Given a bunch of restriction enzymes each of which will cut / at a certain sequence. / Find where the cut is on each EST and k base pairs after. / Then collect all of those fragments (between the first and the / second cut). / We would like each pair of these to be different by at least k / where k is 2 or so. / k bestcut 20term.txt / ABOVE GETS ALL THE DATA / EST individualizer / Author of this code: Dennis Shasha, Mitch Levesque, Philip Benfey / 2002 / BASICS / A faster difference differ:{[x;y] i: y ?/: x j: & i = #y :?x[j] } ceil:{[n]; :[n = (_ n); :(_ n); :1 + (_ n)]} / Convert lower case letter to upper case one / Leave upper case letter alone. myupper:{[let] if[0 ~ #let :let] if[~ (_ic "Z") < (_ic let) :let] val: _ic let val: val - 32 _ci val} myupperword:{[word] :myupper'word } myupperwordscramble:{[word] word[1 + * 1 _draw ((#word)-1)]: "x" :myupper'word } / see whether it's possible that s1 fits within *s2* / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+1 pieces. / At least one of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest:{[s1;s2] y: mindif+1 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :0 < |/#:' s2 _ss/: strings / at least one matches } / see whether it's possible that s1 fits within *s2* / in an alternative way to prelimtest. / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+2 pieces. / At least two of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest2:{[s1;s2] y: mindif+2 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :1 < # & 0 < #:' s2 _ss/: strings / at least two match } / see whether it's possible that s1 fits within *s2* / in an alternative way to prelimtest and prelimtest2 / basic idea: if s1 fits within *s2* within mindif, / then divide s1 into mindif+3 pieces. / At least three of those pieces must match exactly, else / there are at least mindif +1 differences. prelimtest3:{[s1;s2] y: mindif+3 x: _ (#s1) % y cuts: x * !y strings: cuts _ s1 :2 < # & 0 < #:' s2 _ss/: strings / at least three match } / see whether there is a dist of mindif or less / for s1 vs. s2, no gaps / We use every trick we can. testdist:{[x;y] / if[mindif < _abs (#x) - (#y); :mindif+1] / if[~prelimtest[x;y]; :-1] / if[(mindif < 2) ; :newdist[x;y]] / if[(mindif < 2) & ((#x) > 8); :newdist[x;y]] res: {y(1+&)\(1_ x)&(-1_ x)-z}\[!1+#y;1+!#x;x=\:y] :res[#x][#y] } / this is for two strings that are the same length or different by 1. / If the strings are equal, we note this. / If they differ by one, we say so. / If they differ by more, we say at least 2. / In order to make this work for greater mindif, / need to generalize sidediffs to minus2zero etc. / Also equation for sides must change because sides that are 1 off / may get contributions from three elements. / Hacks: if sizes are too different, don't look. / if get to a point where diag and two sides are too big, then bail out. newdist:{[s1; s2] diff: (#s1) - (#s2) bigflag: ~ diff = 0 if[(_abs diff) > 1; :2] / beyond 1, so at least _abs diff / note that it could be more. That's why we give a lower bound / for 2 or greater. if[0 < diff x: s2 s2: s1 s1: x ] / so s2 is at least as big as s1 and bigger if bigflag is 1 if[bigflag straights: ~ s1 = s2[!#s1] minus1zero: ~ (s1) = (1 _ s2) / s1[i-1] = s2[i] zerominus1: ~ ((1) _ s1) = ((-2) _ s2) / s1[i] = s2[i-1] sidediffs: +(minus1zero; zerominus1,1) diag: ((#s1) + 1) # 0 sides: (((#s2) + 1);2) # 1 ] if[0 = bigflag straights: ~ s1 = s2 minus1zero: ~ ((-1) _ s1) = (1 _ s2) / s1[i-1] = s2[i] zerominus1: ~ ((1) _ s1) = ((-1) _ s2) / s1[i] = s2[i-1] sidediffs: +(minus1zero; zerominus1) diag: ((#s1) + 1) # 0 sides: (((#s1) + 1);2) # 1 ] j: 1 while[j < (#s1) diag[j]: &/((diag[j-1] + straights[j-1]), 1 + sides[j-1]) / sides[j;0]: &/(diag[j]+1; sides[j-1;0] + minus1zero[j-1]) / sides[j;1]: &/(diag[j]+1; sides[j-1;1] + zerominus1[j-1]) sides[j]: &/(diag[j]+1; sides[j-1] + sidediffs[j-1]) if[1 < &/ (diag[j], sides[j]); :2] j+:1 ] x:&/((diag[j-1] + straights[j-1]), 1 + sides[j-1]) / j = #s1 if[bigflag x: &/(x+1; sides[j-1;0] + minus1zero[j-1]) ] :x } / ABOVE DOES THE DISTANCE CALCULATION findfragments:{[cutpoint; slicelen; string] i: string _ss cutpoint if[0 = #i; :"NO TAG"] i: *| i len: i+slicelen :[len < #string :4 _ string[i + !slicelen] :4 _ string[i + !((#string)-i)]] } / check if distance is ok checkokold:{[mindif; fragments; labels] num: #fragments ii: !num pairs: ,/ (ii ,/:\: ii) pairs@: & pairs[;0] < pairs[;1] out: () i: 0 while[i < #pairs mydist: testdist[fragments[pairs[i;0]];fragments[pairs[i;1]]] if[~ mydist > mindif i0: pairs[i;0] i1: pairs[i;1] out,: ,(labels[i0];fragments[i0]; labels[i1]; fragments[i1]; mydist) ] i+: 1 ] :out } / check if distance is ok checkok:{[mindif; fragments; labels] num: #fragments ii: !num out: () while[1 < #ii first: *ii ii: 1 _ ii pairs: first ,/: ii i: 0 while[i < #pairs mydist: testdist[fragments[pairs[i;0]];fragments[pairs[i;1]]] if[~ mydist > mindif / mydist <= mindif so too close i0: pairs[i;0] i1: pairs[i;1] out,: ,(labels[i0];fragments[i0]; labels[i1]; fragments[i1]; mydist) ] i+: 1 ] ] :out } parse:{[line] if[(line[0] = ">") & (atbeginning = 1) currentlabel:: ` $line currentgene:: () atbeginning:: 0 ] if[(line[0] = ">") & (atbeginning = 0) if[0 < #currentgene tab.label,: currentlabel tab.gene,: ,currentgene ] currentgene:: () currentlabel:: ` $line ] if[~ line[0] = ">" currentgene,: line ] } spit:{[line] (-2) _ ,/ ($line) ,\: (", ")} fasta:{[lines] out: () i: 0 while[i < #lines if[0 < #$lines[i;1] out,: ,$lines[i;0] out,: ,$lines[i;1] ] i+: 1 ] :out } / DATA currentlabel: () currentgene: () tab.label: () tab.gene: () strings: ("ABCABABDEFGHIJKLMNOPQRS" "ABCXYZABABDEFTTTTGHIJKLMNOPQRS" "AXYZABABDEFTTTTGHIJKLMNOPQRS") / Mitch: the slicelen should be the entire restriction / site plus the effective payload. I cut off the first 4 / in any case (even in the GTAC case) because they will / all be the same. The payload is all that matters slicelen: 16 / 4 + 12 cutpoint: ("GTAC") slicelen: 21 / 4 + 17 cutpoint: ("CATG") mindif: 1 / if this close then trouble / EXECUTION outstats: () if[0 = #_i; ` 0: "You need to provide an input file\n"] a: 0: _i[0] / assume there is a file atbeginning: 1 x: parse'a tab.label,: currentlabel tab.gene,: ,currentgene counts: #:' tab.gene outstats,: ,("Total number of labels: "), ($#counts) file: ("tooshort"),(_i[0]) file,: cutpoint / Mitch: I'm labeling the file by the restriction site / MITCH: change the 100 and 99 to whatever you like file 0: $tab.label[& counts < 100] outstats,: ,("Number that are too short equals: "), ($#&counts < 100) ii: & counts > 99 if[0 = #ii; . "\\\\"] tab.label: tab.label[ii] tab.gene: tab.gene[ii] / for testing: / tab.label@: !20 / tab.gene: tab.gene[!10], tab.gene[!10] fragments: findfragments[cutpoint; slicelen]'tab.gene ii: fragments ~\: "NO TAG" outstats,: ,("Number with no restriction site: "), ($#&ii) / out: ,_i[0] / out,: spit'tab.label[&ii],' (` $' fragments[&ii]) out: $ tab.label[&ii] file: ("norestrictionsite"),(_i[0]) file,: cutpoint file 0: out goodlabel: tab.label[&~ii] / remaining good ones / ??? new stuff for mitch on 11/4/02 goodfragments: ` $ fragments[&~ii] xcount: #:'goodfragments xlab: ,/ xcount #' goodlabel xtmp: fasta[ xlab,',/goodfragments] / ??? end of new stuff for mitch file: ("hasrestrictionsite"),(_i[0]) file,: cutpoint file 0: $goodlabel / has some restriction site, even if collision out: ,_i[0] out,: ,("header1, tag, header2, tag, distance ") xx: checkok[mindif; fragments[&~ii]; tab.label[&~ii]] if[0 < #xx badlabel: xx[;0],xx[;2] outstats,: ,("Pairs with 0 distance: "), ($#&xx[;4] = 0) outstats,: ,("Pairs with 1 distance: "), ($#&xx[;4] = 1) if[0 < #&xx[;4]=2 outstats,: ,("Pairs with 2 distance: "), ($#&xx[;4] = 2) ] ii: & xx[;4] = 0 bad0: ? xx[ii;0], xx[ii;2] outstats,: ,("Number of labels knocked out because tags at distance 0: "), ($#bad0) ii: & xx[;4] = 1 bad1: ? xx[ii;0], xx[ii;2] bad1: differ[bad1; bad0] outstats,: ,("Number of labels knocked out because tags at distance 1: "), ($#bad1) ii: & xx[;4] = 2 bad2: ? xx[ii;0], xx[ii;2] bad2: differ[bad2; bad0,bad1] if[0 < #bad2 outstats,: ,("Number of labels knocked out because tags at distance 2: "), ($#bad2) ] goodlabel: differ[goodlabel;badlabel] outstats,: ,("Number with good labels: "), ($#goodlabel) ] out,: spit'xx file: ("tooclose"),(_i[0]) file,: cutpoint file 0: out file: ("stats"),(_i[0]) file,: cutpoint file 0: outstats file: ("goodlabel"), (_i[0]) file,: cutpoint file 0: $goodlabel file: ("goodtags"), (_i[0]) file,: cutpoint file 0: xtmp \\