Microarray DNA Selection Algorithm

Kenji Yoshihira's Algorithm

1. Find the first unique sequence starting with the leftmost nucleotide. The number of nucleotides in this sequence is the size of our "window". Let's call this n.
2. Slice the input string into strands of size 2 * n - 1, starting at every n-1 nucleotides.
For example:
• For strand 0, start at nucleotide 0 and end at nucleotide 2 * n - 1
• For strand 1, start at nucleotide n - 1 and end at nucleotide ( n - 1 ) + ( 2 * n - 1 ) = 3 * n - 2
• In general, for strand i, start at i * ( n - 1 ) and end at i * ( n - 1 ) + 2 * n - 1 = ( 2 + i ) * n - i - 1
3. Check if the strands are unique. If not increase the window size and go back to step 2.

Strand Properties

Strands lengths are almost always odd numbers. Because of this they contain a nucleotide which can be considered to be at the center, and an equal number of nucleotides to the left and right of center.

Example: A C G G T T C A G

Because we slice the original string into strands of size 2 * n - 1 at each n - 1 nucleotides, each strand has an overlap of n with the next strand. This means the ith strand has an overlap of n with the strand i-1, and an overlap of n with the strand i+1.

Example:

A C G G T T C A G
A C G G T T C A G G A A C
A C G G T T C A G G A A C T T A G

This overlapping property allows the strands to "snap" into place. This is, since each strand needs to be matched on both the left and on the right, so there is (hopefully) only one possible position in the original string where each strand can be placed.

Extension

Look for smaller strands that preserve the strand uniqueness property, as well as the overlapping/interlocking property. For example, we can change the above strands of length 9 into strands of length 7 and still preserve these properties.

A C G G T T C A G G A A C T T A G
A C G G T T C A G G A A C T T A G
A C G G T T C A G G A A C T T A G
A C G G T T C A G G A A C T T A G
A C G G T T C A G G A A C T T A G [...]

My algorithm: start considering strands with a window size of k, such that 2 * k - 1 > n. This ensures that we can uniquely identify the leftmost strand. If the strands aren't unique, or can verify more than the input string, increment k by 1 and try again. Stop when k = n.

The worst this method should do is Kenji's solution.

jason reisman (jasonr%nyu.edu)