Algorithm for MicroArray Problem

Original Problem Description

(somewhat abbreviated) Problem Description:

Input:  a dna sequence (composed of nucleotides A, C, G & T) of a certain length, eg. TCGGATTAGCG.
Output:  a set of subsequences of the input (let's call them strands) that can be used to verify the original sequence.

How does verification work?
If an output strand occurs in the original sequence, a light shows in a given color. Where in the sequence the strand is and how long the strand is (position 3-6, position 8-15, etc.) is irrelevant...except:
 - if the original sequence starts with the given strand, the light shows in a special distinguishable color.
 - if a given strand occurs both at the start and somewhere else in the middle too, the light shows in another special distinguishable color.
Also allowed to factor in the knowledge of how many times each nucleotide occurs in the sequence (eg. 7 As, 5 Cs, etc.).

What's a good solution?
Lexicographic order on (length of longest strand, # strands used), lowest wins.

My algorithm runs in 2 stages:

What length should the strands be?

We want the length of the longest strand to be as small as possible. I figured if we're going to use 1 strand of length say 7, then they may as well all be of length 7. Having the others be of lower length doesn't make our solution any better (since the length of the longest strand is the same), and if they're shorter, we'd probably have to use more strands to cover the whole sequence, which would make our solution worse (since it would increase the # strands used).

That being said, what is this shortest length?

There are two trivial cases:
0 strands of length 0:  can be used to verify a sequence composed of one nucleotide only (eg. AAAAA). why? because we know the # of times each nucleotide occurs.
1 strand of length 1:  can be used to verify a sequence of the pattern CTTTTTTT. Specify the C, and the rest can be determined from the nucleotide counts.

OK, those were trivial. Where do we go from here? Start with strand length = 2, and keep incrementing by 1 as long as we can prove that strands of the current length cannot uniquely determine the original sequence.

So how do we prove this cannot uniquely determine 'ness?

Check 1

Find all subsequences of the original sequence of length 2*lenStrand-1. Exclude the one starting at the beginning of the original sequence (cuz this lights differently and hence has a uniquely determined position). Build a list of all pairs of these subsequences (1st & 2nd, 1st & 3rd, 1st & 4th,...,2nd & 3rd, 2nd & 4th, ...). Some of these pairs will have member subsequences that were overlapping in their positions in the original sequence, while most of them will not (the 3rd & the 4th probaly overlap, but the 3rd & the 19th probably dont).

For a pair that does not overlap in the original sequence, if the members of the pair are identical except for their central letters, we can switch these central letters to create an alternate sequence that will respond to strands of length lenStrand exactly the same way as the original:
    lenStrand = 4
    subsequence length = 2*lenStrand-1 = 7
    seq:   ...gttCaga...gttTaca...
    alt:   ...gttTaga...gttCaca...
For a pair that does overlap in the original sequence, if the members of the pair are identical except at the indices corresponding to each of their centers, we can switch the chars at positions in the sequence corresponding to their centers to create an alternate...:
    lenStrand = 3
    subsequence length = 2*lenStrand-1 = 5
    seq:                       C G A A A T A A A
    indices in original seq:   0 1 2 3 4 5 6 7 8
    x = subsequence # 2:           A A A T A
    y = subsequence # 3:             A A T A A
    center of x = idxMidX = 4
    center of y = idxMidY = 5
    check identical'ness
    1st char:  x(0) == seq(2) == A     y(0) == seq(3) == A   equal
    2nd char:  x(1) == seq(3) == A     y(1) == seq(4) == A   equal
    3rd char:  x(2) == seq(4) == seq(idxMidX)                allowed to be different
    4th char:  x(3) == seq(5) == seq(idxMidY)                allowed to be different
    5th char:  x(4) == seq(6) == A     y(4) == seq(7) == A   equal
    if we switch the chars at idxMidX & idxMidY, we get:
    alt:                       C G A A T A A A A
    this alt cannot be distinguished from the original seq by strands of length 3

Check 2

Find all subsequences of the original sequence of length lenStrand-1. Exclude the one starting at the beginning. These represent the maximal overlaps between strands of length lenStrand. eg. if lenStrand=3, maximal overlap is of length 2.

If any of these maximal overlaps occur three times:
    maximal overlap occuring 3 times: x
    seq:  AxBxCxD
    alt:  AxCxBxD
    x:    ac
    seq:  CTacTacGGacCG
    alt:  CTacGGacTacGG
At least one of B or C must be non-empty, otherwise the rearrangement would not have changed the sequence. If they were the same, the sequence would have already failed Check 1.

If there are two of these maximal overlaps that occur twice each and the positions they occur at are alternating:
    maximal overlaps occuring twice in alternating order: x & y
    seq:  AxByCxDyE
    alt:  AxDyCxByE
    x:    ac
    y:    gt
    seq:  AGacGTCgtCCAGATacTgtTAA
    alt:  AGacTgtCCAGATacGTCgtTAA
B and D must be different and C must be non-empty. If B and D were the same, the rearrangement would not have changed the sequence. If C was non-empty, the sequence would have already failed Check 1.

Check 3

Make sure the end of the sequence is determinable. To ensure this, the last unique strand must cover the last character in the sequence, or all characters after the last character of the last unique strand must be the same. Otherwise..
    chars at the end not covered by unique strand set: x & y
    seq:  ...xy
    alt:  ...yx
Everything earlier is the same, the rearranging x & y doesn't change the nucleotide counts.

If a candidate strand length passes all these checks, I think that if a verifier were fed every single subsequence of this length (which would mean maximal overlaps everywhere), there would be no way to rearrange the strands to find an alternate sequence that matched every input strand.

Try minimize the # of strands used

Now that we know what the minimum size for strands that can uniquely verify the sequence is, we want a minimal subset of these.

We could try every possible combination of these, but this is a search space of size O(maxStrands!) which is a bit too large. So instead, we try to drop strands selectively:

When we drop a strand, we still want to ensure that the region of the sequence it was covering is still in some way uniquely specified. I say in some way becuase it doesn't actually have to be unique. What we really need is, going rightwards from the start strand (whose position is uniquely determined because it lights in a special way), if at any point we tried to connect a strand in a way that doesn't match the original sequence, it wouldn't work because either
 - a change in overlaps would make the resulting alternate sequence too long
 - the nucleotide counts would be wrong

So I start with every possible strand of the length we're working with. Then I build this 3d matrix containing every possible overlap of every one of these strands with each other. Matrix dimensions are:
  maxStrands * maxStrands * maxOverlap
If non-consecutive strands do not overlap correctly for any of the possible maxOverlap char-length overlaps, that entry would be null. This matrix now contains a mix of real (in the original sequence) overlaps, and fake ones (which could be exploited to rearrange the strands to form an alternate sequence). We end up having tons of entries for single-char overlaps, and lower counts as the length of the overlap increases. Higher char-length yields more overlaps that are unique.

Now I look for minimal-length unique real overlaps of length less than maxOverlap. When I find one that's unique, all super-overlaps of that must be unique too:

Suppose lenStrand = 5 and we have a section of the original sequence ...AGCATTA... which yields strands AGCAT, GCATT, & CATTA.

Suppose the overlap of CAT b/w AGCAT & CATTA is unique. Then the super-overlaps GCAT & CATT must be unique too. (if there was another GCAT somewhere else, then there would be another CAT too, right?).

So this unique CAT tells us that going rightwards from the start, once we add AGCAT and we're looking for the next strand to add, if we wanted to do an overlap of 4 using the GCAT, there is only one choice - the real one. If we wanted to do an overlap of 3 using the CAT, there is also only one choice - the real one. We could do a fake overlap (trying to find an alternate sequence) of 2 or 1 (with some other strand starting with AT or just T), or even 0, but then the alternate sequence would turn out too long.

So...we have uniquely specified what needs to come after AGCAT.

And...since the connection is uniquely specified with just the CATTA, the GCATT is redundant!! So we can drop the GCATT, and our strand count has dropped by one.

Clear all GCATT's entries from the matrix (since it is no longer part of our strand set), and if we're lucky, the removal of these entries will make some more overlaps unique opening up more chances for eliminating redundant strands.

Repeat...until we can't find any more redundant strands.

What's left is our minimal subset.