Algorithm for MicroArray Problem
Original Problem Description
(somewhat abbreviated) Problem Description:
Input: a dna sequence (composed of nucleotides A, C,
G & T) of a certain length, eg. TCGGATTAGCG.
Output: a set of subsequences of the input (let's call
them strands) that can be used to verify the original sequence.
How does verification work?
If an output strand occurs in the original sequence, a light shows in a
given color. Where in the sequence the strand is and how long the strand
is (position 3-6, position 8-15, etc.) is irrelevant...except:
- if the original sequence starts with the given strand, the
light shows in a special distinguishable color.
- if a given strand occurs both at the start and somewhere else
in the middle too, the light shows in another special distinguishable
Also allowed to factor in the knowledge of how many times each nucleotide
occurs in the sequence (eg. 7 As, 5 Cs, etc.).
What's a good solution?
Lexicographic order on (length of longest strand, # strands used), lowest
My algorithm runs in 2 stages:
What length should the strands be?
We want the length of the longest strand to be as small as possible. I
figured if we're going to use 1 strand of length say 7, then they may as
well all be of length 7. Having the others be of lower length doesn't
make our solution any better (since the length of the longest strand is
the same), and if they're shorter, we'd probably have to use more strands
to cover the whole sequence, which would make our solution worse (since
it would increase the # strands used).
That being said, what is this shortest length?
There are two trivial cases:
0 strands of length 0: can be used to verify a sequence composed
of one nucleotide only (eg. AAAAA). why? because we know the # of times
each nucleotide occurs.
1 strand of length 1: can be used to verify a sequence of the
pattern CTTTTTTT. Specify the C, and the rest can be determined from
the nucleotide counts.
OK, those were trivial. Where do we go from here? Start with strand
length = 2, and keep incrementing by 1 as long as we can prove that
strands of the current length cannot uniquely determine the
So how do we prove this cannot uniquely determine 'ness?
Find all subsequences of the original sequence of length
2*lenStrand-1. Exclude the one starting at the beginning of
the original sequence (cuz this lights differently and hence has
a uniquely determined position). Build a list of all pairs of these
subsequences (1st & 2nd, 1st & 3rd, 1st & 4th,...,2nd & 3rd, 2nd & 4th,
...). Some of these pairs will have member subsequences that
were overlapping in their positions in the original sequence, while
most of them will not (the 3rd & the 4th probaly overlap, but the 3rd
& the 19th probably dont).
For a pair that does not overlap in the original sequence, if the
members of the pair are identical except for their central letters,
we can switch these central letters to create an alternate sequence
that will respond to strands of length lenStrand exactly the
same way as the original:
lenStrand = 4
subsequence length = 2*lenStrand-1 = 7
For a pair that does overlap in the original sequence, if the members
of the pair are identical except at the indices corresponding to
each of their centers, we can switch the chars at positions in
the sequence corresponding to their centers to create an alternate...:
lenStrand = 3
subsequence length = 2*lenStrand-1 = 5
seq: C G A A A T A A A
indices in original seq: 0 1 2 3 4 5 6 7 8
x = subsequence # 2: A A A T A
y = subsequence # 3: A A T A A
center of x = idxMidX = 4
center of y = idxMidY = 5
1st char: x(0) == seq(2) == A y(0) == seq(3) == A equal
2nd char: x(1) == seq(3) == A y(1) == seq(4) == A equal
3rd char: x(2) == seq(4) == seq(idxMidX) allowed to be different
4th char: x(3) == seq(5) == seq(idxMidY) allowed to be different
5th char: x(4) == seq(6) == A y(4) == seq(7) == A equal
if we switch the chars at idxMidX & idxMidY, we get:
alt: C G A A T A A A A
this alt cannot be distinguished from the original seq by strands of length 3
Find all subsequences of the original sequence of length
lenStrand-1. Exclude the one starting at the beginning. These
represent the maximal overlaps between strands of length
lenStrand. eg. if lenStrand=3, maximal overlap is of
If any of these maximal overlaps occur three times:
maximal overlap occuring 3 times: x
At least one of B or C must be non-empty, otherwise the rearrangement
would not have changed the sequence. If they were the same, the sequence
would have already failed Check 1.
If there are two of these maximal overlaps that occur twice each
and the positions they occur at are alternating:
maximal overlaps occuring twice in alternating order: x & y
B and D must be different and C must be non-empty. If B and D were the
same, the rearrangement would not have changed the sequence. If C was
non-empty, the sequence would have already failed Check 1.
Make sure the end of the sequence is determinable. To ensure this, the
last unique strand must cover the last character in the sequence, or all
characters after the last character of the last unique strand must be the
chars at the end not covered by unique strand set: x & y
Everything earlier is the same, the rearranging x & y doesn't change
the nucleotide counts.
If a candidate strand length passes all these checks, I think that
if a verifier were fed every single subsequence of this length
(which would mean maximal overlaps everywhere), there would be no way to
rearrange the strands to find an alternate sequence that matched every
Try minimize the # of strands used
Now that we know what the minimum size for strands that can uniquely
verify the sequence is, we want a minimal subset of these.
We could try every possible combination of these, but this is a search
space of size O(maxStrands!) which is a bit too large. So
instead, we try to drop strands selectively:
When we drop a strand, we still want to ensure that the region of the
sequence it was covering is still in some way uniquely specified.
I say in some way becuase it doesn't actually have to be unique.
What we really need is, going rightwards from the start strand (whose
position is uniquely determined because it lights in a special way), if at
any point we tried to connect a strand in a way that doesn't match the
original sequence, it wouldn't work because either
- a change in overlaps would make the resulting alternate
sequence too long
- the nucleotide counts would be wrong
So I start with every possible strand of the length we're working with.
Then I build this 3d matrix containing every possible overlap of every one
of these strands with each other. Matrix dimensions are:
maxStrands * maxStrands * maxOverlap
If non-consecutive strands do not overlap correctly for any of the
possible maxOverlap char-length overlaps, that entry would be
null. This matrix now contains a mix of real (in the original
sequence) overlaps, and fake ones (which could be exploited to
rearrange the strands to form an alternate sequence). We end up having
tons of entries for single-char overlaps, and lower counts as the length
of the overlap increases. Higher char-length yields more overlaps that
Now I look for minimal-length unique real overlaps of length less
than maxOverlap. When I find one that's unique, all
super-overlaps of that must be unique too:
Suppose lenStrand = 5 and we have a section of the original sequence
...AGCATTA... which yields strands AGCAT, GCATT, & CATTA.
Suppose the overlap of CAT b/w AGCAT &
CATTA is unique. Then the super-overlaps
GCAT & CATT must be unique too. (if there was
another GCAT somewhere else, then there would be another
CAT too, right?).
So this unique CAT tells us that going rightwards from the
start, once we add AGCAT and we're looking for the next strand
to add, if we wanted to do an overlap of 4 using the GCAT,
there is only one choice - the real one. If we wanted to do an overlap of
3 using the CAT, there is also only one choice - the real one.
We could do a fake overlap (trying to find an alternate sequence)
of 2 or 1 (with some other strand starting with AT or just T), or even 0,
but then the alternate sequence would turn out too long.
So...we have uniquely specified what needs to come after AGCAT.
And...since the connection is uniquely specified with just the
CATTA, the GCATT is redundant!! So we can drop the
GCATT, and our strand count has dropped by one.
Clear all GCATT's entries from the matrix (since it is no
longer part of our strand set), and if we're lucky, the removal of these
entries will make some more overlaps unique opening up more chances for
eliminating redundant strands.
Repeat...until we can't find any more redundant strands.
What's left is our minimal subset.