If an output strand occurs in the original sequence, a light shows in a given color. Where in the sequence the strand is and how long the strand is (position 3-6, position 8-15, etc.) is irrelevant...except:

- if the original sequence starts with the given strand, the light shows in a special distinguishable color.

- if a given strand occurs both at the start and somewhere else in the middle too, the light shows in another special distinguishable color.

Also allowed to factor in the knowledge of how many times each nucleotide occurs in the sequence (eg. 7 As, 5 Cs, etc.).

Lexicographic order on (length of longest strand, # strands used), lowest wins.

My algorithm runs in 2 stages:

We want the length of the longest strand to be as small as possible. I figured if we're going to use 1 strand of length say 7, then they may as well all be of length 7. Having the others be of lower length doesn't make our solution any better (since the length of the longest strand is the same), and if they're shorter, we'd probably have to use more strands to cover the whole sequence, which would make our solution worse (since it would increase the # strands used).

That being said, what is this shortest length?

There are two trivial cases:

0 strands of length 0: can be used to verify a sequence composed of one nucleotide only (eg. AAAAA). why? because we know the # of times each nucleotide occurs.

1 strand of length 1: can be used to verify a sequence of the pattern CTTTTTTT. Specify the C, and the rest can be determined from the nucleotide counts.

OK, those were trivial. Where do we go from here? Start with strand length = 2, and keep incrementing by 1 as long as we can prove that strands of the current length

So how do we prove this

Find all subsequences of the original sequence of length 2*

For a pair that does not overlap in the original sequence, if the members of the pair are identical except for their central letters, we can switch these central letters to create an alternate sequence that will respond to strands of length

lenStrand = 4 subsequence length = 2*lenStrand-1 = 7 seq: ...gttFor a pair that does overlap in the original sequence, if the members of the pair are identical except at the indices corresponding toC aga...gttT aca... alt: ...gttT aga...gttC aca...

lenStrand = 3 subsequence length = 2*lenStrand-1 = 5 seq: C G A AA T A A A indices in original seq: 0 1 2 3 4 5 6 7 8 x = subsequence # 2: A A A T A y = subsequence # 3: A A T A A center of x = idxMidX = 4 center of y = idxMidY = 5 check identical'ness 1st char: x(0) == seq(2) == A y(0) == seq(3) == A equal 2nd char: x(1) == seq(3) == A y(1) == seq(4) == A equal 3rd char: x(2) == seq(4) == seq(idxMidX) allowed to be different 4th char: x(3) == seq(5) == seq(idxMidY) allowed to be different 5th char: x(4) == seq(6) == A y(4) == seq(7) == A equal if we switch the chars at idxMidX & idxMidY, we get: alt: C G A AT A A A A this alt cannot be distinguished from the original seq by strands of length 3

Find all subsequences of the original sequence of length

If any of these

maximal overlap occuring 3 times: x seq: AxAt least one of B or C must be non-empty, otherwise the rearrangement would not have changed the sequence. If they were the same, the sequence would have already failed Check 1.B xC xD alt: AxC xB xD eg: x: ac seq: CTacT acGG acCG alt: CTacGG acT acGG

If there are two of these

maximal overlaps occuring twice in alternating order: x & y seq: AxB and D must be different and C must be non-empty. If B and D were the same, the rearrangement would not have changed the sequence. If C was non-empty, the sequence would have already failed Check 1.B yCxD yE alt: AxD yCxB yE eg: x: ac y: gt seq: AGacGTC gtCCAGATacT gtTAA alt: AGacT gtCCAGATacGTC gtTAA

Make sure the end of the sequence is determinable. To ensure this, the last unique strand must cover the last character in the sequence, or all characters after the last character of the last unique strand must be the same. Otherwise..

chars at the end not covered by unique strand set: x & y seq: ...xy alt: ...yxEverything earlier is the same, the rearranging x & y doesn't change the nucleotide counts.

If a candidate strand length passes all these checks,

Now that we know what the minimum size for strands that can uniquely verify the sequence is, we want a minimal subset of these.

We could try every possible combination of these, but this is a search space of size

When we drop a strand, we still want to ensure that the region of the sequence it was covering is still

- a change in overlaps would make the resulting alternate sequence too long

- the nucleotide counts would be wrong

So I start with every possible strand of the length we're working with. Then I build this 3d matrix containing every possible overlap of every one of these strands with each other. Matrix dimensions are:

If non-consecutive strands do not overlap correctly for any of the possible

Now I look for minimal-length unique

Suppose

Suppose the overlap of

So this unique

So...we have uniquely specified what needs to come after AG

And...since the connection is uniquely specified with just the

Clear all G

Repeat...until we can't find any more redundant strands.

What's left is our minimal subset.