Microarray Design

Dennis Shasha

Omniheurist Course

Computer Science


 

Description

Given a set of strings, you want the closest matches of a certain length to distinguish one string from the other.

We are cloning, er, higher mammals on our floating laboratory in the South Pacific. The ship is called Noah's Arc, but you must allow a certain poetic license. We don't need two parents for each species, just the sex of interest. We give our mammals certain genetic advantages by inserting specific DNA sent to us by suppliers. We don't completely trust those suppliers, however, so we want to verify the DNA they send.

To do that, we will use a microarray. In each cell of the microarray, we will place short sequences of DNA, called strands. If we expose the supplied DNA S to the chip, any cell whose strand is a consecutive subsequence of S will light up; no others will light up. (Technically, we put the reverse complement of those strands on the microarrays, but the mathematics doesn't change, so let's forget that.)

For example, if the sequence S is AGGTCACGTGG, then a strand consisting of CACG will light up but one consisting of CTCG will not. Similarly, GG will light up but TT will not.

Also, GG will not light up with greater intensity just because it appears twice.

In general, if I give you a sequence, I would like to know what to put in the cells of my microarray. I would like the maximum length strand to be as short as possible and within that length, I'd like to use as few strands as possible.

Strands that match the leftmost end of the DNA sequence will light up with a special color, so you'll know which ones those are. Also, by independent means, we can tell you the individual totals of As, Cs, Ts, and Gs. Here are some questions to train your intuition.

What is the smallest number of strands needed for an arbitrarily long sequence?

Answer: Any sequence consisting entirely of a single nucleotide requires no strands at all.

Here's a harder one. What is the shortest sequence S you can imagine such that even if strands can be 6 long, you won't be able to verify S?

Answer: Suppose S were AAAAAATAAAAAA. With strands of length 6, there would be no way to distinguish between S and S' = AAAAAAATAAAAA. Both S and S' have the same number of As, the same number of Ts and no strand of length 6 can distinguish one from the other.

Even small strands can do a good job. For example, if S = ACGAC, then strands of length 2 are enough. Can you see how?

Answer: You place AC, CG, GA, CA, and CC in different cells. You know that there are two As, two Cs, one G, and zero Ts. You will see that AC, CG, and GA will light up but CA and CC will not. Further the special color tells you that AC is at the left end of the sequence. The next letter must be G, because CA, CC, and CT are all forbidden. Following G, there is one A and one C left. If GCA comprise the last three nucleotides, then there is no explanation for the fact that GA has lit up. So the last three must be GAC. This should be close to optimal.

Here is an example sequence we want to verify: TCACTCGGCTCTCGCACACGGAGATAGCTC. What is the smallest size and the smallest number of strands of that size that would verify this sequence?

Your program is to answer this question in general. I will generate a sequence for you of length 100 before class starts. The verifier will confirm that your solution gives a unique path or will find ambiguities. Ben will provide that verifier.