Microarray Design

Dennis Shasha

Omniheurist Course

Computer Science


 

Description

Given a set of strings, you want the closest matches of a certain length to distinguish one string from the other.

We are cloning, er, higher mammals on our floating laboratory in the South Pacific. The ship is called Noah's Arc, but you must allow a certain poetic license. We don't need two parents for each species, just the sex of interest. We give our mammals certain genetic advantages by inserting specific DNA sent to us by suppliers. We don't completely trust those suppliers, however, so we want to verify the DNA they send.

To do that, we will use a microarray. In each cell of the microarray, we will place short sequences of DNA, called strands. If we expose the supplied DNA S to the chip, any cell whose strand is a consecutive subsequence of S will light up; no others will light up. (Technically, we put the reverse complement of those strands on the microarrays, but the mathematics doesn't change, so let's forget that.)

For example, if the sequence S is AGGTCACGTGG, then a strand consisting of CACG will light up but one consisting of CTCG will not. Similarly, GG will light up but TT will not.

Also, GG will not light up with greater intensity just because it appears twice.

In general, if I give you a sequence, I would like to know what to put in the cells of my microarray. I would like the maximum length strand to be as short as possible and within that length, I'd like to use as few strands as possible.

Strands that match the leftmost end of the DNA sequence will light up with a special color, so you'll know which ones those are. If a strand matches both leftmost and an interior sequence, then it will light up in two colors. Also, by independent means, we can tell you the individual totals of As, Cs, Ts, and Gs. Here are some questions to train your intuition.

What is the smallest number of strands needed for an arbitrarily long sequence?

Answer: Any sequence consisting entirely of a single nucleotide requires no strands at all.

Here's a harder one. What is the shortest sequence S you can imagine such that even if strands can be 6 long, you won't be able to verify S?

Answer: Suppose S were AAAAAATAAAAAA. With strands of length 6, there would be no way to distinguish between S and S' = AAAAAAATAAAAA. Both S and S' have the same number of As, the same number of Ts and no strand of length 6 can distinguish one from the other.

Even small strands can do a good job. For example, if S = ACGAC, then strands of length 2 are enough. Can you see how?

Answer: You place AC and GA. The AC will display in two colors. The GA will display in one. You already know how many of each kind of nucleotide there are. So this distinguishes ACACG from ACGAC.

Until this year we wanted to verify strands. Here are some comments having to do with that problem. This year however, I will give you two strands having the same number of As, Ts, Cs, and Gs but that are slightly different. Your job is to design a set of small strands to distinguish them. The winner will have the smallest maximum sized strand and among players having that same maximum sized strand, the smallest number of strands of that size or less. So, lexicographic ordering based on (max strand size, number of strands).

I will generate two sequences for you of length 100 before class starts. You will generate strands that can distinguish them. The verifier will confirm that your solution gives a unique path or will find ambiguities. Your program is allowed to generate two candidate solutions one that you believe is correct and one that you know is not but you hope the verifier will think is correct. You will get the best of your verified scores. Here is a 49 nucleotide strand with its solution due to Kenji Yoshihira. Here is a strategy idea

The architecture group will provide that verifier. This is a heuristic job by itself.