Restriction Enzyme Game (Information Synthesis)

Dennis Shasha

Omniheurist Course

Computer Science



There is a DNA that comes to us from an agent in a certain, well, not so friendly, country. We have only the data I'm going to give you, not the sequence itself. This virus can kill and appears to be contagious, partly because of its small size. Our agent reports many deaths at the factory producing the virus. If released, it could cause untold more deaths. We know about two of its sequences. To analyze them, our agent used three restriction enzymes. He then separated them using agarose gel electrophoresis...

Let me start more slowly. It's just that there is so little time. You know that DNA can be viewed as a sequence of As, Cs, Ts, and Gs. It exists most commonly in the form of a double-helix, but because one side of the helix determines the order of the other side, we will pay attention only to one side and pretend we have just a single strand consisting of half of the helix. The sequence is read left to right from what is known as the 5' (read 5 prime) end towards the 3' end.

Certain kinds of proteins called restriction enzymes cut the sequence at particular points. In our case, we have three, which we'll call r1, r2, and r3. Restriction enzyme r1 will cut a sequence at CT and split C from T. Similarly, r2 will cut at AG and r3 will cut at TT. If there are three in a row, it will cut only once however. After one or more restriction enzymes perform their cuts, the ministrands that result can be ordered by size. When they are small enough, the ministrands can be sequenced directly. Given the sizes after each cut and the sequences of the final ministrands, we must figure out the sequence of the whole strand.

In particular:
if r3 sees TTT, it will cut this as T TT.
If r3 sees TTTT, it will cut this as T TT T.
If r3 sees TTTTT, it will cut this as T TT TT.

Further, there is the potential for an interaction between the enzyme for CT and the one for TT, so you should know that the three enzymes will cut:
CTT into C, T, T;
CTTT into C, T, TT;
CTTTT into C, T, TT, T;
CTTCT into C, T, TC, T; and
CTTTCT into C, T, TTC, T. Let me give you an example, suppose we have the sequence: CATTCTGTATA. If we cut it with r1 alone, we will split the sequence into CATTC and TGTATA; sizes 5 and 6. If we cut with r2 alone, no split will occur. If we cut with r3 alone, we will get CAT and TCTGTATA, of sizes 3 and 8. If we cut with both r1 and r3, we will get: CAT, TC, TGTATA.

Of course, if the agent's data told us the sequences and their orders, our problem would be over. Unfortunately, all we know is their sizes and the contents of the final arrangement. So, revisiting the example, imagine that I told you only the following: (i) splitting with r1 gave sizes 5, 6 (ii) splitting with r2 gives no split (iii) splitting with r3 gives sizes 3 and 8 (iv) splitting with r1, r2, and r3 gives three cuts (which I'm deliberately writing in order of size, because that's the order in which all these results are given): TC, CAT, TGTATA. What would you do?

To infer the order, note that none of our enzymes cuts at AT or AC or CC junctions. So TC can bind only to TGTATA (which is also consistent with the fact that r3 splits the sequence into lengths 3 and 8). Further TGTATA must be on the right because it cannot bind to anything. This gives us a single possible ordering CAT TC TGTATA.''

Here is the data about the smaller sequence. The restriction enzymes cut as stated before (r1 cuts at CT, r2 cuts at AG, and r3 cuts at TT.

What cuts: Sizes of cuts (in order of size)
r1 2 2 2 7 9 11 17
r2 2 4 4 11 29
r3 1 49
r1 r2 2 2 2 2 2 2 4 7 7 9 11
r1 r3 1 2 2 2 6 9 11 17
r2 r3 1 2 4 4 10 29

Finally, the last thing we know is the collection of sequences in order of size after being cut by all three restriction enzymes:

Our question to you now is what is the entire sequence in its natural order? If you can't be completely precise, we understand, but the more you can tell us, the easier it will be for us to find an anti-body.

What do the repeats, like the ones for TC, mean?

According to our agent, they mean that TC appears three separate times in the sequence. Generate a set of sizes and their sequences for each kind of restriction and each combination of restrictions. From this, figure out the entire sequence. Here is an intermediate result: 0: "T" 1-6: "TCCATC" 7-8: "TC" 9-10: "TA" 11-12: "GA" 13-19: "GTCCGTC" 20-28 "TCACACGGC" 29-30 "TC" 31-41 "TCGCACACGGA" 42-45 "GATA" 46: "GC" 48: "TC"


Your Task

There is no architecture team for this problem. You will be given information like what you see above and you are to try to determine the sequence. I will wait until class to give you that information.