MICROARRAY PROBLEM

Source Code: microarray.cpp

Basic Idea:

In this problem, the following terms are used:

X: The original DNA sequence

A={A1,A2,...,An}: Test arrays

Y: Another DNA sequence shares same test results using A.

POSITIVE TEST: If array Ai appears in X, then Ai is a positive test.

NEGATIVE TEST: If array Ai is not a subsequence of X, then Ai is a negative test.

* NEGATIVE TEST is important!

Consider the following sequence:

ACACACACACACACACACACACAC...AC (Length=k)

If only positive test were done, the minimum length required will be k-3 with

A={ACACAC...AC (Length=k-3), CACACA...CA (Length=k-3)}

However, if negative test was done, then minimum length required will be 2

A={AC, AA, CC} suffices!

¡@

My Algorithm:

So what I do is:

(1) Choose a length K, start by a big one (say, 10) then decrease it.

(2) Generate all POSITIVE test strand, that is, X[1..k], X[2..k+1], ..., X[L-k+1..L]

(3) Randomly swap them and find a "wordsnake" via Hungarian Algorithm, this is an approximation of "Minimum Superstring".

(4) Use "Brute Force search" to expand the string if necessary, apply NEGATIVE test to distinguish with X.

Step 4 could be derived as following:

Assume that the generated wordsnake is W = AC....GTCA, and there are 1A, 1C, 1G, 1T left.

If we put A after W, then possible NEGATIVE test will be AA, CAA, TCAA, GTCAA, if any of them does not appear in X, we shouldn't put A there. However, if all of them were substrings of X, then we could try to put A there and search down to next character.

(5) If Y is found for some K, that is, all POSITIVE and NEGATIVE test could not distinguish it with X, find all this kind of Y and define those sequences as "Dangerous" one.

(6) Went back to K+1, indicate the key substring that could distinguish X with dangerous sequences, these string must be included in final arrays.

(7) Then start from the head of DNA sequence, insert current substring into final arrays, and try to match this with all other length K+1 substrings. If the maximum match is t, then the next substring will start with an overhead of t+1 with current substring. Iterate this through the end of DNA.

(8) Check this set again and see if any NEGATIVE test needed.

NOTE: What I guarentee is if your k is smaller than mine, then Y could be found by my program!

That is, the dangerous string.

Probelm Left:

(1) Is the verifier of this game an NP-complete? I guess it could be somewhat reduced to SAT problem.

(2) If all Y which could pass all POSITIVE tests could be found, then I have a way to find the minimum set of strands. Is there any brute force or more elegant algorithm to find all Y? This problem is one layer harder than (1) in Alternating Turing Machine model since the above one suffices to find any Y if Y exists, but now we got to find all of them.

Chien-I Liao