### This page is to discourage someone from trying the same thing. :) Turns out that hoping that a single binding of the DNA strand on each side is insufficient. (Duh.. i don't know why i didn't see it.) I think there may be hope to recover though, b/c when this does produce an answer that works, it produces a really nice answer. (low strand count, and lots of short strands). However, this might just be luck.

```DNA MicroArray matching Step 1:   Find the "language" of the potential solution Make a window of size W=2.    Label:  WINDOW Slice up the reference DNA into window size chunks. Put these chunks in a Set.   (Sets don't allow duplicates) If the size of the Set < the number of possible windows of size W on the reference DNA,     add this Set to a master Set     W++;     goto WINDOW else move on This slices up the string ```into substring size pieces, adding them to the potential solution set.
It stops adding when the window size is so large that no more repetition is possible.  (sort of like
not being able to compress the string with any larger window)

CHANGE:    before starting, run down the string to find the longest repeated sequence.  The length of this +1 is the "window break"
size.    if in the process of increasing the windows size, we hit window break, then move on to step 2.
``` GCCCGTGCCA```
CA
TG
CG
CC
GT
GC
CGT
CCC
CCG
CCA
TGC
GTG
GCC

Step 2: Reduce the language set until a solution presents itself

Convert the language Set into an array of strands.
Sort the array of strands by length, then by the first index in which they appear in the list.

Now, make a counter matrix with the strands going down one size and the reference DNA on the other side.
Going down the matrix on the strand size; wherever the strand overlaps the reference DNA, place a 1.

```GCCCGTGCCA 111.......   GCC 11........   GC .11.......   CC .111......   CCC ..111.....   CCG ...111....   CGT ...11.....   CG ....111...   GTG ....11....   GT .....11...   TG .....111..   TGC .......111   CCA ........11   CA ```

``` Also keep another matrix which indicates the head and tail points of each substring GCCCGTGCCA 1.1.......   GCC 11........   GC .11.......   CC .1.1......   CCC ..1.1.....   CCG ...1.1....   CGT ...11.....   CG ....1.1...   GTG ....11....   GT .....11...   TG .....1.1..   TGC .......1.1   CCA ........11   CA ```

``` Now, go down the columns and sum up each column of the counter matrix.   replace the 1's in the column with the sum ```

```GCCCGTGCCA 244.......   GCC 24........   GC .44.......   CC .444......   CCC ..445.....   CCG ...455....   CGT ...45.....   CG ....553...   GTG ....55....   GT .....53...   TG .....532..   TGC .......222   CCA ........22   CA ```

Somewhere in this matrix is the answer to the problem,
we just have to keep chipping at the marble to get it out

Starting from the top:
Go across every row.   if the row does not contain a 1, that means that this strand's
contribution is duplicated by another strand further down the list.  This makes it a candidate for
removal.

In order to remove a candidate, we check the columns that it covers in the head/tail matrix.   if
it covers the head or the tail of another string, and if head or tail of this other string has a
count of 2, we do not remove the candidate.  This is because the candidate acts as a "binder" between
2 other strings  EX:
``` ```

```......2221.... ....2222......  Can't remove this string without breaking continuitiy. ..1122........```

If the candidate is not a "binder", then I remove it.  I do this by zero'ing out its line, and for
every column it covers, decrement every non-zero value in the column by 1.

When one hits the bottom of the list, this leads to a selection of strings which are the answer:  (I hope!)``` ```

```TCCATCTCTAGAGTCCGTCTCACACGGCTCTCGCACACGGAGATAGCTC ``````12...............................................   TC .21112...........................................   CCATC .....211112......................................   CTCTAG ..........2111112................................   GAGTCCG ................2111112..........................   GTCTCAC ......................2111112....................   CACGGCT ............................2111112..............   TCTCGCA ..................................2111112........   ACACGGA ........................................2112.....   AGAT ...........................................211111   TAGCTC ```

``` ```

```This doesn't always work.   Algorithim fails for the test string (i don't know why) CACCATTACGTCTAATTACCCCCTCGTGTTCCTAC ```12.................................   CA
.22................................   AC
..212..............................   CCA
....212............................   ATT
......212..........................   TAC
........22.........................   CG
.........21112.....................   GTCTA
.............212...................   AAT
...............21112...............   TTACC
...................21112...........   CCCCT
.......................21112.......   TCGTG
...........................21112...   GTTCC
...............................2111   CTAC
``` ```

``` ```

john lin (jjl364@nyu.edu)