Welcome to the Wordsnakes game web page! Invented by NYU professor Dennis
Shasha, this game has inspired his students to compete over who can come up
with the best Wordsnake ever… The funny part of the story is that you are not
really in a position to claim with 100% certainty that your Wordsnake is the
best possible! But what is a Wordsnake, after all? For the original web page
describing the game, look __here__.
On this web site, following a short description of the game, you can find an
algorithm for playing the game, its implementation in the language K, and
instructions on how to run it.

*Game Description*

The rules of the game are rather simple: each player is given the same set of words and is asked to arrange them in such a way as to achieve the maximum score, defined by the following rules.

- For each pair of successive words, the suffix of the first word is compared with the prefix of the word that follows, and the maximal overlap is obtained. The score for such a pair of words equals the size of their maximal overlap squared. For example, consider a two-word sequence: "street", "test". In this order, their maximal overlap is of size one (just the letter "t"). Thus, their score is 1. However, if the order is reversed ("test", "street"), the maximal overlap is of size two (suffix-prefix "st"), which yields a score of 4.

- The score of a sequence of words (a "wordsnake") equals the sum of the partial scores of all pairs of successive words. For example, the sequence "run", "nausea", "seasick" receives a score of 1 for the first and second words, plus 9 for the second and third words, yielding a total score of 10.

- The suffix-prefix matching is performed in a case-insensitive manner, which implies that small and capital letters are considered indistinguishable.
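
The scoring rules above can be sketched in a few lines of Python. (The actual implementation on this page is in K; this is only an illustrative translation, with function names of my own choosing.)

```python
def overlap(a, b):
    """Size of the largest suffix of a that equals a prefix of b."""
    a, b = a.lower(), b.lower()            # matching is case-insensitive
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def snake_score(words):
    """Sum of squared overlaps over all successive pairs in the sequence."""
    return sum(overlap(x, y) ** 2 for x, y in zip(words, words[1:]))

snake_score(["street", "test"])            # 1  (just "t")
snake_score(["test", "street"])            # 4  ("st")
snake_score(["run", "nausea", "seasick"])  # 1 + 9 = 10
```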

*Algorithm*

The technique proposed here for finding the maximum-score word sequence
splits the problem of arranging the words into sub-problems. The input of
sub-problem i is a list of words e_{i} = [w_{j}^{i}], where 1 ≤ j ≤ n_{i}.
The output is a modified word list, which serves as the input to the next
sub-problem, and the current score, initially set to 0.

In each sub-problem, all possible two-word sequences, described as vectors
__v___{jk} = (w_{j}, w_{k}), j ≠ k, are scored according to their overlap. Those that yield
the maximum score are stored in a set S_{i}. We construct a graph G(V,E)
such that for each v_{jk} ∈ S_{i}, vertices j and k are included in the set of
vertices V, and a directed edge (j,k) is added to the set of edges E. The goal
is to select as many edges of this graph as possible. Equivalently, we are
trying to include in our final sequence as many two-word sequences of maximal
score as possible.
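
The construction of the edge set S_{i} might look like the following Python sketch (again only an illustration, not the K code; `max_score_edges` is a name of my own):

```python
from itertools import permutations

def max_score_edges(words):
    """Return the maximum pair score and the edges (j, k), j != k,
    of the two-word sequences that attain it."""
    def ov(a, b):                          # size of the maximal overlap
        a, b = a.lower(), b.lower()
        return max((n for n in range(1, min(len(a), len(b)) + 1)
                    if a[-n:] == b[:n]), default=0)
    scores = {(j, k): ov(words[j], words[k]) ** 2
              for j, k in permutations(range(len(words)), 2)}
    best = max(scores.values())
    return best, [e for e, s in scores.items() if s == best]
```

The returned edges play the role of S_{i}; their endpoints are the vertices of G, and each edge (j,k) stands for the two-word sequence (w_{j}, w_{k}).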

We have to note here that adopting such an approach won’t guarantee that the
final sequence, i.e. the result of solving all sub-problems, will have the
maximum score. Maximizing the score of each sub-problem does not necessarily
yield a maximal solution, because including a particular two-word sequence in
the solution of sub-problem i reduces the options left for sub-problem i-1.
Consider the following two-word sequence: "emerge", "merger". Their score is 25
(overlap of size 5). Including this sequence renders "emerge" unusable as the
first word of another two-word sequence (it is now bound to "merger"), and
similarly, "merger" cannot be used as a second word again. Therefore, when we
move to the next sub-problem, where we will be trying to include pairs of
overlap 4, we will have lost one candidate pair whose first word is "emerge" and
whose second one starts with "erge". Also, we might miss a pair whose second
word is "merger" and whose first one ends with "merg". Each of these pairs would
yield a score of 16, for a total of 32, which is better than 25. However, it
does not follow that in such a case we should necessarily opt for the two
pairs of score 16 each instead of the one pair of score 25. After all, we get
the score of 25 by using just two words and producing one new one (so one word
fewer), whereas the 32 is the result of using four words to produce two new
ones (two words fewer). Clearly, having more words in principle leaves us with
more options. Also, for
overlaps less than or equal to 3, it is always advantageous to include as many
of the maximum score pairs as possible. To see that, consider the dilemma of
including or not an edge of overlap i (and of score i^{2}). If we choose not to include it, i.e., choose to keep
those two words forming the edge apart, then the maximum gain we can earn by
attaching these words to other words (in the subsequent step of the algorithm)
is 2(i-1)^{2}. It can be easily verified that if i ≤ 3, then 2(i-1)^{2} <
i^{2}. From the short discussion above, the proposed greedy-like heuristic
seems to be a good choice.
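
The arithmetic behind this cut-off is easy to check numerically (a throwaway Python snippet, not part of the K program):

```python
# Keeping an overlap-i pair scores i**2; breaking it apart can earn at
# most 2*(i-1)**2 from two separate overlap-(i-1) attachments.
def keep_wins(i):
    return i ** 2 > 2 * (i - 1) ** 2

[i for i in range(1, 7) if keep_wins(i)]   # [1, 2, 3]
```

For i = 4 the comparison is 16 versus 18, so from overlap 4 upward splitting a pair can, in principle, pay off.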

However, there are certain restrictions on the way edges may be selected, due
to the nature of the original problem. Recall that each word may be used only
once (since you may just rearrange them). This implies that when an edge (j,k)
representing the two-word sequence (w_{j},w_{k}) is selected, it
prevents the selection of all other edges originating at vertex j and all those
that terminate at vertex k.

Each sub-problem is NP-hard. If graph G has a Hamiltonian Path (a path that visits each vertex exactly once), then that path is a solution to our sub-problem containing the maximum possible number of edges (|V|-1). Conversely, a cycle-free solution that selects |V|-1 edges must form a Hamiltonian Path. However, the problem of deciding whether a graph has a Hamiltonian Path is NP-complete.

The way we tackle our problem is the following. First, we include in our solution all those edges that cause no conflicts, that is, edges whose inclusion in the solution does not imply the exclusion of any other edge. These are the edges e = (j,k) such that vertex j has no outgoing edges other than e itself, and vertex k has no incoming edges other than e, of course. Then, we choose an edge at random, remove all conflicting edges, and repeat the process until there are no edges left. Of course, we could apply heuristics, such as choosing each time the edge that causes the minimum number of conflicts; this is left for a future implementation. The idea behind the current implementation is to avoid heuristics other than the basic one that led to the creation of the sub-problems in the first place. Instead, the size of the problem is reduced as much as possible, for example by including all non-conflicting edges as explained before, and the algorithm is then randomized, in the expectation that after a few iterations it will hit the maximum.
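
A rough Python rendering of this randomized selection (an illustration of the idea only, with names of my own):

```python
import random

def select_edges(edges, rng=random):
    """Randomized greedy selection: picking (j, k) discards every other
    edge that leaves vertex j or enters vertex k."""
    pool, chosen = list(edges), []
    while pool:
        j, k = rng.choice(pool)            # pick one edge at random
        chosen.append((j, k))
        # keep only the edges that do not conflict with (j, k)
        pool = [(a, b) for a, b in pool if a != j and b != k]
    return chosen
```

Note that conflict-free edges need no special casing here: they survive every removal and are eventually picked, so the separate first pass described above is purely a size-reduction optimization. Re-running the selection with different random choices corresponds to the iterations of the randomized algorithm.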

After a solution for graph sub-problem i is found, each pair of words represented by a selected edge of graph G is considered a single word from this point on, having the first word of the pair as its prefix and the second one as its suffix. The words represented by the pairs are removed from the word list, the new words are appended to it, and the new word list is passed on to the next sub-problem, i-1. Finally, the maximum score times the number of selected edges is added to the current score.
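
Because selected edges can chain (the rule forbids two edges leaving j or two entering k, but (a,b) and (b,c) may coexist), a sketch of this merging step has to walk whole chains. Assuming the selection is cycle-free, it might look like this in Python (again a hypothetical illustration, not the K code):

```python
def merge_chains(words, chosen, d):
    """Collapse each chain of selected edges into one word, dropping the
    shared overlap of size d at every junction; other words pass through."""
    nxt = dict(chosen)                         # successor along a chain
    heads = set(range(len(words))) - {k for _, k in chosen}
    merged = []
    for i in sorted(heads):
        w, j = words[i], i
        while j in nxt:                        # glue the rest of the chain
            j = nxt[j]
            w += words[j][d:]
        merged.append(w)
    return merged

merge_chains(["emerge", "merger", "run"], [(0, 1)], 5)  # ['emerger', 'run']
```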

All permutations of the word list produced by the last sub-problem yield the
same score. These sequences are obtained from the word list after all partial
sub-sequences of words created by each sub-problem (potentially nested) are
broken up to form the original words.

*Implementation*

The algorithm was implemented in the prototyping language K, which can be downloaded from the Kx Systems web site. My code is stored here as snake.k. If you have installed K on Windows, open a command window and type:

k snake word-file-name

K is an interpreted language, so typing the above line at the Windows command prompt will invoke the K interpreter and load the program. To run it, simply type:

run[x]

at the K prompt, where x is the number of iterations you want the program to perform (remember, the algorithm is randomized). Each time a better score is found, it is printed on the screen, and when the program terminates, it prints out the final wordsnake.

The input to the program (word-file-name) is simply a file containing a set of words (the words have to be in lower-case letters), one per line. Here are some sample word files:

If you have any problem email me at tsirigos@cs.nyu.edu.