# [FOM] NIM/Deep Learning

Harvey Friedman hmflogic at gmail.com
Sun Oct 15 15:48:10 EDT 2017

```My original posting on NIM and AI
http://www.cs.nyu.edu/pipermail/fom/2017-September/020612.html spurred
some other FOM postings, and my followup posting is at
http://www.cs.nyu.edu/pipermail/fom/2017-September/020619.html

Since then, I have been in correspondence about it with Scott Anderson
and a specialist in Deep Learning, John Schulman. John has kindly
given me permission to post his correspondence with me and Scott on
this topic.

FROM JOHN SCHULMAN
October 14, 2017
6:11PM

I think it'd be a fun project to try the latest algorithms on Nim and
see what happens. It'd serve as a good unit test for whether one has a
convergent algorithm or not.

My first thought was that it'd be really hard to define the inputs and
outputs in a way that are compatible with neural networks--variable
sized objects are always hard to deal with--but in this case there's a
neat solution.

You deal with the variable sized input (game state) by using a
recurrent neural network and feeding it one pile at a time, and you
deal with the variable sized output by defining your neural network as
a function that maps game state -> scalar score. This approach (called
using an "afterstate") was used by TD-gammon, the neural network
trained to play pro-level Backgammon back in the early 90s.

Here's pseudocode for the network that evaluates the board state:

x_i:  binary vector representing i-th pile

h_i:  real vector, network’s internal state

score: real scalar, score assigned to game state

Update, Score:  One or two layer neural nets

theta_run, theta_score:  parameters of neural nets

h_0 = Zeros(k)

h_1 = Update(h_0,x_0,theta_run)

…

h_n = Update(h_n-1,x_0,theta_run)

score = Score(h_n,theta_score)

Here, the Update needs to perform the XOR operation, and Score just
needs to sum up the bits to get the nim-sum. XORs aren't easy for
neural nets, so this might be hard to learn, but it should be doable
with enough data.

TD-Gammon, AlphaGo, and OpenAI's Dota bot all are trained using
self-play--you play the agent against itself or a randomly-selected
older version of itself, and then you update the agent as if the other
player is fixed.
This converges under some conditions which aren't satisfied in
practice, but empirically it works pretty well.

John

October 15, 2017
3:23AM

Thanks for getting this to me!

I had previously started a discussion of NIM/deep learning on the FOM
email list. I don't think the discussion got the attention of deep
people in deep learning (smile), but there are a lot of readers with a
casual interest and some awareness and some knowledge of deep learning

http://www.cs.nyu.edu/pipermail/fom/2017-September/020612.html

http://www.cs.nyu.edu/pipermail/fom/2017-September/020619.html

By the way, the second link here at the end has some NIM variants that
I have a feeling might be "humanly deep", and perhaps while humans try
to wrap their heads around them, or such things, we can run the
machines on them in tandem, and compare their progress. Man v. machine
on them as the years go by!

E.g., maybe what happens is that until the humans make the ultimate
breakthrough in these NIM variants and related games, the machines
crush us. After that, we crush them!

E.g., maybe the humans are completely clueless until the NIM variants
get small after a lot of plays, and maybe the same for the machines,
but the machines can crush the game by exhaustive search of the game
tree at sizes where the humans are clueless.

MORE FROM JOHN SCHULMAN
October 15, 2017
12:49PM

I did some preliminary experimentation: https://github.com/joschu/nim

The first question was whether one could train a neural net with
gradient descent to approximate the nim-sum: elementwise XOR of a
variable number of input bit vectors.

I tried using two different neural network structures, and both worked
straightforwardly, though one learned 5x faster than the other. The
last training run (bottom of the ipynb file) corresponds to up to 10
5-bit vectors.

Learning the optimal strategy purely with self-play (as in TD-Gammon
or the Dota bot) would be far less direct and would probably take a
huge number of samples, but I see no fundamental reason why it
wouldn't work, given a neural network that is able to approximate the
nim-sum.

I don't think it's that surprising that we can approximate the optimal
policies for Nim or its variants with neural nets. I think the
ultimate challenge would be to have a neural net that actually
searches for and verifies the mathematical rule--i.e., does some kind
of approximate Solomonoff induction. There have been some attempts to
do this kind of things (see "neural turing machines", "reinforcement
learning neural turing machines", etc) but there's a ways to go.

Scott--not sure if successful learning of Nim would be paper-worthy
for the machine learning community, unless there's something
nontrivial and general that goes into the solution. (Maybe the result
would be more interesting to other fields?) But at the very least, it
could be a fun blog post.

Harvey Friedman
```