A short comment on AlphaCode

Ernest Davis
February 2022.

There is no question that AlphaCode (Li et al. 2022) is an impressive accomplishment, indeed an astonishing one. The English specifications given as input are long and complex. The code that AlphaCode produces is often 20 lines long, intricate, clever, and by no means cookie-cutter. The relation between the English specification and the characteristics of the code is quite indirect.

But before jumping to the conclusion that "DeepMind says its new AI coding engine is as good as an average human programmer," as a headline in the popular science press inevitably put it (Vincent, 2022), a couple of points should be kept in mind. (It should be noted that DeepMind itself made no such claim.)

AlphaCode generates an enormous number — in different tests sometimes 50,000, sometimes 100,000, sometimes a million — of candidate programs from the English text. It then uses the sample inputs and outputs provided with the problem specification to filter out almost all of these as incorrect. There is a substantial component of monkeys typing Hamlet going on here; AlphaCode has succeeded in training the monkeys to a remarkable degree, but it still needs a lot of them. It then submits 10 candidates and considers it a success if one of those is correct. There is nothing inherently unfair about such an approach, but it has two consequences.
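
To make the pipeline concrete, here is a minimal, self-contained toy sketch of the generate-then-filter idea. The candidate generator below is a stand-in for a language model, and none of the names or details come from AlphaCode itself.

    # Toy sketch of generate-then-filter (illustration only, not AlphaCode's code).
    import random

    def sample_program(rng):
        # Stand-in for a language-model sample: a random one-line function.
        body = rng.choice(["x + 1", "x * 2", "x * x", "x - 1"])
        return f"def f(x): return {body}"

    def run(program, x):
        # Execute one candidate on one input and return its output.
        namespace = {}
        exec(program, namespace)
        return namespace["f"](x)

    def solve(example_tests, n_samples=1000, n_submit=10, seed=0):
        rng = random.Random(seed)
        survivors = []
        for _ in range(n_samples):
            candidate = sample_program(rng)
            # Keep only candidates that reproduce the provided example outputs.
            if all(run(candidate, inp) == out for inp, out in example_tests):
                survivors.append(candidate)
        return survivors[:n_submit]  # success if any submitted candidate is right

    # The example tests pin down "double the input"; most random candidates fail.
    print(solve([(1, 2), (3, 6)]))

Even in this toy setting, the filter is doing the heavy lifting: without the example input-output pairs there would be no way to discard the wrong candidates.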

The first consequence is that it is reasonable to expect that the number of samples required increases exponentially with the length of the program, though one cannot be sure of that until one tests it experimentally. AlphaCode requires 1 million samples to get 34% correct on 20-line programs; to produce a 200-line program — the length of a standard assignment in a second-year computer science class — it might well require 10^60 samples.
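
The 10^60 figure is what a back-of-the-envelope extrapolation gives under that (untested) exponential assumption; the calculation below is only a sketch using the quoted numbers, not anything from the AlphaCode paper.

    # Back-of-the-envelope extrapolation, assuming (untested) that the number
    # of samples needed grows exponentially with program length.
    samples_for_20_lines = 10**6            # budget reported for 20-line programs
    length_ratio = 200 / 20                 # a 200-line assignment is about 10x longer
    samples_for_200_lines = samples_for_20_lines ** length_ratio
    print(f"{samples_for_200_lines:.0e}")   # prints 1e+60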

The second consequence is that the success of AlphaCode is completely dependent on having specific inputs and outputs provided which it can use for filtering.* Now, there is no question that having inputs and outputs provided is enormously useful for the human contestants in programming competitions. It helps the contestants understand the problem, it cuts down on debates about the meaning of specifications, and it gives convenient test examples. Nonetheless, if they were not provided, human programmers could in most cases succeed, with a little more work. By contrast, AlphaCode would be completely at a loss without the specific examples provided; the success rate would drop by a factor of about 100.

It is also worth noting that AlphaCode is by no means purely tabula rasa, end-to-end learning. The code that generates the one million candidates from the English text is more or less generic transformer technology, but the code that filters and selects the candidates is hand-coded, with little or no machine learning involved. There is also some hand-coding involved in preprocessing the inputs; for example, randomized metadata is added to increase the diversity of the solutions generated. This is no strike against AlphaCode, but it is worth bearing in mind in discussions about the sufficiency of unassisted machine learning for AI.

Afterthought added 3/19/2022: All the problems quoted in the AlphaCode article, and all the ones that I looked at in the training corpus, follow a very specific form for multiple tests in the input: the first line of input is the number of tests, and the tests themselves are then presented in sequence in the remaining lines. For a human programming contest, that makes sense: you want to standardize that kind of thing, because that is not where the challenge lies, and varying it just slows down the programmers without actually challenging them. But if AlphaCode has been fine-tuned on that particular form of input and never tested on any other form, then it is not at all certain that AlphaCode would survive a change in that overall structure; it might well have been effectively hard-coded in, and changing it might break the system. And there may be other, more subtle conventions in setting up programming contest problems that AlphaCode is relying on. There is no way to know until you experiment.
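
For concreteness, the convention looks like the following; the sample input and the parsing skeleton below are a generic illustration of this Codeforces-style format, not material taken from AlphaCode's paper or training data.

    # The standard multi-test input convention: the first line gives the
    # number of test cases, and each subsequent line holds one test case.
    #
    # Example input:
    #   3
    #   1 2
    #   5 7
    #   10 20
    import sys

    def main():
        lines = sys.stdin.read().splitlines()
        t = int(lines[0])                  # first line: number of test cases
        for i in range(1, t + 1):
            a, b = map(int, lines[i].split())
            print(a + b)                   # placeholder task: print the sum

    if __name__ == "__main__":
        main()

Even a small change — say, dropping the leading count line and reading test cases until end of file — would require a different parsing loop; the open question is whether AlphaCode's generated solutions would adapt to that kind of variation.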

* In cases where there were multiple possible outputs, the filtering compared the output of an AlphaCode candidate not only to the answer provided but also to the outputs of successful submissions. Thanks to Yujia Li and David Choi for clarification on this point.

References

Li et al., 2022. Competition-Level Code Generation with AlphaCode.

Vincent, J., 2022. DeepMind says its new AI coding engine is as good as an average human programmer. The Verge.