Annals of Reviewer 2

An ongoing, occasional series. Details are changed to protect confidentiality for unpublished papers. Published papers are fair game.

December 2021
A new paper on arXiv claims "We solve MIT’s Linear Algebra 18.06 course and Columbia University’s Computational Linear Algebra COMS3251 courses with perfect accuracy by interactive program synthesis." "Our results open the door for solving other STEM courses, which has the potential to disrupt higher education by: (i) automatically learning all university level STEM courses, (ii) automatically grading course, and (iii) rapidly generating new course content."

In fact, of the 60 examples of textbook problems listed in the appendix, 8 are simply wrong, and in 6 most of the work is being done by the human in the loop rather than by Codex. Plus, the only problems considered are those involving computing this or that rather than questions that ask you to explain or prove something. Some of the successes are actually impressive, but most are just pattern-matching phrases in the prompt with calls to functions in NumPy.

November 2021
I'm reviewing a journal paper by PhD student, PhD student, PhD student, and adviser at a fine university. It contains a dozen or so formulas like

P(x) = ∃ Q(x) ^ W(x,y) ∀ (y i∈ S).
a = x | max(f(x,y) ∀ y ∈ S)
and similar horrors for set construction and for summation over a set. There are no well-formed formulas that use a quantifier or min/max anywhere in the paper.
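
For contrast, here is a sketch of what well-formed counterparts might look like. The intended meaning cannot be recovered from the notation above, so both readings below are guesses, and the use of arg max in the second is an assumption rather than anything stated in the paper:

```latex
% Hypothetical reading 1: P holds of x when Q(x) holds
% and W(x,y) holds for every y in S.
\[
P(x) \;\equiv\; Q(x) \land \forall y \in S \;\, W(x,y)
\]
% Hypothetical reading 2: a is a value of x that maximizes
% the largest value of f(x,y) over y in S.
\[
a = \arg\max_{x} \, \max_{y \in S} f(x,y)
\]
```

In both cases every quantifier and every max binds an explicit variable and has an explicit scope, which is exactly what the originals lack.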

This is not actually the worst thing about the paper. The worst thing is that it addresses a useless problem using a method that does not generalize past a very specific toy microworld. But that kind of flaw is excusable. This pure incompetence is not.

Maybe this is just a "grass is greener" viewpoint, but it's not my impression that doctoral students in physics, chemistry, math, economics or other areas of computer science submit papers with simple formulas that are obviously ill-formed. My impression is that it's unique to symbolic AI. Incidentally, the adviser certainly knows better; they just didn't bother reading the paper.

I don't make great claims for my abilities as a pedagogue, but at least the students in my AI classes (undergrad and graduate) come out aware that, if they write a formula like the first one above, they will be executed by firing squad at dawn. (I don't specifically teach the correct form for "max".)

November 2021
I am reviewing a grant proposal. The PI reproduces a figure from one of their papers with no explanation in the text. I cannot make sense of the label on the x-axis, the label on the y-axis, or the various acronyms that label the six curves shown. I do see clearly, though, that something goes down as a function of something else, along six different possible paths depending on some other thing, with confidence intervals that are pretty broad but do not wipe out the trend. I presume that that's a good thing. Unfortunately, I obviously can't post the figure.

September 2021
Reviewer 2 is reviewing a paper by a fine scientist and their student. It is both technically incompetent and poorly written. He presumes that the scientist didn't even bother to look at it. He spent a couple of hours of his not entirely valueless time going through it and writing up a 1000-word review explaining what's wrong with the thing. He feels that he and the other reviewers are being asked to do, as reviewers, work that the scientist should have done as an adviser.

June 2021
Automated commonsense reasoning in 2021:

< rant > I'm reviewing a paper on the topic of "Commonsense Q/A". They explain at length how their system can answer the following question that they claim is from the WIQA dataset (I haven't checked): "Suppose during boiling point happens, how will it affect more evaporation?" (Possible answers: "more", "less" or "no effect".) This is the example they chose; I didn't cherry-pick it.

The new technique that they're using (combined with random BERT-based slop) is exactly the spreading activation model proposed by Ross Quillian (1968). Needless to say, they don't cite Quillian. (Incidentally, they have 30 lines of pseudo-code explaining bidirectional breadth-first search, because readers who know every detail of BERT, RoBERTa and so on have never heard of it.) The network is, God help us, ConceptNet. Or rather, as in at least two other papers I've reviewed in the last year, it's a stripped-down version of ConceptNet, which entirely ignores the labels on the arrows because, of course, binary relations are much too hard to reason about.

It's nonsense piled on garbage tested on gibberish.

But of course it has its bold-faced results table showing that, on this challenging, well-established dataset and some others (on which MTurkers achieved 96%!) it outshines all the previous systems. _And_ an ablation study. How could I possibly justify denying it its day in the sun? < /rant >

It's possible, certainly, that the point about the arcs in ConceptNet cuts the other way: that the labels on the arcs are so vague and ad hoc that they are, in fact, useless; the only value of ConceptNet is in the nodes and the unlabelled arcs.

February 2021
Figure in a paper I'm reviewing, with the caption, "Since B is bigger than A, we can conclude ..."

My review: "If you're going to draw confidence interval bars, you should pay some attention to them."

October 2020
I'm reviewing a grant proposal for a good-sized amount of money, which (parts I-V.D) wants to do theoretical math in logic and topology, and (part V.E) will use all this math in order to do medical image analysis on pictures of tumors. They mention that the techniques they've developed have already achieved state-of-the-art results on some class of images. So I looked up the paper about the tumors. Sections 1-7 are theoretical math. Section 8 gives a program listing with 8 lines of code for characterizing tumors (not including the subroutines, but those seem pretty basic). The code, of course, has absolutely nothing to do with the math, and it's the kind of image processing that people were doing 40 years ago. It's the kind of code that a mathematician who has grudgingly spent a couple of hours reading some basic papers about medical images might write.

So I'm writing a review that says (a) the math looks cool and (b) the medical imaging is completely bogus. If the funding agency wants to spend a lot of money on logic and topology, more power to them, but nobody should be pretending that this is going to help in hospitals.

September 2005
Mathematics as Metaphor. A review of Where Mathematics Comes From, by George Lakoff and Rafael Núñez. Journal of Experimental and Theoretical AI, vol. 17, no. 3, 2005, pp. 305-315.

Back in the early 1990s
The absolute worst paper I ever reviewed, back when non-monotonic logic and "the software crisis" were both hot, was entitled something like "The software crisis and nonmonotonic logic" and, without exaggeration, ran as follows:

  1. (First paragraph) The software crisis is a crisis.
  2. (To the end of p. 2) To escape the crisis and build reliable software, we will need programs that can reason non-monotonically, as illustrated by the consideration that birds can fly but penguins cannot fly.
  3. (Pages 3-25) Incompetent NML with not another word about the software crisis. The logic was so incompetent that they reversed a statement and its negation. (They were using a Prolog-like notation, and wrote "p" as ":-p." and "not p" as "p:-.") Almost needless to say, the examples in the paper were Tweety the bird and penguins --- maybe it got as far as Nixon the Quaker hawk.
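
To spell out why that is a reversal, here is a sketch under the usual Horn-clause reading of Prolog-like notation, where the head sits to the left of ":-" and the body to the right. The paper's exact notation is recalled only approximately, so this is an illustration, not a transcription:

```latex
% A clause with a head and an empty body asserts the head.
\[
\texttt{p :- .} \quad \mbox{reads as} \quad p \leftarrow \top, \quad \mbox{i.e. } p
\]
% A clause with a body and no head is a denial of the body.
\[
\texttt{:- p.} \quad \mbox{reads as} \quad \bot \leftarrow p, \quad \mbox{i.e. } \lnot p
\]
```

So writing "p" as ":-p." and "not p" as "p:-." swaps the two.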