Experiments testing GPT-3's ability at commonsense reasoning: results.

Gary Marcus, Robust AI
Ernest Davis, Department of Computer Science, New York University

These are the results of 157 tests run on GPT-3 in August 2020. We are extremely grateful to Douglas Summers-Stay for running the experiments; we were unable to run them ourselves because OpenAI refused to give us access to the program.

Two GPT-3 metaparameter settings were used in these experiments: "Temperature = 0", at which setting GPT-3 deterministically returns what it considers the most probable result; and the settings that Doug considers preferable for his purposes: temperature=.7, top_p=.9, frequency_penalty=.5. Nine examples were run only at Temperature=0; the rest were run at both settings.
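The difference between the two settings can be illustrated with a toy sampler. This is only a sketch of the general technique (greedy choice at temperature 0, otherwise temperature-scaled sampling restricted to a top_p nucleus), not OpenAI's implementation; the function name, the toy scores, and the dictionary layout are assumptions made for illustration, and frequency_penalty is recorded but not modelled.

```python
import math
import random

# The two configurations described above, written as plain dictionaries.
# The keys mirror the parameter names quoted in the text.
GREEDY = {"temperature": 0.0}
SUMMERS_STAY = {"temperature": 0.7, "top_p": 0.9, "frequency_penalty": 0.5}


def sample_token(logits, temperature, top_p=1.0, rng=random):
    """Pick one token from a toy {token: score} table (hypothetical helper)."""
    if temperature == 0:
        # Greedy decoding: always return the single most probable token.
        return max(logits, key=logits.get)

    # Temperature-scaled softmax over the scores.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Nucleus (top_p) filtering: keep the most probable tokens until their
    # cumulative probability reaches top_p, then sample among those.
    kept, cumulative = [], 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    tokens, kept_probs = zip(*kept)
    return rng.choices(tokens, weights=kept_probs, k=1)[0]


# Next-token scores invented purely for illustration.
toy_logits = {"hospital": 2.0, "fine": 1.5, "dead": 0.5}

print(sample_token(toy_logits, **GREEDY))  # always "hospital"
print(sample_token(toy_logits, SUMMERS_STAY["temperature"], SUMMERS_STAY["top_p"]))
```

Under the first setting the same prompt always yields the same continuation; under the second, repeated runs can differ, which is one reason the two settings often produce different labels for the same prompt below.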

In each example below, our prompt is in ordinary font; GPT-3's continuation is in bold. Each example is labelled with the settings at which it was run. Examples are also labelled "Success" if we consider that GPT-3's continuation of our prompt was reasonable; "Failure" if we consider it clearly unreasonable; and "Flawed" if it is nearly correct, or barely possible but clearly suboptimal. The examples are arranged in rough categories of the domain of the reasoning that would be involved in getting the right answer. Comments on the examples are in italics.

We have included here all of the tests that we ran, including tests that were rerun because of minor typos in the prompt. So the list below includes a number of near-duplicates. Examples of this kind are noted in the comments.

These experiments are not, by any means, either a representative or a systematic sample of anything. We designed them explicitly to be difficult for current natural language processing technology. Moreover, we pre-tested them on the "AI Dungeon" game which is powered by some version of GPT-3, and we excluded those for which "AI Dungeon" gave reasonable answers. (We did not keep any record of those.) The pre-testing on AI Dungeon is the reason that many of them are in the second person; AI Dungeon prefers that. Also, as noted above, the experiments included some near duplicates. Therefore, though we note that, of the 157 examples below, 71 are successes, 70 are failures and 16 are flawed, these numbers are essentially meaningless.

This collection of problems is no more than an account of haphazard experiments that the authors ran in August 2020. We publish it here for the sake of full disclosure relative to our article in MIT Technology Review. It was not designed to be used as a data set, let alone as a benchmark, and is not suitable for that purpose; it is unsystematic and contains several prompts with errors and numerous near repetitions. The authors firmly discourage researchers from using it as a measure to evaluate progress in automated commonsense reasoning, and disclaim any responsibility for any consequences of such misuse. Researchers developing a suitable benchmark may find some of these examples helpful or suggestive, and are certainly welcome to reuse this material as they see fit.

A discussion of these tests may be found in our paper, "GPT-3: Bloviator", Technology Review August 22, 2020.

    Biological reasoning

  1. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    You are in the hospital now.

    Temperature=0. Success

  2. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of bleach into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you take a big gulp, then another. It tastes OK.

    Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking bleach. Error of omission.

  3. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of suntan lotion into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    You are in the hospital.

    Temperature=0. Success

  4. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of suntan lotion into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes kind of funny, but you can’t place it.

    Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking suntan lotion. Error of omission.

  5. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of lubricant into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty . So you drink it.

    You are now in a hospital bed.

    Temperature=0. Success

  6. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of lubricant into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty . So you drink it.

    The first sip is OK.

    Summers-Stay settings. Flawed. GPT-3 does not notice the consequence of drinking lubricant. Error of omission.

  7. You poured yourself a glass of cranberry, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    You are now dead.

    Temperature=0. Failure. GPT-3 seems to assume that grape juice is a poison, despite the facts that there are many references on the Web to cranberry-grape recipes and that Ocean Spray sells a commercial Cran-Grape drink.

  8. You poured yourself a glass of cranberry, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    A few days later, you get very sick.

    Summers-Stay settings. Failure, as above. The word "juice" after "cranberry" was accidentally omitted in these, so we reran them with the corrected version below.

  9. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it.

    You are now dead.

    The moral of the story is:

    If you are going to drink something,

    make sure you know what it is.

    Temperature=0. Failure, as above.

  10. You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you drink it. It tastes a little funny, but you don’t really notice because you are concentrating on how good it feels to drink something. The only thing that makes you stop is the look on your brother’s face when he catches you.

    Summers-Stay settings. Failure, as above. There is no reason for your brother to look concerned.

  11. You are eight years old. When you were born, your mother was in Barcelona and your father was in Tokyo, so you were born in Barcelona and raised in Tokyo. You have never been to Barcelona, but you have been to Tokyo many times.

    Temperature=0. Success

  12. You are eight years old. When you were born, your mother was in Barcelona and your father was in Tokyo, so you were born in Tokyo. You were a very cute baby.

    Summers-Stay settings. Failure. GPT-3’s continuation fails to reflect the fact that children are born in close proximity to their birth mothers, regardless of the father’s location.

    Physical reasoning

  13. If you break a glass bottle that holds toy soldiers, the toy soldiers will probably

    a. be angry

    b. be sad

    c. be happy

    d. be scared

    If you break a glass bottle that holds toy soldiers, the toy soldiers will probably be angry

    Temperature=0. Flawed. Note that it was GPT-3's choice to fill this out as a multiple-choice problem; that was not part of our prompt. The answer is perhaps acceptable as a fantasy.

  14. You are making coffee with milk and sugar. You don't have a spoon to stir your coffee, so you stir it with a pen. But that turns out to be a bad idea, because the coffee is too hot, and the pen starts to melt.

    Temperature=0. Success

  15. You are making coffee with milk and sugar. You don't have a spoon to stir your coffee, so you stir it with a pen. But that turns out to be a bad idea, because the coffee gets all over the pen, and you can't write with it anymore.

    Summers-Stay settings. Flawed. It can hardly be a surprise that the coffee gets all over the pen.

  16. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it with a fork, but it's impossible. You can't eat it.

    Temperature=0. Success, though GPT-3 is very unspecific about why it's impossible.

  17. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it with a fork, but it's impossible. You finally give up, and order another one.

    Summers-Stay settings. Failure. There is no need and no point in ordering another bowl of soup or another fork, and grammatically "another one" can hardly mean a spoon.

  18. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it using your eyeglasses, but it doesn't work.

    You order a bowl of cold tomato soup in a restaurant.

    Temperature=0. Success. GPT-3, at temperature 0, has an undesirable tendency to repeat the prompt or parts of the output (we will see further examples below) but, since that does not demonstrate a failure of understanding, we do not count these against it.

  19. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it using your eyeglasses, but it doesn't work. You have to ask your neighbor to share his spoon with you.

    Summers-Stay settings. Failure. No one in a restaurant asks their neighbor to share a spoon.

  20. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it by pouring it into your napkin, but it's too thick. You try to eat it with your fingers, but it's too runny.

    Temperature=0. Failure. The continuation makes no sense; thinner soup would be no easier to eat with a napkin.

  21. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to drink it by pouring it into your napkin, but it's too cold to be sipped through the fabric. What do you do?

    What would you do if you couldn't open your car door? Would you try to open it by punching the window?

    What would you do if your clothes were stolen from the laundry room? Would you call the police?

    If someone told you that she was too drunk to drive, what would you tell her to do?

    You're in a race and are running down the street.

    Summers-Stay settings. Failure. The continuation makes no sense; warmer soup would not work any better and you would not be sipping it through the fabric. GPT-3 then descends into random blathering.

  22. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to use your plate to scoop it out of the bowl, but it's too big. You try to use your fork, but it's too small.

    Temperature=0. Failure. The immediate continuation "It's too big" is right, but the continuation about the fork is wrong.

  23. You order a bowl of cold tomato soup in a restaurant. It looks delicious, but they forgot to bring you a spoon. You try to use your plate to scoop it out of the bowl, but it's too wide. You ask the waiter for a spoon, but he brings you a butter knife.

    Summers-Stay settings. Failure. "It's too wide" is correct, but it is implausible that the waiter would bring you a butter-knife.

  24. You are baking bread so you need flour. But someone put a sack of potatoes on top of the bag of flour. So in order to get the flour you need to move the potatoes. But the potatoes are heavy. So you need to get a friend to help you move the potatoes. But the friend is busy. So you need to wait until the friend is free. But you are in a hurry. So you need to find someone else to help you move the potatoes. But no one is around. So you need to go and find someone.

    Temperature=0. Success

  25. You are baking bread so you need flour. But someone put a sack of potatoes on top of the bag of flour. So in order to get the flour you need to move the potatoes. Then you discover that someone has stolen the butter.

    Summers-Stay settings. Failure. "Move the potatoes" is correct, but "stolen the butter" is a non-sequitur.

  26. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that a sparrow has built a nest on top of it and there are three eggs in the nest. So you have to wait until the eggs hatch.

    You need flour to bake bread.

    Temperature=0. Success

  27. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that a sparrow has built a nest on top of it and there are three eggs in the nest. So you have to go to the store to buy some bread.

    You go to the store, but they don't have any bread.

    Summers-Stay settings. Success

  28. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that it got thoroughly soaked in a heavy rain last night. So you have to dry it out before you can use it.

    You can do this by spreading it out on a table and putting a fan on it.

    Temperature=0. Failure. Flour that has gotten soaked has to be thrown out; drying it will not help.

  29. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that it got thoroughly soaked in a heavy rain last night. So you have to dry it out. To do that, you spread it out on a tarp in the sun.

    Summers-Stay settings. Failure. As above.

  30. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find on top of it a hat that you thought you had lost months ago. So you take the hat and put it on your head. You go back to the kitchen and find that you have forgotten to buy eggs.

    Temperature=0. Failure. The immediate continuation, that you put the hat on your head, is correct, but then GPT-3 loses track of the flour.

  31. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find on top of it a hat that you thought you had lost months ago. So you fill the sack with flour, and add the hat to your head.

    So you have flour and a hat.

    Summers-Stay settings. Failure. You are taking flour out of the sack, not filling it with flour.

  32. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that the flour is at the bottom of the sack; it is covered with several layers of potatoes. So you have to take the potatoes out of the sack.

    You need flour to bake bread.

    Temperature=0. Success

  33. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that the flour is at the bottom of the sack; it is covered with several layers of potatoes. So you have to wash the potatoes off. You turn on the light in the garage, and you find that a rat has eaten a hole in the sack.

    Summers-Stay settings. Failure. Washing the potatoes off is irrelevant.

  34. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that the flour is at the bottom of the sack, but that somebody emptied a couple of boxes of dishwashing detergent on top of it. So you have to dig through the detergent to get to the flour.

    That’s what it’s like to try to find a file in a directory that has been indexed by Spotlight.

    Temperature=0. Failure. It would not be safe to use flour that has had detergent on top of it; certainly, you would not "dig through" the detergent. The following paragraph is just a strange digression.

  35. You need flour to bake bread. You have a sack of flour in the garage. When you get there, you find that the flour is at the bottom of the sack, but that somebody emptied a couple of boxes of dishwashing detergent on top of it. So you have to dig through the detergent to get to the flour.

    As you're digging, you notice that the detergent is a pretty good match for the flour.

    Summers-Stay settings. Failure. As above.

  36. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because the cigarette butt got stuck in the glass, and I couldn’t get it out. I tried to fish it out with a straw, but the straw broke.

    Temperature=0. Success.

  37. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cigarette. But that turned out to be a bad idea because it kept falling on the floor.”

    That’s when he decided to start the Cremation Association of North America, which has become a major cremation provider with 145 locations.

    Summers-Stay settings. Failure. GPT-3 first mispredicts the consequences of stirring lemonade with a cigarette and then wanders off into irrelevant nonsense.

  38. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cookie. But that turned out to be a bad idea because the cookie crumbled and made the lemonade cloudy. So I tried to fish out the crumbs with a fork, but I couldn’t get them all, and I ended up drinking a lot of crumbs.

    Temperature=0. Success

  39. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a cookie. But that turned out to be a bad idea because the cookie crumbled and made the lemonade gritty. So I decided to eat the cookie to get rid of the crumbs.

    Summers-Stay settings. Flawed. Once the cookie has crumbled inside the lemonade, one can hardly eat the cookie.

  40. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a paper napkin. But that turned out to be a bad idea because the napkin disintegrated. I had to use my fingers to get the sugar off my hands.

    Temperature=0. Success

  41. At the party, I poured myself a glass of lemonade, but it turned out to be too sour, so I added a little sugar. I didn’t see a spoon handy, so I stirred it with a paper napkin. But that turned out to be a bad idea because the napkin disintegrated in the glass. After I finished the drink, I threw the napkin away in a wastebasket—but when I picked up the glass to wash it out, there was a big black spot on my hand.

    Summers-Stay settings. Failure. GPT-3 starts off well, but you would hardly finish the drink with a disintegrated napkin in it; if you did, you could not then throw out the napkin; and there is no reason that any of this would leave a big black spot on your hand.

  42. You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to remove the door. You have a table saw, so you cut the door in half and remove the top half.

    Temperature=0. Failure. This is one confusion after another. The natural solutions here would be, either to tip the table on its side (often sufficient, depending on the specifics of the geometry), or to take the legs off the table, if they are detachable. Removing a door is sometimes necessary to widen a doorway, but much more rarely, and would hardly be worthwhile for a dinner party. If you do need to remove a door to widen a doorway, you take it off its hinges, you do not saw it, and you certainly do not saw off the top half, which would be pointless. Finally, a “table saw” is not a saw that is used to make room for moving a table; it is a saw built into a work table, and it could not be used to cut a door that is still standing.

  43. You are having a small dinner party. You want to serve dinner in the living room. The dining room table is wider than the doorway, so to get it into the living room, you will have to take it apart. This is a very difficult task.

    Summers-Stay settings. Success

  44. Never in his life has Bashan caught a hare, nor will he ever; the thing is as good as impossible. Many dogs, they say, are the death of a hare, a single dog cannot achieve it, even one much speedier and more enduring than Bashan. The hare can ``double'' and Bashan cannot --- and that is all there is to it. How Bashan runs! It is beautiful to see a creature expending the utmost of its powers. He runs better than the hare does, he has stronger muscles, the distance between them visibly diminishes before I lose sight of them. And I make haste too, leaving the path and cutting across the park towards the river-bank, reaching the gravelled street in time to see the chase come raging on— the hopeful, thrilling chase, with Bashan on the hare’s very heels; — “One more push, Bashan!” I think, and feel like shouting; “Well run, old chap, remember the double!” But there it is; Bashan does make one more push, and the misfortune is upon us; the hare gives a quick, easy, almost malicious twitch at right angles to the course, and Bashan , with a despairing howl, is left behind. He is not a dog to howl, but he howls now, and I am sorry for him.

    Temperature=0. Success. The passage is from "A Man and his Dog" by Thomas Mann. The actual continuation is "Bashan shoots past from his rear. Before he can stop, turn around, and get going in the other direction, the hare has gained so much ground that it is out of sight."

  45. Never in his life has Bashan caught a hare, nor will he ever; the thing is as good as impossible. Many dogs, they say, are the death of a hare, a single dog cannot achieve it, even one much speedier and more enduring than Bashan. The hare can ``double'' and Bashan cannot --- and that is all there is to it. How Bashan runs! It is beautiful to see a creature expending the utmost of its powers. He runs better than the hare does, he has stronger muscles, the distance between them visibly diminishes before I lose sight of them. And I make haste too, leaving the path and cutting across the park towards the river-bank, reaching the gravelled street in time to see the chase come raging on— the hopeful, thrilling chase, with Bashan on the hare’s very heels; — “One more push, Bashan!” I think, and feel like shouting; “Well run, old chap, remember the double!” But there it is; Bashan does make one more push, and the misfortune is upon us; the hare gives a quick, easy, almost malicious twitch at right angles to the course, and Bashan is left behind, and is speeding in the opposite direction. I can see him stop and look round with his short tail between his legs, then he sits down dejected, and waits for me to come up.

    Summers-Stay settings. Flawed. "Speeding in the opposite direction" is not right.

  46. You're getting dressed and you find that your belt has broken. For a second you think maybe you can use a rubber band, but then you realize that, of course, you don't have any. You're going to have to go without a belt.

    Temperature=0. Failure. The problem is that a rubber band would not be nearly long enough.

  47. You're getting dressed and you find that your belt has broken. For a second you think maybe you can use a rubber band, but then you realize that, of course, you don't have one. What do you do?.

    Summers-Stay settings. Failure. As above.

  48. You are getting dressed and you find that your belt has broken. Maybe you could wear your pants without a belt, but they are cut rather large, and you run the risk of them falling down. You could go to the store and buy a new belt, but you are in a hurry and it is cold outside.

    Temperature=0. Success.

  49. You are getting dressed and you find that your belt has broken. Maybe you could wear your pants without a belt, but they are cut rather large, and you run the risk of having them fall down at inopportune moments. So, what can you do? Get a new belt? Not really, you’ve got a pair of scissors and some of that stretchy string that they use to secure packages with.

    Summers-Stay settings. Success

  50. You live in Cincinnati. It’s Valentine’s Day today, and you need to buy flowers for your significant other. You are getting dressed to go to the florist. You find that your jeans have a large hole in them. You think maybe you could wear shorts, but then you realize that you don’t have any. You decide to wear a skirt, but you don’t have one.

    Temperature=0. Failure. The problem with shorts is that they would be too cold. You would hardly decide to wear a skirt if you don't have one.

  51. You live in Cincinnati. It’s Valentine’s Day today, and you need to buy flowers for your significant other. You are getting dressed to go to the florist. You find that your jeans have a large hole in them. You think maybe you could wear shorts, but then you realize that you don’t have any clean shorts. You could wear your suit pants, but that would be really uncomfortable.

    Summers-Stay settings. Success

  52. You’re in the fourth grade. The only clean pants that are in your closet are the white ones that everyone at school makes fun of. You think your father would let you borrow a pair of his pants, but then you realize that he’s wearing the only pair of clean pants in the house. You’re in the fourth grade.

    Temperature=0. Failure. The problem with wearing your father's pants is that they won't fit. Since your white pants are clean, your father is not wearing the only pair of clean pants in the house.

  53. You’re in the fourth grade. The only clean pants that are in your closet are the white ones that everyone at school makes fun of. You think your father would let you borrow a pair of his pants, but then you realize he doesn’t even have any white pants. He has only black ones.

    Summers-Stay settings. Failure. As above, GPT-3 misses the fact that your father's pants won't fit. The fact that he has no white pants, only black ones, would not in itself be a problem.

  54. Sheila and Jack had finished work at the office for the day and were chatting together as they went home. Sheila pressed the button for the elevator. The elevator doors opened, but the elevator car wasn’t there; behind the doors was an empty elevator shaft. Sheila was too busy talking to see what had happened, so, still looking at Jack, she stepped in. Jack saw what was happening and tried to grab her, but he was too late. Sheila fell down the elevator shaft.

    Temperature=0. Success. This is adapted from an episode of LA Law, discussed in Rebooting AI (Marcus and Davis, 2019).

  55. Sheila and Jack had finished work at the office for the day and were chatting together as they went home. Sheila pressed the button for the elevator. The elevator doors opened, but the elevator car wasn’t there; behind the doors was an empty elevator shaft. Sheila was too busy talking to see what had happened, so, still looking at Jack, she stepped in. As she did, her foot slipped and she started to fall down the elevator shaft. Jack grabbed her hand, but couldn’t hold on and she fell the rest of the way.

    Summers-Stay settings. Success

    Temporal reasoning

  56. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Luckily, the meeting ended early and I was able to catch the train. I was able to get a seat, but I was still a little nervous.

    Temperature=0. Success

  57. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Luckily, the meeting ended early, but I still didn't have enough time to grab lunch. I did have time to drop by the Union Square Whole Foods to see if they had any good soup.

    Summers-Stay settings. Failure. GPT-3 starts off right with "early", but then contradicts itself.

  58. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Luckily the train station was only a few blocks away. I walked in and saw a sign that said "Trains to New York.

    Temperature=0. Success

  59. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Luckily the train station was just a few blocks away, so I took a quick walk and made it back just in time to catch the train.

    So far it's been a pretty good day.

    Summers-Stay settings. Flawed. If there isn't much time it is unlikely that one could go for a quick walk.

  60. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Unfortunately, the meeting was in the opposite direction of the train station, so I had to run. I was able to make it to the station with about 10 minutes to spare.

    Temperature=0. Success

  61. My meeting started at 3:30 and I needed to catch the train at 4:30, so there wasn't much time. Unfortunately, the meeting was a little longer than I expected, and I got to the station at 4:05. I was so bummed that I wouldn't be able to make the train, but luckily, I was able to catch the 5:00 train.

    Summers-Stay settings. Failure. The times here are all confused. If the meeting was longer than you expected, it is extremely unlikely that you would get to the station by 4:05, and if you got to the station by 4:05, you should be able to catch the 4:30 train.

  62. Moshe posted a photograph on Facebook that shows Pablo Picasso, Salvador Dali, Andy Warhol, Gustav Klimt, and Jackson Pollock. Klimt died in 1918. Warhol was born in 1928. So the photograph is a fake.

    The photograph is a fake.

    Temperature=0. Success. As discussed above, the repetition is characteristic of GPT-3 at temperature 0. This seems impressive; however, one should not conclude that GPT-3 compared Klimt's death date to Warhol's birth date and concluded that they could not be photographed together. It would be interesting to run some systematic experimentation to test what is the range of problems of this kind that GPT-3 can solve. Incidentally, a faked photograph of this description was indeed posted on Facebook.

  63. Moshe posted a photograph on Facebook that shows Pablo Picasso, Salvador Dali, Andy Warhol, Gustav Klimt, and Jackson Pollock. Klimt died in 1918. Warhol was born in 1928. So the photograph shows a meeting of dead artists.

    Moshe said that he wants to make sure that people do not get caught up in the spirit of the season.

    Summers-Stay settings. Failure. "shows a meeting of dead artists" is correct, but the sentence after that is meaningless.

  64. Moshe posted on Facebook a photograph showing Maurice Ravel, Francois Poulenc, Frederic Mompou, and Erik Satie. Satie died in 1925. Poulenc was born in 1899. So the photograph must have been taken in the 1920s.

    Temperature=0. Success. This is not quite right; if there were such a photograph, it could have been taken in the late 1910s. Mompou and Poulenc would have been young, but not impossibly young. However, it's close enough.

  65. Moshe posted on Facebook a photograph showing Maurice Ravel, Francois Poulenc, Frederic Mompou, and Erik Satie. Satie died in 1925. Poulenc was born in 1899. So the photograph must  have been taken in 1926.

    Summers-Stay settings. Failure. Close but no cigar. No one is photographed a year after they die.
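
    The judgments on examples 64 and 65 both follow from the window of years in which such a photograph could have been taken. A rough sketch, using only the two dates in the prompt and assuming that the other two composers do not narrow the window further; the helper names are hypothetical, and the lower bound is deliberately loose, since it only requires that Poulenc had been born.

```python
def photo_window(latest_birth_year: int, earliest_death_year: int):
    """Return (first, last) year in which a genuine group photo is possible, or None."""
    if latest_birth_year > earliest_death_year:
        return None
    return (latest_birth_year, earliest_death_year)


def year_is_possible(year: int, window) -> bool:
    """Check whether a proposed year falls inside the window."""
    return window is not None and window[0] <= year <= window[1]


# Dates taken from the prompt: Poulenc was born in 1899, Satie died in 1925.
window = photo_window(latest_birth_year=1899, earliest_death_year=1925)

print(window)                          # (1899, 1925)
print(year_is_possible(1926, window))  # False
print(year_is_possible(1918, window))  # True
print(any(year_is_possible(y, window) for y in range(1920, 1930)))  # True
```

    The year 1926 falls outside the window, while both the early 1920s and the late 1910s fall inside it, which is why "the 1920s" is close but not forced.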

    Spatial reasoning

  66. You’re vacationing in Fairfield. You can see Lobster Island right across the bay from you, due east, and you’d love to take a hike there, so you ask the hotel keeper how to get there. She answers, “The only bridge to Lobster Island is the one connecting it to Newport. Newport is about three miles north of here; you’ll pass through Somerset. That’s where my father lives, in Somerset, right off Route 202. If you find yourself in Gloucester – that’s the place with the huge, stone, Masonic lodge in the center of town – you’ve gone too far. The fastest way on foot is to go up Route 202, or you can walk up the beach. Either way, you’ll see the bridge once you get close.”

    You have plenty of time, so you decide to go up the beach. You walk for about an hour, and you’re just about to give up when you see the bridge.

    Temperature=0. Success

  67. You’re vacationing in Fairfield. You can see Lobster Island right across the bay from you, due east, and you’d love to take a hike there, so you ask the hotel keeper how to get there. She answers, “The only bridge to Lobster Island is the one connecting it to Newport. Newport is about three miles north of here; you’ll pass through Somerset. That’s where my father lives, in Somerset, right off Route 202. If you find yourself in Gloucester – that’s the place with the huge, stone, Masonic lodge in the center of town – you’ve gone too far. The fastest way on foot is to go up Route 202, or you can walk up the beach. Either way, you’ll see the bridge once you get close.”

    You have plenty of time, so you decide to go the beach route. It’s a fine day and you get a lot of exercise.

    Summers-Stay settings. Success

  68. You’re vacationing in Fairfield. You can see Lobster Island right across the bay from you, due east, and you’d love to take a hike there, so you ask the hotel keeper how to get there. She answers, “The only bridge to Lobster Island is the one connecting it to Newport. Newport is about three miles north of here; you’ll pass through Somerset. That’s where my father lives, in Somerset, right off Route 202. If you find yourself in Gloucester – that’s the place with the huge, stone, Masonic lodge in the center of town – you’ve gone too far. The fastest way on foot is to go up Route 202, or you can walk up the beach. Either way, you’ll see the bridge once you get close.”

    You set out along route 202 and start walking at a brisk pace. You get to thinking about your daughter and the trouble she’s having getting used to college life, and you stop paying attention. Suddenly you look up, because there, in front of you, is a five story, granite building with a sign, “Order of the Masons: Lodge 575”. You ’re in Gloucester. You’ve gone too far.

    Temperature=0. Success. This is one of GPT-3's more impressive successes in our experiments.

  69. You’re vacationing in Fairfield. You can see Lobster Island right across the bay from you, due east, and you’d love to take a hike there, so you ask the hotel keeper how to get there. She answers, “The only bridge to Lobster Island is the one connecting it to Newport. Newport is about three miles north of here; you’ll pass through Somerset. That’s where my father lives, in Somerset, right off Route 202. If you find yourself in Gloucester – that’s the place with the huge, stone, Masonic lodge in the center of town – you’ve gone too far. The fastest way on foot is to go up Route 202, or you can walk up the beach. Either way, you’ll see the bridge once you get close.”

    You set out along route 202 and start walking at a brisk pace. You get to thinking about your daughter and the trouble she’s having getting used to college life, and you stop paying attention. Suddenly you look up, because there, in front of you, is a five story, granite building with a sign, “Order of the Masons: Lodge 575”. You realize you’ve walked straight past the bridge and the beach, and you’re in Gloucester. You turn around and begin heading back.

    Summers-Stay settings. Success

  70. You want to mail an eight-inch diameter circular china plate to your mother. The only box you have is five-by-five-by-two inches. You will have to pack the plate in a box that is too small. You will have to pack the plate in a box that is too small.

    Temperature=0. Failure. It is not possible to pack a plate in a box that is too small. You will have to get another box.
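
    The geometric fact behind this judgment can be checked directly. The sketch below assumes a rigid, flat plate; the function names and the one-inch plate thickness are hypothetical values chosen for illustration. It uses the box's space diagonal as a bound: a rigid disc contains a straight segment as long as its diameter, and no segment inside a box is longer than that diagonal.

```python
import math


def rigid_disc_cannot_fit(diameter: float, box_dims) -> bool:
    """True if the disc is provably too big: its diameter exceeds the
    longest straight segment that fits in the box (the space diagonal)."""
    space_diagonal = math.sqrt(sum(side ** 2 for side in box_dims))
    return diameter > space_diagonal


def disc_fits_lying_flat(diameter: float, box_dims, thickness: float = 1.0) -> bool:
    """Sufficient test: lay the disc flat on the largest face of the box.
    The thickness default is an assumed value, not taken from the prompt."""
    short, middle, _long = sorted(box_dims)
    return diameter <= middle and thickness <= short


# Eight-inch plate, five-by-five-by-two-inch box (this example).
print(rigid_disc_cannot_fit(8, (5, 5, 2)))    # True

# Eight-inch plate, twelve-by-twelve-by-four-inch box (examples 72 and 73).
print(disc_fits_lying_flat(8, (12, 12, 4)))   # True
```

    The space diagonal of the small box is the square root of 54, about 7.35 inches, so the eight-inch plate cannot fit in any orientation.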

  71. You want to mail an eight-inch diameter circular china plate to your mother. The only box you have is five-by-five-by-two inches. You will have to make two box cuts to get the plate in. (a) How many inches of tape will you need to make the first cut? (b) How many inches of tape will you need to make the second cut?

    6.

    Summers-Stay settings. Failure. Cutting the box will not make it large enough to hold the plate. GPT-3 wanders off into nonsense.

  72. You want to mail an eight-inch diameter circular china plate to your mother. The only box you have is twelve-by-twelve-by-four inches. You will have to pack the plate in a box that is too small. You will have to use a lot of packing material to fill the empty space.

    Temperature=0. Success

  73. You want to mail an eight-inch diameter circular china plate to your mother. The only box you have is twelve-by-twelve-by-four inches. You will have to roll the plate and pack it in soft material to prevent it from breaking.

    A small china plate can be rolled in a newspaper.

    Summers-Stay settings. Failure. GPT-3 is confused. A plate cannot be rolled up, and since the box is much larger than the plate, rolling up the plate would be unnecessary even if it were possible.

  74. You want to mail a beautiful eight-inch diameter circular piece of fabric to your mother. The only box you have is five-by-five-by-two inches. You will have to fold the fabric to fit. What is the largest circle you can make that will fit in the box?

    The answer is not obvious.

    Temperature=0. Failure. GPT-3 starts off correct, but its follow-up question about "the largest circle" is irrelevant.

  75. You want to mail a beautiful eight-inch diameter circular piece of fabric to your mother. The only box you have is five-by-five-by-two inches. You will have to fold the fabric in half. How do you fold the fabric to fit in the box?

    The first step is to fold the fabric in half.

    Summers-Stay settings. Failure. GPT-3 is correct that you have to fold the fabric, but folding it in half will not suffice to make it fit.
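
    A quick check of why one fold is not enough, as a sketch only: it assumes the fabric is folded flat along perpendicular diameters, is thin enough that the layers fit easily in the two-inch depth, and is laid on the five-by-five face of the box. The function names are hypothetical.

```python
import math


def folded_footprint(diameter: float, folds: int):
    """Bounding rectangle of a circle folded flat 0, 1, or 2 times
    along perpendicular diameters: full disc, half disc, quarter disc."""
    if folds == 0:
        return (diameter, diameter)
    if folds == 1:
        return (diameter, diameter / 2)
    if folds == 2:
        return (diameter / 2, diameter / 2)
    raise ValueError("this sketch only models 0, 1, or 2 folds")


def fits_on_face(footprint, face) -> bool:
    """Axis-aligned test: does the footprint fit on the given box face?"""
    return max(footprint) <= max(face) and min(footprint) <= min(face)


FACE = (5, 5)

print(fits_on_face(folded_footprint(8, 1), FACE))  # False
print(8 > math.hypot(*FACE))                       # True
print(fits_on_face(folded_footprint(8, 2), FACE))  # True
```

    Folded in half, the fabric still has an eight-inch straight edge, which is longer than even the diagonal of the five-by-five face (about 7.07 inches). Folded in quarters, it occupies a four-by-four footprint and fits.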

  76. Your eight-year-old son has put together a 5000-piece, 2 foot by 2 foot, jigsaw puzzle, with a picture of a battle of heroes battling against a monster. He is immensely proud of it and wants to keep it “forever” in his bureau drawer. But the drawer is only 18 inches wide. You tell him that you will help him put it in a frame and hang it on the wall. He is not happy with this idea.

    Temperature=0. Success

  77. Your eight-year-old son has put together a 5000-piece, 2 foot by 2 foot, jigsaw puzzle, with a picture of a battle of heroes battling against a monster. He is immensely proud of it and wants to keep it “forever” in his bureau drawer. But the drawer is only 18 inches wide. You tell him that you will help him fold the puzzle down to fit in the drawer.

    You fold the puzzle down, but now some of the pieces are not attached to any others.

    Summers-Stay settings. Flawed. "Fold the puzzle down" is not meaningful.

    Reasoning about perception

  78. The year is 2075. Violet Blair is an astronaut on the Moon, in the Sea of Tranquility. Right now, on Earth, there is a solar eclipse. Violet looks in the direction of the Earth and sees the Moon blocking the Sun. She is looking at the Earth through the Moon.

    Temperature=0. Failure. During a solar eclipse, a viewer on Earth sees the Moon blocking the Sun, but a viewer on the near side of the Moon would see the Moon's shadow passing across the Earth.

  79. The year is 2075. Violet Blair is an astronaut on the Moon, in the Sea of Tranquility. Right now, on Earth, there is a solar eclipse. Violet looks in the direction of the Earth and sees the Sun's corona, a beautiful ring of light. She's distracted by a loud sound from somewhere else on the Moon.

    Summers-Stay settings. Failure. During a total solar eclipse, a viewer on Earth can see the corona, but a viewer on the near side of the Moon cannot see the Sun at all; GPT-3 does not understand the simple spatial relations involved here. Also, there are no sounds on the Moon.

  80. The year is 2075. Violet Blair is an astronaut on the Moon, in the Sea of Tranquility. Right now, on Earth, there is a lunar eclipse. Violet looks in the direction of the Earth and sees the Moon's shadow moving across the Earth. She is thinking about her husband, who is on Earth, and her daughter, who is on the Moon.

    Temperature=0. Failure. During a lunar eclipse, a viewer on the near side of the Moon would see the Earth blocking the Sun, with the atmosphere around the Earth forming a bright halo due to refraction.

  81. The year is 2075. Violet Blair is an astronaut on the Moon, in the Sea of Tranquility. Right now, on Earth, there is a lunar eclipse. Violet looks in the direction of the Earth and sees the shadow of the Earth passing over the face of the Moon. The shadow of the Earth is called the "umbra".

    Summers-Stay settings. Failure. As above.

  82. The year is 2075. Violet Blair is an astronaut on the Moon. She is the first astronaut to visit the far side of the Moon. Right now, on Earth, there is a solar eclipse. Violet looks in the direction of the Earth and sees the eclipse. She is the first person to see the Earth from the Moon.

    Temperature=0. Failure. From the far side of the Moon, the Earth is never visible. If you are looking in the direction of the Earth, you are looking down at the Moon's surface. During a solar eclipse, the far side is in daylight, so you would see the Moon's surface at your feet.

  83. The year is 2075. Violet Blair is an astronaut on the Moon. She is the first astronaut to visit the far side of the Moon. Right now, on Earth, there is a solar eclipse. Violet looks in the direction of the Earth and sees a beautiful diamond ring shape. She is excited and wants to share this moment with the Earth.

    Summers-Stay settings. Failure. As above.

  84. The year is 2075. Violet Blair is an astronaut on the Moon. She is the first astronaut to visit the far side of the Moon. Right now, on Earth, there is a lunar eclipse. Violet looks in the direction of the Earth and sees the Moon blocking the Sun. She is the first person to see a lunar eclipse.

    Temperature=0. Failure. During a lunar eclipse, the far side of the Moon is in darkness. So Violet would be looking toward her feet, but would see nothing but blackness.

  85. The year is 2075. Violet Blair is an astronaut on the Moon. She is the first astronaut to visit the far side of the Moon. Right now, on Earth, there is a lunar eclipse. Violet looks in the direction of the Earth and sees the eclipse. She remembers her mother and sister, who died in a car accident.

    Summers-Stay settings. Failure. As above.
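
    The spatial relations behind the eight eclipse examples can be captured in a one-dimensional toy model. This is a sketch under strong assumptions: the Sun, Earth, and Moon are treated as points on a single line, the observer stands at the centre of the near or far side, and effects such as atmospheric refraction are ignored. The coordinates are unitless, since only the ordering along the line matters, and all names are hypothetical.

```python
def view_from_moon(eclipse: str, side: str) -> dict:
    """Describe what an observer on the Moon has overhead during an eclipse."""
    if eclipse not in ("solar", "lunar"):
        raise ValueError("eclipse must be 'solar' or 'lunar'")
    if side not in ("near", "far"):
        raise ValueError("side must be 'near' or 'far'")

    # Positions along the Sun-Earth line; only their order matters.
    sun_x = 400.0
    earth_x = 0.0
    # Solar eclipse: Moon between Earth and Sun. Lunar: Moon behind Earth.
    moon_x = 1.0 if eclipse == "solar" else -1.0

    toward_earth = 1 if earth_x > moon_x else -1
    toward_sun = 1 if sun_x > moon_x else -1
    # "Up" for the observer: toward Earth on the near side, away on the far side.
    up = toward_earth if side == "near" else -toward_earth

    earth_in_sky = up == toward_earth
    sun_above_horizon = up == toward_sun
    earth_between = min(moon_x, sun_x) < earth_x < max(moon_x, sun_x)
    earth_blocks_sun = sun_above_horizon and earth_between

    return {
        "earth_in_sky": earth_in_sky,
        "sun_visible": sun_above_horizon and not earth_blocks_sun,
        "earth_blocks_sun": earth_blocks_sun,
    }


for eclipse in ("solar", "lunar"):
    for side in ("near", "far"):
        print(eclipse, side, view_from_moon(eclipse, side))
```

    In this model, the near-side observer during a solar eclipse has the Earth overhead and the Sun behind the Moon; the far-side observer during a solar eclipse is in daylight with no Earth in the sky; the near-side observer during a lunar eclipse has the Earth blocking the Sun; and the far-side observer during a lunar eclipse sees neither.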

  86. You are sitting in a theater, and you have a great view of the stage, because you carefully sat behind someone who is very short. You are watching a play, and you are enjoying it very much.

    Temperature=0. Success

  87. You are sitting in a theater, and you have a great view of the stage, because you carefully sat behind someone who is very short. However, you are a little hungry and decide to order a box of popcorn.

    Summers-Stay settings. Success

  88. It was a chilly March day. Andrea was sitting on the beach. Hector came up to her and smiled. She wondered whether he was wearing the shirt she had given him, but she couldn’t see, because he was wearing a jacket.

    “Hi,” he said.

    Temperature=0. Success

  89. It was a chilly March day. Andrea was sitting on the beach. Hector came up to her and smiled. She wondered whether he was wearing the shirt she had given him, but she couldn’t see, because he was wearing a parka.

    Carla was talking to a friend of hers when she saw a man walking towards her.

    Summers-Stay settings. Success. The switch to Carla is a mild non-sequitur, but certainly possible.

    Psychological reasoning

  90. You need to go to the library to get a book. You have forgotten the address of the library, but you know that your friend Mary knows. So you ask her.

    Mary: "The library is at the corner of Main and Elm."

    You: "Thanks, Mary. I'll go there now."

    Mary: "You're welcome."

    You: "Bye"

    Mary: "Bye"

    You go to the library and get the book.

    Temperature=0. Success

  91. You need to go to the library to get a book. You have forgotten the address of the library, but you know that your friend Mary knows. So you write her a letter and ask her. The next day, she writes back and gives you the address. You then go to the library, get the book, and bring it back to your house.

    Summers-Stay settings. Success.

  92. A cat was sitting next to a mouse hole in the kitchen wall waiting for the mouse, but it was too cautious, so the cat was waiting for a long time.

    The cat was getting hungry, so it was thinking of going to the market to buy some food.

    Temperature=0. Flawed. Cats do not buy food at a market. On the other hand, this kind of anthropomorphism is very common.

  93. A cat was sitting next to a mouse hole in the kitchen wall waiting for the mouse, but it was too cautious, so the cat ate the cat food instead.

    A pig was walking down the street when a car hit him.

    Summers-Stay settings. Failure. The switch to the pig is a non-sequitur.

  94. Anna had been studying hard for the physics test for davs; she had reviewed every page of the textbook and had done hundreds of practice problems. Lucy, on the other hand, had been busy having a great time with her friends all week. She glanced over the textbook the night before the exam; it looked more or less familiar. Not surprisingly, therefore, Lucy did not do as well as Anna on the test.

    The next day, Lucy was very upset.

    Temperature=0. Success

  95. Anna had been studying hard for the physics test for davs; she had reviewed every page of the textbook and had done hundreds of practice problems. Lucy, on the other hand, had been busy having a great time with her friends all week. She glanced over the textbook the night before the exam; it looked more or less familiar. Not surprisingly, therefore, when the test was handed out, Lucy did fine and Anna did poorly. It didn't take long for Lucy to realize that she'd just had a lucky break; she'd done well on the test not because she understood the concepts and principles, but because she was familiar with the words in the text.

    Summers-Stay settings. Failure. It is of course possible that Lucy might do well and Anna poorly, but it is certainly not consistent with "Therefore". The word "days" was misspelled as "davs" in these attempts, so we reran them with the error corrected.

  96. Anna had been studying hard for the physics test for days; she had reviewed every page of the textbook and had done hundreds of practice problems. Lucy, on the other hand, had been busy having a great time with her friends all week. She glanced over the textbook the night before the exam; it looked more or less familiar. Not surprisingly, therefore, Lucy did not do as well on the test as Anna did.

    The next day, Lucy was in a foul mood.

    Temperature=0. Success

  97. Anna had been studying hard for the physics test for days; she had reviewed every page of the textbook and had done hundreds of practice problems. Lucy, on the other hand, had been busy having a great time with her friends all week. She glanced over the textbook the night before the exam; it looked more or less familiar. Not surprisingly, therefore, Lucy had no trouble getting a good grade on the test. However, if Lucy had been smart, she would have studied just as hard as Anna did.

    Summers-Stay settings. Failure. Again, it is possible that Lucy might do well, but it is not consistent with "Therefore".

  98. Dr. Jackson stepped swiftly forward, seized the poker, and bent it into a curve with his huge brown hands.

    “See that you keep yourself out of my grip,” he snarled, and hurling the twisted poker into the fireplace he strode out of the room.

    “He seems a very amiable person,” said Charles, laughing. “I am not quite so bulky, but if he had remained, I might have shown him that my grip was not much more feeble than his own.” As he spoke, he picked up the poker, and he and I examined it together. “Yes,” he said, “this is a singularly strong and thick one.

    Temperature=0. Failure. For narrative coherence Charles should do something to indicate that his grip is as strong as Dr. Jackson's. This passage is adapted from the Sherlock Holmes story "The Speckled Band" (the original names are "Dr. Roylott" and "Holmes") where Holmes straightens the poker.

  99. Dr. Jackson stepped swiftly forward, seized the poker, and bent it into a curve with his huge brown hands.

    “See that you keep yourself out of my grip,” he snarled, and hurling the twisted poker into the fireplace he strode out of the room.

    “He seems a very amiable person,” said Charles, laughing. “I am not quite so bulky, but if he had remained, I might have shown him that my grip was not much more feeble than his own.” As he spoke, he picked up the poker, and he and Sir Henry examined the end, which was bent into a curve resembling a half-moon.

    “Steel,” he remarked, “steel of the best quality.

    Summers-Stay settings. Flawed. As above

  100. Two children, Chloe and Alexander, went for a walk. They both saw a dog and a tree. Alexander also saw a cat, and pointed it out to Chloe. She went to pet the cat. Chloe was not aware that Alexander had seen the cat.

    The cat was not aware that Alexander had seen the cat.

    Temperature=0. Failure. Since Alexander pointed the cat out to Chloe, clearly Chloe was aware that Alexander had seen the cat. GPT-3 fails at this simple chain of reasoning about knowledge; it is also an illustration of how GPT-3 fails to update discourse models. This example is discussed in Rebooting AI.

  101. Two children, Chloe and Alexander, went for a walk. They both saw a dog and a tree. Alexander also saw a cat, and pointed it out to Chloe. She went to pet the cat. Chloe was not impressed. She threw a rock at the cat, saying "I hate cats!".

    Summers-Stay settings. Failure. If Chloe is going to pet the cat, it is implausible that she would throw a rock at it.

  102. Four children go camping. It starts to rain. Their tent starts to leak. They get wet and cold. Suddenly they see a bear and there is a big commotion. Everyone is frightened, but they yell really loudly, and the bear runs away. The thing that they still need to deal with after the bear runs away is the leaky tent.

    This is a great book for teaching children how to deal with a scary situation.

    Temperature=0. Success

  103. Four children go camping. It starts to rain. Their tent starts to leak. They get wet and cold. Suddenly they see a bear and there is a big commotion. Everyone is frightened, but they yell really loudly, and the bear runs away. The thing that they still need to deal with after the bear runs away is the water that is pouring into their tent.

    This book is a great reminder to me that I have spent my life trying to avoid problems.

    Summers-Stay settings. Flawed. The immediate continuation is fine, but the next sentence is meaningless.

  104. Four children go camping. It starts to rain. Their tent starts to leak. They get wet and cold. Suddenly they see a bear and there is a big commotion. Everyone is frightened, but they yell really loudly, and the bear runs away. Briefly they are elated. But eventually they realize that the thing that they still need to deal with after the bear runs away is the fact that their tent is leaking.

    The story is a metaphor for the way that we deal with our problems.

    Temperature=0. Success. The second sentence in the continuation is pretty much meaningless, but "metaphor" is such a loose category that I hesitate to mark this wrong.

  105. Four children go camping. It starts to rain. Their tent starts to leak. They get wet and cold. Suddenly they see a bear and there is a big commotion. Everyone is frightened, but they yell really loudly, and the bear runs away. Briefly they are elated. But eventually they realize that the thing that they still need to deal with after the bear runs away is the tent that is still leaking.

    In the story, the children are like people.

    Summers-Stay settings. Failure. The immediate continuation is right, but the next sentence is absolutely meaningless; children are always like people. It is rather striking that, in both our experiments with this particular story, GPT-3 turns to a meaningless meta-level comment. It would be interesting to test how reliable that is.

  106. Chris and Jamie meet at a bar, and find each other attractive. They start chatting, and get to know each other, and find that they even friends in common; they really hit it off.

    They pause when the bartender asks them what they'd like to drink. After that, everyone in the room starts shouting, because the local football team has just made it to the playoffs. Someone offers to buy a round for the whole bar, and there is much rejoicing.

    After these distractions, Chris and Jamie quickly return to trying to get to know each other. They find that they have a lot in common, and they really like each other.

    Temperature=0. Success

  107. Chris and Jamie meet at a bar, and find each other attractive. They start chatting, and get to know each other, and find that they even friends in common; they really hit it off.

    They pause when the bartender asks them what they'd like to drink. After that, everyone in the room starts shouting, because the local football team has just made it to the playoffs. Someone offers to buy a round for the whole bar, and there is much rejoicing.

    After these distractions, Chris and Jamie quickly return to trying to get to know each other better. After a few minutes, Chris realizes that they're starting to hit it off and he really likes Jamie.

    Summers-Stay settings. Success

  108. Edward loses his job, and needs to start counting his pennies; he can barely pay his rent. But he still loves cars, and to cheer himself he puts on his nicest clothes, and goes to the neighborhood where all fancy cars are. He talks one of the salespeople into letting him test drive a sports car and it accelerates like a dream. He absolutely loves it. He really wants to buy it. The only problem, he realizes, is that he doesn't have the money. He's not sure what to do.

    Temperature=0. Success

  109. Edward loses his job, and needs to start counting his pennies; he can barely pay his rent. But he still loves cars, and to cheer himself he puts on his nicest clothes, and goes to the neighborhood where all fancy cars are. He talks one of the salespeople into letting him test drive a sports car and it accelerates like a dream. He absolutely loves it. He really wants to buy it. The only problem, he realizes, is that he doesn't have the money. Edward is a pretty good guy.

    Summers-Stay settings. Success

  110. Sam is running for President. All day long Sam kisses babies and pretends to like objectionable people, all the while wondering if it's all worthwhile. The voters are not impressed. The polls are low. Eventually Sam decides that it is not in fact worthwhile, and decides to it is time to move on. The next thing you know, Sam holds a press conference, to announce that he is dropping out of the race.

    The press conference is a disaster.

    Temperature=0. Success

  111. Sam is running for President. All day long Sam kisses babies and pretends to like objectionable people, all the while wondering if it's all worthwhile. The voters are not impressed. The polls are low. Eventually Sam decides that it is not in fact worthwhile, and decides to it is time to move on. The next thing you know, Sam holds a press conference, to announce that he is dropping out of the race. He explains that he has come to realize that the campaign has taken him away from what is really important in life, and that while he is flattered by the attention, he would like to return to his life.

    Summers-Stay settings. Success

    Social reasoning

  112. You go with Penny to the store to buy a present for Jack. You say, "I will buy Jack a top". "Don't get Jack a top," says Penny. "He already has a top. He will make you a top.

    Temperature=0. Failure. GPT-3 misunderstands what is involved in giving presents. There is no reason that Jack would make you a top, and for that matter a top is not an easy thing for a child to make. This example was drawn directly from Eugene Charniak's Ph.D. thesis (1972); nearly fifty years later, it remains outside the scope of AI natural language technology.

  113. You go with Penny to the store to buy a present for Jack. You say, "I will buy Jack a top". "Don't get Jack a top," says Penny. "He already has a top. He will make you feel bad if you give him a top.

    Summers-Stay settings. Success

    We repeated this with a slightly different prompt, thinking that ending with "make you" might be unfairly difficult to continue.

  114. Janet and Penny went to the store to get presents for Jack. Janet said, "I will get Jack a top." "Don't get Jack a top," said Penny. "He has a top. He will get a top." "I will get Jack a top," said Janet

    Temperature=0. Failure. Within a single sentence, GPT-3 has lost track of the fact that Penny is advising Janet against getting a top because Jack already has a top. The intended continuation was “He will make you take it back.” (or, e.g., make you “exchange it”).

  115. Janet and Penny went to the store to get presents for Jack. Janet said, "I will get Jack a top." "Don't get Jack a top," said Penny. "He has a top. He will  not like it." "I will get Jack a top," said Janet. "He has a top, but he won't have one when I'm through with him."

    Summers-Stay settings. Failure. Janet's threat at the end is entertaining but meaningless.

  116. Almanzo turned to Mr. Thompson and asked, "Did you lose a pocketbook?" Mr. Thompson jumped. He slapped a hand to his pocket and fairly shouted, "Yes, I have! Fifteen hundred dollars in it, too!

    "I was in the store and I took it out to get some change. I laid it on the counter, and when I came back, it was gone. I've been hunting for it ever since."

    Temperature=0. Success

  117. Almanzo turned to Mr. Thompson and asked, "Did you lose a pocketbook?" Mr. Thompson jumped. He slapped a hand to his pocket and fairly shouted, "Yes, I have! Fifteen hundred dollars in it, too!

    "Who took it?" Almanzo demanded.

    "Nobody took it," Mr. Thompson said. "I put it down on the table and when I turned around, it was gone."

    Almanzo glanced around the room. "Is this the only door?" he asked.

    "Yes, sir," Mr. Thompson said.

    "What did you do after you discovered the loss?"

    "I called for the bartender. He looked around for a while and said, 'Someone has probably slipped in in their pocket.' He suggested that I put up a reward and offered to watch the crowd to see if anyone came forward."

    "What did you do?" Almanzo asked.

    "I went home."

    "You did not put up a reward?"

    "No, sir, I am a single man. I did not want to advertise the loss of the money. I did not think I would ever see it again."

    "What time did you discover the loss?"

    "About two hours ago."

    "Why did you come here?"

    "I heard that you were a detective. I thought you might be able to find it."

    Summers-Stay settings. Failure. This reads smoothly but is not actually coherent. Assume that Almanzo is indeed a detective. Where is this taking place? If this is in Almanzo's office, what is the point of his looking around and asking whether "this" is the only door? And if they are in the bar, why would Mr. Thompson come there to consult with him? If the wallet was stolen, how does Almanzo come to have it? Why does Mr. Thompson say, "Nobody took it"? What does Mr. Thompson's being a single man have to do with his not offering a reward?

    The prompt is from Farmer Boy by Laura Ingalls Wilder, and is discussed in Rebooting AI.

  118. George found a wallet in the street with $800 in it. AlmanzoGeorge’s father guessed that it belonged to Mr. Harris. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, in order to check if his wallet was still there.

    “Yes, I did.

    Temperature=0. Success

    Unfortunately, we did not notice the typo "AlmanzoGeorge" for "George" until we had finished running examples.

  119. George found a wallet in the street with $800 in it. AlmanzoGeorge’s father guessed that it belonged to Mr. Harris. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, in order to make sure that his wallet was still there.

    “Why, yes,” he said.

    Summers-Stay settings. Success

  120. You are going for a walk. You've gotten tired of the coat you're wearing, and you're planning to buy a new one. You run into your friend Grace, who has on a jacket that is just amazing.

    "My God, where did you get that coat?" says Grace. "I love it."

    Since, luckily, the two of you are the same height and build, you decide to try it on. It fits perfectly.

    Temperature=0. Flawed. The most reasonable continuation here would be that you arrange with Grace to trade the coat for the jacket. If you are going to try on something, it would have to be Grace's jacket. But, first, that is not something that you can simply decide, unilaterally — you would need Grace's approval — and second, as pronouns are used in English, it would be a stretch to have "it" refer to the jacket.

  121. You are going for a walk. You've gotten tired of the coat you're wearing, and you're planning to buy a new one. You run into your friend Grace, who has on a jacket that is just amazing.

    "My God, where did you get that coat?" says Grace. "I love it."

    Since, luckily, the two of you are the same height and build, you offer to give her your coat, which she accepts.

    You go to a store with a collection of coats you like, and you try on every one of them.

    Summers-Stay settings. Success

  122. You are going for a walk. You've gotten tired of the dress you're wearing, and you're planning to buy a new one. You run into your friend Grace, who has on a skirt that is just amazing.

    "My God, where did you get that dress?" says Grace. "I love it."

    Since, luckily, the two of you are the same height and build, you ask Grace if she would mind if you tried on her skirt.

    "Of course not," says Grace.

    Temperature=0. Flawed. Grace can hardly take off her skirt for you to try it on while the two of you are out walking, and if you are going somewhere more private, it would be natural for someone to say so.

  123. You are going for a walk. You've gotten tired of the dress you're wearing, and you're planning to buy a new one. You run into your friend Grace, who has on a skirt that is just amazing.

    "My God, where did you get that dress?" says Grace. "I love it."

    Since, luckily, the two of you are the same height and build, you instantly hand over your credit card. Grace buys the dress, and you buy a new one.

    Summers-Stay settings. Failure. There is no reason for anyone to buy anything, and there is certainly no reason for you to hand over your credit card for the purpose; and more than likely you haven’t carried along a credit card reader for your walk.

  124. Tessa decided that the time had come for her to get a dog, so she headed over to the animal shelter to find one to adopt. On the way she ran into her friend Grace. Grace was walking an otter hound who bounded up to her to be petted. Tessa never seen anything so adorable.

    “What a wonderful dog!” she exclaimed.

    “Isn’t he?” Grace answered. “But we have to get rid of him. We just got him two days ago, and it turns out that Katie is severely allergic. She loves the dog, and this will break her heart, but she has spent the last two days sneezing.”

    Tessa asked, “What are you going to do with him?”

    “We’re going to take him to the animal shelter.”

    “Oh, no!” Tessa cried.

    Temperature=0. Flawed. This continuation is not impossible, but a much more reasonable one would be that Tessa asks whether she can have the dog.

  125. Tessa decided that the time had come for her to get a dog, so she headed over to the animal shelter to find one to adopt. On the way she ran into her friend Grace. Grace was walking an otter hound who bounded up to her to be petted. Tessa never seen anything so adorable.

    “What a wonderful dog!” she exclaimed.

    “Isn’t he?” Grace answered. “But we have to get rid of him. We just got him two days ago, and it turns out that Katie is severely allergic. She loves the dog, and this will break her heart, but she has spent the last two days sneezing.”

    Tessa asked, “How much does he cost?”

    “Only $200,” Grace answered.

    “That’s not too bad,” Tessa replied.

    Summers-Stay settings. Flawed. As above.

  126. Pete and Martin have known each other since they were in architecture school together. Pete’s career never took off. Jobs came only occasionally; his income comes mostly from teaching at a community college, Martin, on the other hand, has had a spectacular career as an architect. He has designed buildings and public spaces all over the globe; he is rich and famous, with a long list of prizes to his name. As a result, in dealing with Pete, Martin tends to be condescending and patronizing. Pete, on the other hand, is resentful and bitter.

    Temperature=0. Success

  127. Pete and Martin have known each other since they were in architecture school together. Pete’s career never took off. Jobs came only occasionally; his income comes mostly from teaching at a community college, Martin, on the other hand, has had a spectacular career as an architect. He has designed buildings and public spaces all over the globe; he is rich and famous, with a long list of prizes to his name. As a result, in dealing with Pete, Martin tends to be patronizing, even patronizing his own father. At the same time, Pete is sometimes bitter, even hostile towards Martin.

    Summers-Stay settings. Success

  128. You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear the bathing suit to court.

    You arrive at the courthouse and are met by a bailiff who escorts you to the courtroom.

    Temperature=0. Failure. The phrase “however your bathing suit is clean … ” seems to have led GPT-3 into supposing that a bathing suit is a viable alternative. Of course, in reality, no lawyer would consider it except as a fantasy. The bailiff would probably not admit you into the court, and if they did, the judge might well hold you in contempt.

  129. You are a defense lawyer and you have to go to court today. Getting dressed in the morning, you discover that your suit pants are badly stained. However, your bathing suit is clean and very stylish. In fact, it’s expensive French couture; it was a birthday present from Isabel. You decide that you should wear your bathing suit to court instead of your suit pants. Why? Because it’s clean, it’s stylish, and it’s your only other option.

    Summers-Stay settings. Failure. As above

  130. You are a defense lawyer, with a trial taking place today. You’ve been delayed so you arrive at the courtroom just a couple of minutes before today’s session starts. You rush to your seat, muttering apologies. You look up at the judge. It’s the wrong judge. You look at the defendant. That’s not your client. You realize with a shock that you’re in the wrong courtroom. You’re in the wrong trial.

    Temperature=0. Success

  131. You are a defense lawyer, with a trial taking place today. You’ve been delayed so you arrive at the courtroom just a couple of minutes before today’s session starts. You rush to your seat, muttering apologies. You look up at the judge. It’s the wrong judge. You look at the defendant. That’s not your client. You realize with a shock that you are in the wrong courtroom.

    The judge is about to begin the session.

    Summers-Stay settings. Success

  132. Alice tells almost nobody about her struggles with anorexia when she was in her teens and early twenties. But Susan caught a hint of it in something that she once said, and then, by asking around among the people who knew Alice at the time, she eventually pieced together the whole story. Susan is very close to Alice, and she is also a therapist, so she is able to talk to her about it.

    Alice is a very private person, and she is not comfortable talking about her eating disorder.

    Temperature=0. Success.

  133. Alice tells almost nobody about her struggles with anorexia when she was in her teens and early twenties. But Susan caught a hint of it in something that she once said, and then, by asking around among the people who knew Alice at the time, she eventually pieced together the whole story. Susan is very protective of Alice, and Alice is very protective of herself, so Susan had to tread carefully as she told me the story. “Alice thinks that she is unique in this struggle, and she feels very alone,” Susan said.

    Summers-Stay settings. Success

  134. Bob paid for Charlie's college education, but now Charlie acts as though it never happened. Charlie is very disrespectful to Bob. Bob is very upset about this.

    Temperature=0. Success

  135. Bob paid for Charlie's college education, but now Charlie acts as though it never happened. Charlie is very rude to Bob, and if the situation is not corrected, Bob says he will take action. What is your advice to Bob?

    ANSWER:

    I'm going to assume that Charlie and Bob are not family members.

    Summers-Stay settings. Failure. What action could Bob take?

  136. At the party, Maeve was chatting happily with her friends, but Nancy, her mother, was frantically trying to stop her. Eventually, Maeve understandably lost patience with it. She turned to her mother and said, “I’m going to go to the bathroom.”

    “No, you’re not,” Nancy said.

    Temperature=0. Failure. This would hardly be a reasonable thing for Maeve to say.

  137. At the party, Maeve was chatting happily with her friends, but Nancy, her mother, was frantically trying to stop her. Eventually, Maeve understandably lost patience with it. She turned to her mother and said, “If you want to talk to me, can you give me a little space? I’m having a conversation with my friends here.”

    Her mother’s response? “I’m just doing my job as a parent, Maeve.

    Summers-Stay settings. Success

  138. George found a wallet in the street with $800 in it. George’s father guessed that it belonged to Mr. Harris. George’ father said that Mr. Harris was extremely stingy. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”

    “Is this it?” George asked.

    "Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money.

    Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t steal any of it.”

    He looked at George and handed him a dollar bill. “Here, son, this is for your trouble.

    Temperature=0. Success. This, obviously, is the same passage from Farmer Boy as in examples 116-117 with the names changed.

  139. George found a wallet in the street with $800 in it. George’s father guessed that it belonged to Mr. Harris. George’ father said that Mr. Harris was extremely stingy. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”

    “Is this it?” George asked.

    "Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money.

    Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t steal any of it.”

    He looked at George and handed him a dollar bill. “Here,” he said.

    Summers-Stay settings. Success

  140. George found a wallet in the street with $800 in it. George’s father guessed that it belonged to Mr. Harris. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”

    “Is this it?” George asked.

    "Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money. He counted all the bills over twice.

    Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t steal the money.”

    “I’m not a durn boy,” George said.

    Temperature=0. Success

  141. George found a wallet in the street with $800 in it. George’s father guessed that it belonged to Mr. Harris. George took the wallet to return to Mr. Harris. Mr. Harris worked at a bank.

    George asked Mr. Harris “Did you lose your wallet?”

    Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”

    “Is this it?” George asked.

    "Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money. He counted all the bills over twice.

    Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t keep the money.”

    George said, “I’ll keep an eye out for your wallet from now on.

    Summers-Stay settings. Failure. Since Mr. Harris has recovered his wallet, there is no need for George to keep an eye out for it.

    Tracking individuals

  142. Yesterday I dropped my clothes off at the dry cleaners and have yet to pick them up. Where are my clothes?

    I have a lot of clothes.

    Temperature=0. Failure. GPT-3 is evading the question and failing to track the locations of objects. The same question, with the same issue in GPT-2, was reported in Marcus 2020.

  143. Amy, Betty, Charlie, David, Ezra, Flip, and George go to a party; Betty and Charlie got bored; Ezra and Flip go out to get beer. The only people left are Amy and David. Amy and David go to the bedroom.

    Temperature=0. Failure. GPT-3 has lost track of George. This, too, is an extension of an example in Marcus 2020.

  144. Amy, Betty, Charlie, David, Ezra, Flip, and George go to a party; Betty and Charlie got bored; Ezra and Flip go out to get beer. The only people left are Amy and David. They have a conversation, and David tries to convince Amy that she should sleep with him.

    Summers-Stay settings. Failure. As above
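    The bookkeeping behind examples 143-144 can be sketched as simple set subtraction. This is only an illustration of the tracking the comment has in mind, not anything GPT-3 does internally. The variable names are ours, and whether Betty and Charlie actually leave is an assumption, since the text says only that they "got bored". George remains at the party under either reading.

```python
# Minimal sketch: who is still at the party in examples 143-144?
guests = {"Amy", "Betty", "Charlie", "David", "Ezra", "Flip", "George"}

# Ezra and Flip explicitly go out to get beer.
went_out = {"Ezra", "Flip"}

# ASSUMPTION: the text says only that Betty and Charlie "got bored";
# we compute the result both with and without them leaving.
bored = {"Betty", "Charlie"}

remaining_if_bored_stay = guests - went_out
remaining_if_bored_leave = guests - went_out - bored

print(sorted(remaining_if_bored_stay))
print(sorted(remaining_if_bored_leave))
```

    In both cases "Amy and David" is not the full set of people left, because George never departs.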

  145. Peter has invited Mary, Sid, and Imogene for dinner at his house. Peter and Mary are waiters, Sid and Imogene are nurses. The nurses arrive promptly at eight. After waiting fifteen minutes they get a call from Peter. He tells them that he and Mary are stuck in traffic and will be late.

    Temperature=0. Failure. Peter is presumably at his house. It would be Mary who calls.

  146. Peter has invited Mary, Sid, and Imogene for dinner at his house. Peter and Mary are waiters, Sid and Imogene are nurses. The nurses arrive promptly at eight. After waiting fifteen minutes they get a call from Peter, who explains that he is running late. By nine-thirty the nurses are getting hungry.

    Summers-Stay settings. Failure. As above.

  147. Jack, Laura, Barbara, and Robert went out to dinner. Jack ordered flounder, Laura ordered pasta, Barbara and Robert ordered one pizza that they would share. After twenty minutes, the food was served. “Bon Appetit!” said Barbara. “How’s the fish?” said Robert to Jack. “It’s good,” said Jack.

    Temperature=0. Success

  148. Jack, Laura, Barbara, and Robert went out to dinner. Jack ordered flounder, Laura ordered pasta, Barbara and Robert ordered one pizza that they would share. After twenty minutes, the food was served. “Bon Appetit!” said Barbara. “How’s the fish?” said Robert to Jack. “It’s delicious,” said Jack.

    Summers-Stay settings. Success

  149. Four sailors named John, Mary, Elizabeth and Susan walk into a bar.

    Mary and Elizabeth, who happen to be old friends, get into a long conversation about seashells and dragons, and two fictional characters named Prince Arthur and Lord Dazzle. After they finish their conversation, the two old friends depart.

    The two sailors who are left are named John and Susan.

    John and Susan are both very lonely.

    Temperature=0. Success

  150. Four sailors named John, Mary, Elizabeth and Susan walk into a bar.

    Mary and Elizabeth, who happen to be old friends, get into a long conversation about seashells and dragons, and two fictional characters named Prince Arthur and Lord Dazzle. After they finish their conversation, the two old friends depart.

    The two sailors who are left are named John and Susan. John asks Susan for a drink, and she agrees to get one.

    Summers-Stay settings. Success

  151. There are four items on my desk. A pair of scissors, a pen and a pencil, and an erator. After putting away the scissors and the pencil, the only things left are the pen and the eraser. How many items are on my desk?

    Answer: 2

    Temperature=0. Success.

    Arithmetic

  152. There are six frogs on a log. Two leave, but three join. The number of frogs is now seven.

    Temperature=0. Success

  153. A jury selection process draws sixty potential jurors. 38 potential jurors are released, resulting in a jury of  

    Page 133

    twelve. Of the remaining forty-two potential jurors, seven are dismissed for cause, resulting in a jury of thirty-five. Then.

    Temperature=0. Failure

    There was an invisible typo in this prompt: an additional blank space at the end. GPT-3 is ridiculously sensitive to these, because of the tokenizer it uses. When the blank space was removed, the result (below) was much better.

  154. A jury selection process draws sixty potential jurors. 38 potential jurors are released, resulting in a jury of 22. Of those 22, two jurors are removed from the juror pool for medical reasons.

    Temperature=0. Success.

  155. I put 15 trophies on a shelf. I sell five, and add a new one, leaving a total of 15 trophies on the shelf.

    Temperature=0. Failure. This failure certainly takes the shine off the success of examples 152 and 154.
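    For reference, the expected counts in examples 152, 154, and 155 can be checked with a few lines of arithmetic. This is a minimal sketch; the helper name count_after is ours and purely illustrative.

```python
def count_after(start, removed, added=0):
    """Return how many items remain after some are removed and some are added."""
    return start - removed + added

frogs = count_after(6, removed=2, added=3)       # example 152: frogs on the log
jurors = count_after(60, removed=38)             # example 154: potential jurors remaining
trophies = count_after(15, removed=5, added=1)   # example 155: trophies on the shelf

print(frogs, jurors, trophies)
```

    The first two values match GPT-3's continuations (seven and 22); the third does not match the 15 that GPT-3 produced in example 155.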

  156. Patty, Joanna, and Sally are sisters. Patty is 10; Joanna is 7, Sally is 5. Their grandparents gave them each two dolls for Christmas. But Patty considers herself too grown up for dolls, so, after the grandparents have gone, she gives one doll to Joanna and one to Sally. Joanna decides that she’s also grown up, so she gives two dolls for Sally, but keeps one for herself, because it’s a really beautiful doll. So, this Christmas, Sally has gotten a total of three dolls, and Joanna has gotten a total of two dolls. How many dolls did Patty get?

    The answer is that Patty got one doll.

    Temperature=0. Failure. GPT-3 managed to get all three wrong. Patty ended up with no more dolls, Joanna got one, and Sally got five.

  157. Patty, Joanna, and Sally are sisters. Patty is 10; Joanna is 7, Sally is 5. Their grandparents gave them each two dolls for Christmas. But Patty considers herself too grown up for dolls, so, after the grandparents have gone, she gives one doll to Joanna and one to Sally. Joanna decides that she’s also grown up, so she gives two dolls for Sally, but keeps one for herself, because it’s a really beautiful doll. So, this Christmas, Sally has gotten a total of 2 dolls and Joanna has gotten 3 dolls, but Patty has only 1 doll. Now, this is the question: How many dolls were given to the girls by their grandparents?

    I’m not going to draw a picture, because I’m lazy.

    Summers-Stay settings. Failure, as above.
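    The tally stated in the comment on example 156 can be reproduced by replaying the gifts step by step. This is a minimal sketch; the give helper and the dictionary layout are ours, chosen only for illustration.

```python
# Each sister starts with the two dolls given by the grandparents.
dolls = {"Patty": 2, "Joanna": 2, "Sally": 2}
given_by_grandparents = sum(dolls.values())

def give(holdings, giver, receiver, n):
    """Move n dolls from giver to receiver, refusing impossible transfers."""
    if holdings[giver] < n:
        raise ValueError(f"{giver} has only {holdings[giver]} dolls")
    holdings[giver] -= n
    holdings[receiver] += n

give(dolls, "Patty", "Joanna", 1)   # Patty gives one doll to Joanna
give(dolls, "Patty", "Sally", 1)    # ... and one to Sally
give(dolls, "Joanna", "Sally", 2)   # Joanna gives two to Sally, keeping one

print(dolls, given_by_grandparents)
```

    The final holdings are none for Patty, one for Joanna, and five for Sally, with six dolls given by the grandparents in all.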

    References

    Eugene Charniak (1972). Toward a Model of Children's Story Comprehension. Ph.D. thesis, MIT.

    Gary Marcus (2020). “GPT-2 and the nature of intelligence”. The Gradient, January 25, 2020.

    Gary Marcus and Ernest Davis (2019). Rebooting AI: Building Artificial Intelligence We Can Trust. Pantheon Press.