Ernest Davis and Gary Marcus
June 7, 2022.
In August 2020, we published a paper in Technology Review, "GPT-3, Bloviator: OpenAI's language generator has no idea what it's talking about", which analyzed the failures of GPT-3 at commonsense reasoning. As supplementary material we ran a set of 157 tests through GPT-3, with about 80 distinct examples, almost all at two different parameter settings. In examining GPT-3 continuations to our prompts, we found 71 successes, 70 failures, and 16 flawed responses.
Recently Scott Alexander, in his blog Astral Codex Ten, suggested that current AI technology is on a clear path toward human-level AI. In particular, Alexander took six of the prompts from the Technology Review article on which GPT-3 was reported to have failed, and he ran them through the current, improved version of GPT-3, now publicly available. He found that GPT-3 gave correct answers on 4, a questionable answer on 1, and a wrong answer on 1.
Gary has written a general response to Alexander on Substack. Additionally, we felt it would be worthwhile to run the whole set of examples through the current version of GPT-3, to see if there had been any overall improvement.
We ran 80 examples in our experiment. (Unlike in 2020, we used only one parameter setting. We also eliminated a few that were run twice in 2020 because of a mistyping on the first attempt, and we corrected one or two misspellings that had escaped our notice.) The examples were run through the "GPT-3 Playground" web-based interface, which is publicly available, using the "text-davinci-002" version and the default settings (temperature = 0.7, maximum length = 200).
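The runs reported here were done by hand in the Playground, not through code. For readers who would rather script a comparable run, the settings above might be captured as a plain parameter dictionary. This is only a sketch: `build_request` is a hypothetical helper, and mapping the Playground's "maximum length" onto a `max_tokens` key is an assumption, not something stated in this article. Nothing is sent over the network.

```python
# Hypothetical helper: collects the settings described above into one dictionary.
# The key names ("model", "prompt", "temperature", "max_tokens") are assumed to
# match the completions-style request format; the article itself only names the
# Playground settings.
def build_request(prompt: str) -> dict:
    return {
        "model": "text-davinci-002",  # version named in the text
        "prompt": prompt,             # one test prompt per request
        "temperature": 0.7,           # default setting reported in the text
        "max_tokens": 200,            # assumed equivalent of "maximum length = 200"
    }


request = build_request(
    "There are six frogs on a log. Two leave, but three join. "
    "The number of frogs is now"
)
print(request["model"], request["temperature"], request["max_tokens"])
```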
There has certainly been some improvement in GPT-3 over the last two years. The 2022 version is much less apt than the 2020 version to give risibly bizarre answers or to wander off onto strange tangents. The contentive errors tend to be subtler and the response largely sticks to the subject. The 2022 version also seems to be much less variable than the 2020 version, even at the rather high "temperature" (i.e. randomization) parameter value 0.7. For example, both Alexander and we tested the far-fetched scenario of a lawyer in a bathing suit (#64 below) and we obtained extremely similar answers; in fact, the first 15 words of the two responses differ only in replacing the phrase "court attire" with "attire for court". GPT-3 is still occasionally subject to echoing the prompt (#22) and occasionally breaks off strangely in the middle and resumes (#9).
However, if the responses are graded on a strict scale, in which an example is considered a "success" only if it has no contentive errors and no significant stylistic errors; a "failure" if it has a clear-cut contentive error or is in some way incoherent; and a "flawed response" if some of the content is far-fetched but not absolutely wrong or there is a significant stylistic error; then there has been only a slight improvement on this set of examples between 2020 and 2022. The tally is tabulated below.
Year | # of tests | Success | Failure | Flawed |
---|---|---|---|---|
2020 | 157 | 71 = 45% | 70 = 45% | 16 = 10% |
2022 | 80 | 40 = 50% | 32 = 40% | 8 = 10% |
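The percentages in the table follow directly from the raw counts. A minimal sketch of that arithmetic, using only the counts given above (the names `TALLY` and `percentages` are illustrative):

```python
# Counts copied from the table above.
TALLY = {
    2020: {"success": 71, "failure": 70, "flawed": 16},
    2022: {"success": 40, "failure": 32, "flawed": 8},
}


def percentages(counts):
    """Return each grade's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {grade: round(100 * n / total) for grade, n in counts.items()}


for year, counts in TALLY.items():
    # Prints the year, the number of tests, and the rounded shares.
    print(year, sum(counts.values()), percentages(counts))
```

On these counts the success rate moves from 45% to 50% and the failure rate from 45% to 40%, which is the slight improvement described above.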
There are some examples that GPT-3 got wrong in 2020 and got right in 2022, but there are also examples of the reverse. For example, GPT-3 completed the prompt (#77 below) "There are six frogs on a log. Two leave, but three join. The number of frogs is now" correctly as "seven" in 2020, but incorrectly as "5" in 2022.
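For reference, the arithmetic behind the frog prompt is a single subtraction and addition. The helper name `frogs_on_log` is purely illustrative:

```python
def frogs_on_log(start: int, leave: int, join: int) -> int:
    """Frogs remaining on the log after some leave and others join."""
    return start - leave + join


print(frogs_on_log(6, 2, 3))  # 7
```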
It should also be noted that, by only testing examples that GPT-3 got wrong in 2020, Alexander was tipping the scales; he could only detect improvement, not deterioration. Machine-learning-based technology does tend to improve on average on whatever metric its creators consider important --- if it's not an improvement, it doesn't get released --- but it is by no means the case that it improves monotonically and that, if a problem can be correctly solved by one version of a program or one generation of a technology, it will continue to be solved by all later versions. No one is aiming at that or measuring that, and machine learning architectures do not guarantee it or even enable it.
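The selection effect can be made concrete with a toy comparison. The grades below are invented for illustration and are not the results reported in this article; the point is only that a sample restricted to earlier failures can never show a regression, whatever the newer system does.

```python
# Hypothetical grades for six made-up test items (NOT the article's data).
# True means the system answered correctly.
GRADES_OLD = {"a": False, "b": False, "c": False, "d": True, "e": True, "f": True}
GRADES_NEW = {"a": True, "b": True, "c": False, "d": True, "e": False, "f": False}


def compare(old, new, items):
    """Count items that went wrong->right (improved) and right->wrong (regressed)."""
    improved = sum(1 for i in items if not old[i] and new[i])
    regressed = sum(1 for i in items if old[i] and not new[i])
    return improved, regressed


all_items = list(GRADES_OLD)
old_failures = [i for i in all_items if not GRADES_OLD[i]]

print(compare(GRADES_OLD, GRADES_NEW, all_items))     # (2, 2)
print(compare(GRADES_OLD, GRADES_NEW, old_failures))  # (2, 0)
```

In this toy data the full set shows two improvements and two regressions, while the subset of earlier failures shows only the two improvements.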
Individual tests and their results follow below. The system, of course, is not deterministic and some variation will occur if these experiments are rerun.
You take a sip of your drink, and immediately realize that something is wrong. The drink is incredibly sour and bitter, and you can taste the bleach. You quickly spit it out and dump the rest of the drink down the drain.
If you had ingested a significant amount of the bleach, you would have experienced burning in your throat and esophagus, as well as nausea and vomiting. In severe cases, bleach ingestion can lead to abdominal pain, diarrhea, and even death.
Success
The suntan lotion doesn't seem to have any effect on the cranberry juice. It still smells and tastes the same. However, if you are allergic to suntan lotion, you may have a reaction to it.
Failure
You take a sip of your drink, and immediately realize that something is wrong. The lubricant has made the drink extremely sweet and syrupy. You gag and spit it out, realizing that you just made a huge mistake.
Success
It's not going to taste great, but if you're really thirsty it will probably be fine.
Flawed. This was one of the examples Scott Alexander ran in his blog. He got a different answer that was clearly correct. The idea that a small amount of grape juice will cause a glass of cranberry juice to not taste great is somewhat far-fetched but not completely impossible.
You are a citizen of the world. You have a passport from the United Nations that allows you to travel anywhere in the world. You speak eight languages fluently, and are learning two more. You are an excellent student, and love to learn new things.
You are a talented musician, and enjoy playing the piano and violin. You are also a gifted artist, and enjoy painting and sculpting.
You are a kind and compassionate person, and are always quick to help others. You are a natural leader, and people are drawn to your optimistic and positive attitude.
You are a true global citizen, and believe that we are all connected, no matter where we come from. You are excited to learn about new cultures and to make new friends from all over the world.
Failure
Some of the toy soldiers might break, but others might not. It all depends on how the glass bottle is broken and how the toy soldiers are positioned inside.
Flawed
If you don't have a spoon to stir your coffee, you can use a fork or a chopstick.
Failure
You could try using a straw, or you could ask the waiter for a spoon.
Failure
In this case, I would probably ask the waiter for a spoon. If they don't have one, I would try to eat the soup with my fork.
Failure.
In this case, you would probably ask the waiter for a spoon so that you can eat your soup. If they don't have any spoons, you may ask for a fork or a straw.
Failure (fine except for the suggestion that you ask for a fork).
What do you do?
If the restaurant is still open, you could ask the waiter or waitress for a spoon. If the restaurant is closed or you don't want to ask for a spoon, you could try using a fork to eat the soup.
Failure
In order to move the sack of potatoes, you would need to lift it off of the bag of flour.
Success
Get a new sack of flour
Success. Note: The three dots are GPT-3's
You will need to purchase more flour from the store.
Success
You put the flour on the counter and start to measure out what you need for the bread, but you can't seem to focus. Your mind keeps going back to the hat. You wonder where it has been all this time and how it ended up on top of the flour. Eventually, you give up on the bread and just sit at the table, thinking about the hat.
Flawed
This is an example of a buried treasure.
Flawed
You might be able to salvage some of the flour if you can sift out the detergent. Or you might be able to wash the detergent off of the flour. But it would be easier to just get a new sack of flour.
Failure
Success. This was one of the examples Scott Alexander ran in his blog. Like us, he got a correct answer.
Success
Failure
First, clear a path from the dining room to the living room. Second, remove any items that are on or near the dining room table. Third, turn the dining room table so that it is perpendicular to the doorway. Fourth, carry the dining room table into the living room. Finally, set up the dining room table in the living room.
Success. This was one of the examples Scott Alexander ran in his blog. Like us, he got a correct answer.
A single dog cannot achieve it, even one much speedier and more enduring than Bashan.
The hare can “double” and Bashan cannot.
And that is all there is to it.
Flawed
If you have a pair of pants that are loose enough to not need a belt, you can just wear those. Otherwise, you'll have to go without a belt.
Failure
If you have a pair of pants that fit better without a belt, you could wear those. Otherwise, you could try to find a temporary fix for your belt, like a safety pin, until you can get a new one.
Success
Success
Your mom is going to be so mad if she has to wash your clothes again.
What do you do?
There are a few things that you can do in this situation. You could try to find a pair of pants that are a little bit bigger than you so that they will fit better. You could also try to find a pair of pants that are made out of a different material so that they won't show the dirt as easily. Finally, you could try to talk to your parents about getting you a new pair of pants.
Failure
Jack rushed down the stairs to the ground floor, but by the time he got there Sheila was already dead.
Success
I made it to the station with a few minutes to spare.
Success
I left the meeting and rushed to the train station. I got there with a few minutes to spare and hopped on the train. I was relieved that I had made it in time.
Success
Failure
The photograph was taken no earlier than 1928.
Failure
The photograph was taken in Paris in 1918.
Success
You have plenty of time, so you decide to go by the beach. You find the beach and begin walking north. After a while, you see a sign that reads “Welcome to Newport.” You know you’re getting close to the bridge.
Success
You set out along route 202 and start walking at a brisk pace. You get to thinking about your daughter and the trouble she’s having getting used to college life, and you stop paying attention. Suddenly you look up, because there, in front of you, is a five story, granite building with a sign, “Order of the Masons: Lodge 575”. You’ve walked right into the center of Gloucester.
Assuming you want to get to Newport, the fastest way would be to continue north on Route 202.
Failure
1. Find a box that is big enough for the plate.
2. Cut out a piece of cardboard that is the same size as the plate.
3. Place the plate upside down on the cardboard.
4. Trace around the plate with a pencil.
5. Cut out the circle from the cardboard.
6. Place the cardboard circle on top of the plate.
7. Put the plate and cardboard circle in the box.
Success
You will have to use packing materials to protect the plate and keep it from moving around inside the box.
Flawed
A. Fold the fabric in half B. Roll the fabric into a tube C. Cut the fabric into a square
Failure
One option would be to find a different bureau drawer that is wider and can accommodate the puzzle. Another option would be to look for a different storage solution entirely, such as a storage box or shelf.
Success
a black disk in the sky where the sun should be. She can also see the stars shining around the edge of the disk.
Failure
Violet knows that when the umbra touches the Moon's surface, the Moon will be in total darkness for about an hour. She has been instructed by her superiors on Earth to stay inside her lunar module during this time, as it will be too dangerous to be outside.
Violet is excited to see the eclipse, but she is also a bit scared. She has never been in total darkness before, and she doesn't know what to expect. She decides to stay inside her lunar module and wait it out.
As the umbra touches the Moon's surface, the sky grows dark. Violet can see the Earth in the sky, but it is now a deep red color. She can see the stars shining in the sky, but they are much dimmer than usual.
The darkness is complete. Violet can't see anything, not even her hand in front of her face. She feels disoriented and scared. She wants to scream, but she doesn't want to attract attention to herself.
She hears a noise outside her lunar module. Something is moving around out there. She doesn
Failure
Violet is excited to see the eclipse. She knows that this is a once in a lifetime event. She knows that she is lucky to be able to see it.
Failure
Violet: Wow! I can see the Earth's shadow on the Moon. It's so cool!
Failure
You are sitting in a theater, and you have a great view of the stage, because you carefully sat behind someone who is short.
Flawed
“Hey,” he said. “What are you doing here all alone?”
“Just enjoying the view,” she said.
“It is a nice view,” he said. “But it’s even nicer with company.”
He sat down next to her and they talked and laughed together as the sun set.
Success
Mary, can you tell me the address of the library?
Success
One day, the cat saw the mouse dart into the hole, so it quickly pounced on the hole, but the mouse was gone. The cat put its paw in the hole and felt around, but the mouse was nowhere to be found. It was as if the mouse had just vanished into thin air.
The cat was baffled and decided to watch the hole closely to see if it could figure out what was going on.
Sure enough, a few minutes later, the mouse poked its head out of the hole, and the cat immediately pounced. But once again, the mouse was gone in the blink of an eye.
The cat was mystified. It couldn't understand how the mouse was able to disappear so quickly.
It decided to wait and watch the hole again. After a while, the mouse poked its head out, and the cat was ready to pounce. But just as it was about to strike, the mouse disappeared again.
The cat was getting frustrated. It couldn't understand how the mouse was able to escape so easily.
The cat decided to ask the mouse for help.
"Mouse, how are you able to disappear so quickly every time I try
Success
Lucy's high score on the exam is due in part to the fact that she has a better grasp of the material than Anna does. However, it is also due to the fact that Lucy's study habits are more effective than Anna's.
Failure
In this example, Anna is said to have had a higher level of cognitive engagement than Lucy. Cognitive engagement refers to the level of effort, concentration, and focus that a student devotes to learning. The higher the level of cognitive engagement, the greater the student's understanding of the material and the better the student will do on tests and other assessments.
Factors That Affect Cognitive Engagement
One of the most important things that teachers can do to foster cognitive engagement is to create a positive learning environment in their classrooms. A positive learning environment is one in which students feel safe, respected, and valued. It is also one in which students feel that they are capable of succeeding. When students feel good about themselves and their ability to learn, they are more likely to be engaged in the learning process.
Other important factors that affect cognitive engagement include the following:
The amount of time that students spend on task
The level of interest that students have in the material
The level of difficulty of the material
The level of support that students receive from their teachers and classmates
The level of feedback that students receive from their teachers and classmates
The level of motivation
Failure
“See that you keep yourself out of my grip,” he snarled, and hurling the twisted poker into the fireplace he strode out of the room.
“He seems a very amiable person,” said Charles, laughing. “I am not quite so bulky, but if he had remained, I might have shown him that my grip was not much more feeble than his own.” As he spoke, he picked up the poker, and he was hardly surprised to find that it was an ordinary pliant iron bar, bent out of shape by the huge hands of the gigantic negro.
“That is rather a good trick,” he said. “I wonder how he does it.”
“He is full of tricks,” said the girl. “He plays the violin marvellously.”
“Indeed,” said Charles, interested. “I did not know that the negroes had any particular talent in music.”
The girl gave him a sharp glance. “You seem to know a good deal about negroes,” she said.
“I have lived in the West Indies,” said Charles, and the girl became silent again.
“And now,” she said, after a pause, “I hope that you will excuse me, Mr—”
“Holmes,” said Charles.
“Mr. Holmes. I must now leave you. It is nearly eleven, and before long Dr. Jackson will come to fetch me. Good-bye. I am very glad to have had this talk with you.
Failure
Success
Success
Success
They pause when the bartender asks them what they'd like to drink. After that, everyone in the room starts shouting, because the local football team has just made it to the playoffs. Someone offers to buy a round for the whole bar, and there is much rejoicing.
After these distractions, Chris and Jamie quickly return to trying to get to know each other better.
Success
He decides to steal the car. He hotwires it and drives off. He's never stolen anything before, and he's feeling pretty good about himself. He decides to take the car for a joyride. He drives around the city, blasting the stereo and screaming at the top of his lungs. He feels like a king.
Eventually, the police catch up to him and he's arrested. He goes to jail, and his life is ruined. He'll never be able to get a job again, and he'll never be able to afford a car like that. All because he made a stupid, impulsive decision.
Success
"I want to thank all of my supporters for their hard work and dedication," says Sam. "However, I have come to the conclusion that running for President is not for me. I am suspending my campaign effective immediately. Thank you all for your time."
Success
Failure. This was one of the examples that Scott Alexander ran in his blog. Like us, he got an incorrect answer (his was more amusing than ours).
Success
"Carl, what's all the noise?" Mrs. Thompson called from the kitchen.
"Somebody has my pocketbook!"
"What?" Mrs. Thompson cried. She burst into the room, her hair flying out from under her apron. "Has it been stolen?"
"No, it's all right," Mr. Thompson said. "It's been found."
"Oh, thank goodness! I was afraid it was gone for good."
"It's over here, on the shelf."
"Was it here all the time?"
"Evidently."
"Oh, my!" Mrs. Thompson chuckled. "That was quite a shock, I must say. I'll just go back to the kitchen and finish dinner."
Mrs. Thompson left the room. Almanzo asked Mr. Thompson, "Did you look for it in the house?"
"Yes, I looked all over. I thought I must have left it on the counter or something, but it wasn't there. I never thought to look in here."
"I'm glad I could help you," Almanzo said.
"You sure did," Mr. Thompson said. "I
Failure
George asked Mr. Harris “Did you lose your wallet?”
Mr. Harris jumped. He slapped a hand to his pocket, in order to check whether his wallet was still there.
“No, no, I don’t think so,” Mr. Harris said. “What makes you think that it’s mine?”
“Well,” George said, “it’s got your name on it, and it’s got your bank card in it.”
“That’s not possible,” Mr. Harris said. “I never lose my wallet. I’m very careful with it.”
“But, sir,” George said, “here it is.”
“No,” Mr. Harris said. “That can’t be mine. I don’t know what you’re talking about.”
George was very confused. He didn’t know what to do. He didn’t want to get Mr. Harris in trouble, but he also wanted to do the right thing.
Success
"My God, where did you get that coat?" says Grace. "I love it."
Since, luckily, the two of you are the same height and build, you ask Grace if she would mind selling you her coat. She's happy to do so, and you walk away with a new coat for a fraction of the price it would have cost you to buy one new.
Failure
"My God, where did you get that dress?" says Grace. "I love it."
Since, luckily, the two of you are the same height and build, you decide to go to the store where Grace bought her dress and buy one for yourself.
Failure
“What a wonderful dog!” she exclaimed.
“Isn’t he?” Grace answered. “But we have to get rid of him. We just got him two days ago, and it turns out that Katie is severely allergic. She loves the dog, and this will break her heart, but she has spent the last two days sneezing.”
Tessa asked Grace if she could adopt the otter hound, and Grace was thrilled to have found such a perfect home for him. Tessa took him home, and he and Katie have been best friends ever since.
Failure. Doing great until "he and Katie have been best friends ever since".
Pete and Martin have not seen each other in a few years, but they keep in touch via email and social media. Recently, Pete saw on Martin’s Facebook page that he was going to be in town for a few days and he reached out to him, suggesting they get together for lunch. Martin responded that he would be happy to meet up with Pete, but he was only going to be in town for a day and he was very busy. He suggested that they meet for coffee at 3pm.
Pete arrives at the coffee shop at 3pm and sees Martin sitting at a table. He walks over and says hello. Martin barely looks up from his phone and mumbles a hello in return. They order their drinks and then sit in silence for a few moments. Finally, Pete speaks up and says, “So, it’s been a while since we’ve seen each other. How have you been?”
Martin looks up from his phone and says, “I’ve been good. Busy, as usual. You know how it is.”
Pete nods and says, “Yeah, I know how it
Success
Normally, this would not be appropriate attire for court. However, there may be extenuating circumstances that would allow you to wear your bathing suit. For example, if you explain to the court that you had a last-minute emergency and could not find anything else to wear, the court may be understanding. Alternatively, if the weather is extremely hot and humid, the court may allow you to dress down. In any case, it is always best to err on the side of caution and to dress as conservatively as possible for court.
Failure. This is one of the examples that Scott Alexander ran in his blog. He got a very similar answer and gave it half-credit. However, his answer did not include the idea that, if the weather is hot, the court will allow lawyers to appear in bathing suits, which is wholly implausible.
What do you do?
If you are a defense lawyer, with a trial taking place today, and you arrive at the courtroom just a couple of minutes before today's session starts, you should rush to your seat and mutter apologies. If you look up at the judge and realize that it is the wrong judge, you should look at the defendant to see if that is your client. If you realize that you are in the wrong courtroom, you should immediately notify the bailiff or court clerk and ask for guidance on what to do next.
Flawed
Alice is very surprised that Robert knows about her anorexia, as she has never spoken about it to anyone in her family. She is touched that he is concerned for her health and agrees to get help for her eating disorder.
Success
Success
"Mom, I'm fine. I'm just talking to my friends. Please stop worrying."
Success
George asked Mr. Harris “Did you lose your wallet?”
Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”
“Is this it?” George asked.
"Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money.
Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t steal any of it.”
He looked at George and handed him five one dollar bills.
“Here, son. You found this, and you’re an honest boy, so you deserve a reward.”
George was so angry, he wanted to scream.
But he just smiled and said, “That’s all right, Mr. Harris. I don’t want any reward.”
Success
George asked Mr. Harris “Did you lose your wallet?”
Mr. Harris jumped. He slapped a hand to his pocket, and shouted. “Yes, I have! Eight hundred dollars in it, too. What do you know about it?”
“Is this it?” George asked.
"Yes, yes, that's it!" Mr. Harris said, snatching the wallet. He opened it and hurriedly counted the money. He counted all the bills over twice.
Then he breathed a long sigh of relief and said, “Well, at least this durn boy didn’t rob me. He could have taken the money and run. I’m mighty obliged to you, son. Here’s a twenty-dollar bill for you.”
George’s father had been right. It was Mr. Harris’s wallet.
Success
Your clothes are at the dry cleaners.
Success
Amy, David, and George.
Success
Sid and Imogene are not happy about this. They were looking forward to seeing Peter and Mary and now they have to go home.
Failure
Success
Mary and Elizabeth, who happen to be old friends, get into a long conversation about seashells and dragons, and two fictional characters named Prince Arthur and Lord Dazzle. After they finish their conversation, the two old friends depart.
The two sailors who are left are named John and Susan. They continue to sit at the bar, and eventually they get into a conversation about the best ways to make a living as a sailor. Eventually, they both decide to go their separate ways.
Success
Success
5
Failure. The underscore is part of GPT-3's response.
The jury selection process can be based on many different factors. Some common reasons for dismissing potential jurors include: lack of knowledge about the case, ability to be impartial, or having a conflict of interest.
Success
Failure
Failure