BIG-bench Extra Hard Commonsense Tasks

In February 2025, a team of 20 DeepMind scientists released a collection of benchmark datasets, called "Big Bench Extra Hard" or "BBEH", designed to be challenging for the current generation of AI programs.

Paper
Github

BBEH includes 23 datasets. Each of these is modeled on one of the datasets in Big-Bench, but the problems have been made more difficult in various respects. The problems were written and vetted by the expert authors of the paper. Each dataset has 200 problems except for disambiguation_qa, which has 120.

Their report identifies three datasets as requiring "commonsense understanding": bbeh_disambiguation_qa, bbeh_nycc (New Yorker Caption Contest), and bbeh_sarc_triples (sarcasm detection). These are summarized below.

The examples quoted here are the first five in the json file for bbeh_disambiguation_qa and bbeh_nycc and the first two for bbeh_sarc_triples. I have omitted the statement of each problem, which is repeated for each example. The text of the problem has been cut and pasted and reformatted, but not otherwise edited, so the occasional misspelling or grammatical error is in the original. The answer considered correct by the dataset constructors is marked "target". The comments are mine.
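Pulling the first few examples out of a task file can be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and the layout it assumes (a JSON object with an "examples" list whose items have "input" and "target" fields) is an assumption that should be checked against the actual files in the repository.

```python
import json
from pathlib import Path


def load_examples(path, examples_key="examples"):
    """Load a task file and return its list of examples.

    ASSUMPTION: the file is either a JSON list of examples or a JSON
    object holding that list under `examples_key`.
    """
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data[examples_key] if isinstance(data, dict) else data


def first_examples(examples, n):
    """Return the first n examples, in file order."""
    return examples[:n]


# Hypothetical in-memory stand-in for a task file, so the sketch runs
# without downloading anything.
demo = {"examples": [{"input": f"problem {i}", "target": "(A)"} for i in range(1, 8)]}

for ex in first_examples(demo["examples"], 5):
    print(ex["input"], "->", ex["target"])
```

With a real file, `load_examples` would be called on its path and the result passed to `first_examples` with n set to 5 or 2.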

****************************************************

bbeh_disambiguation_qa

Description: Given a text of 1 or 2 sentences with multiple, potentially ambiguous, pronouns, find the referent of the pronoun, or assert that the sentence is ambiguous.

Example:
The visitor tells the teacher that she likes the cake, and then tells the chef that it was her favorite dish. Which of the following options correctly explains the antecedent(s) of the pronoun(s)?
(A) The visitor likes cake and the visitor told the chef.
(B) The teacher likes cake and the teacher told the chef.
(C) The visitor likes cake and the teacher told the chef.
(D) The teacher likes cake and the visitor told the chef.
(E) Ambiguous.
target: (A)

Comment: Flawed. Grammatically, since the two verb phrases "tells the teacher ... cake" and "then tells the chef ... dish" are conjoined, they must have the same subject, so it is not actually ambiguous who told the chef. The referent of "her" is indeed ambiguous, but that is not asked about.

Example:
The designer called the janitor and asked him to clean the room as soon as possible, but he went to fetch a cup of coffee first and then returned to his desk. Which of the following options correctly explains the antecedent(s) of the pronoun(s)?
(A) Designer is cleaning the room as soon as possible and got coffee.
(B) Janitor is cleaning the room as soon as possible and got coffee.
(C) Designer is cleaning the room as soon as possible and janitor got coffee.
(D) Janitor is cleaning the room as soon as possible and designer got coffee.
(E) Ambiguous.
target: (D)

Comment. Flawed. The word "him" can be disambiguated syntactically; since "asked" is conjoined with "called" they must have the same subject, the designer, so "him" cannot be the designer. The second part of the sentence seems muddled. The word "his" in "his desk" would naturally refer to the designer, as indicated in the suggested answer, but then what is the force of "but" and "first" in "but he went to fetch a cup of coffee first"?

Example:
They previously agreed on the phone that the customer can pay with cash. But when the electrician showed up at his home, he only had large bills and she didn't have enough loose change, so in the end had to go to the ATM to get the exact cash to pay her. Which of the following options correctly explains the antecedent(s) of the pronoun(s)?
(A) The electrician went to the ATM because the customer didn't have enough loose change.
(B) The customer went to the ATM because the electrician didn't have enough loose change.
(C) The electrician went to the ATM because the electrician didn't have enough loose change.
(D) The customer went to the ATM because the customer didn't have enough loose change.
(E) Ambiguous.
target: (B)

Comment. Flawed, since "had to go to the ATM" with no explicit subject is ungrammatical. If the subject "he" were supplied, there would be at most one ambiguity, since "he" and "she" cannot refer to the same person, ruling out (C) and (D). Certainly, determining who it was that went to the ATM is not a matter of resolving an antecedent of a pronoun, since no pronoun is given.

Example:
The student rushed to the architect's office to review his blueprints and to fetch his jacket. It was his first time there so it took some time to find where the jacket and blueprints were. Which of the following options correctly explains the antecedent(s) of the pronoun(s)?
(A) The student was reviewing the student's blueprints and fetching his jacket.
(B) The student was reviewing the architect's blueprints and fetching the architect's jacket.
(C) The student was reviewing the student's blueprints and fetching the architect's jacket.
(D) The student was reviewing the architect's blueprints and fetching the student's jacket.
(E) Ambiguous.
target: (A)

Comment. Flawed. I don't understand the justification for the answer. I would say that both the blueprint and the jacket are more likely to be the architect's. How would the student have had his own blueprint already on his first visit? And how did his jacket somehow get left in the architect's office? So my own answer here would be either D or E.

Example:
The undergraduate and the scientist visited the lab that needed an assistant. It turned out he was very qualified and so was immediately hired on the spot. Which of the following options correctly explains the antecedent(s) of the pronoun(s)?
(A) The scientist needed an assistant and was qualified for the job.
(B) The scientist needed an assistant and the undergraduate student was qualified for the job.
(C) The student needed an assistant and was qualified for the job.
(D) The student needed an assistant and the scientist was qualified for the job.
(E) Ambiguous.
target: (B)

Comment. Flawed. The clause "that needed an assistant" modifies "lab". It is the lab that needed an assistant, neither the scientist nor the student.

****************************************************

bbeh_nycc

Description: Given the verbal description of the picture of a New Yorker cartoon, choose the most humorous caption.

Example:
Description: A couple is sitting at a table in a train station, looking at menus. A waiter is standing next to them, taking their order. The station is busy with people waiting for trains.
Which of the following captions is the funniest?
(A) Let me guess, you don't like this table either?
(B) May I recommend absolutely nothing.Get out while you still can.
(C) I didn't think you could do worse than taking me to Subway for our anniversary. I was wrong.
(D) Will that be to-go, Sir?
(E) Sorry, we don't accept tokens or Metrocards.
(F) All of our specials are track-to-table and steamed on site.
(G) The special today can be served between the 6:24 and the 7:32.
(H) For the Express menu, you need to go down to 14th St.
(I) Might I recommend sitting somewhere else?
(J) We're known for our Tube Steak.
target: (A)

Example:
Description: A man and a woman are standing in a kitchen that is heavily secured. The refrigerator and cabinets are chained and padlocked, and there are bear traps and spikes on the floor. Which of the following captions is the funniest?
(A) And the bacon still got out!
(B) Yes. Or we could try willpower.
(C) Sweetie, I love your art. I do. But this is unbearable.
(D) Somehow the calories still escape and find me.
(E) I think you're going a bit overboard. It's only a mouse.
(F) No.... I don't think it's an over-reaction.
(G) Did you remember to get your prescription out?
(H) I don't know about you, but I'm thinking takeout.
(I) Ben and Jerry are in that freezer, Babe. I'm goin' in.
(J) Think of how few dishes we'll have to do.
target: (H)

Example:
Description: A group of people are at a social gathering, and one person is talking to a large inflatable tube man.
Which of the following captions is the funniest?
(A) Do you have to pee or something?
(B) I guess I don't have to ask what you do.
(C) I can't help but feel like you're trying to sell me something.
(D) I thought you said you have two left feet.
(E) You aren't the only giant windbag here.
(F) Don't you work at the used car dealership downtown?
(G) When you've got it, flaunt it!
(H) What do you say we blow this party?
(I) I apologize as I have difficulty recognizing faces out of context.
(J) So glad we can finally get together outside of the office.
target: (E)

Example:
Description: A man is standing in a room, talking on the phone. He has a knife stuck in his back with a piece of paper attached to it.
Which of the following captions is the funniest?
(A) It's for me but I can't read it.
(B) Actually, I would like to shoot the messenger.
(C) All I remember was that he looked like a monk
(D) These manuscript rejections are killing me.
(E) So much for "The Art of the Deal."
(F) What letter?
(G) At this moment it's a little difficult to read the fine print.
(H) I'm sure there's an emoticon that would have worked just as well.
(I) I'M CALLING TO COMPLAIN ABOUT HOW YOUR PAPER IS DELIVERED!!
(J) Hello, Roman Times? About your paper boy, Brutus...
target: (B)

Example:
Description: A dragon is holding flowers and looking shyly at a knight who is peeking out from a castle door.
Which of the following captions is the funniest?
(A) Sorry, I only date Non-smokers.
(B) Look, damsels say all sorts of things in distress.
(C) But how would we raise the children?
(D) If my dad asks, your favorite movie is Dragonslayer.
(E) Huh, I thought from the app you'd be taller.
(F) You know the is a meet and greet, not a meet and eat.
(G) Be patient; she's on the line with St. George.
(H) Oh... uh, hi. Ali is getting ready and will be out in a moment. Meanwhile, tell me, what do your parents do?
(I) I'm a drag queen, not a dragon queen!
(J) I was given to understand you'd be a frog.
target: (B)

Comment. Obviously, this is very largely a matter of taste. I doubt that human evaluators would agree strongly on these. My own opinion, for what it's worth: some of these captions are entirely unfunny, and there might be reasonable agreement on most of those. I don't find any of the target answers obviously the best of the options. My own choices would have been C, B, I, B, H, so I would score 0.20.
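The 0.20 figure is plain exact-match accuracy over the five examples quoted above. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def accuracy(choices, targets):
    """Fraction of positions where the chosen option equals the target."""
    if len(choices) != len(targets):
        raise ValueError("choices and targets must have the same length")
    return sum(c == t for c, t in zip(choices, targets)) / len(targets)


# Targets of the five bbeh_nycc examples quoted above, in order.
targets = ["A", "H", "E", "B", "B"]
# The choices given in the comment above.
my_choices = ["C", "B", "I", "B", "H"]

print(accuracy(my_choices, targets))  # 0.2 (only the fourth choice matches)
```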

****************************************************

bbeh_sarc_triples

Description: Given three pairs of posts and replies, decide whether each of the replies is sarcastic. The answer is a triple of 1's and 0's, where 1 means that the corresponding reply is sarcastic and 0 means that it isn't.

Example:
POST 1: TIL 53% of US cable customers would leave their current provider if they had a choice and 73% feel their cable provider engages in "predatory practices".
REPLY 1: Good thing they don't have a virtual monopoly on the service, otherwise they might actually lose customers.

POST 2: TIL Steve Job is worth 8 billion but has "steadfastly" refused to donate money to worthy causes, and has even stopped all of Apple's philanthropy programs.
REPLY 2: While I can't comment on Apple's programs, I know that he's widely suspected to be the anonymous donor that's been giving hundreds of millions of dollars to various research charities (particularly cancer research) over the past few years.

POST 3: come on man you ARE a drug addict, you know i once saw a friend die because she injected 5 marijuanas.
REPLY 3: These are sad people who need to justify their fear of the unknown by insulting anyone who challenges it.

target: 1,1,0

Example:
POST 1: As a black guy I truly hate when a black female says black men don't love because we also date out of our race, its so fucking stupid.
REPLY 1: You guys I have a great idea, what if we make sure the races don't mix so people can stick to their own kind.

POST 2: Former 4th District Congressional Candidate Admits Plotting Armed Militia Attack, Firebombing Of Muslim Community In New York.
REPLY 2: Four out of seven comments on this thread so far blame Republicans for this, when the article clearly says that this asshat ran as an "independent" and was beaten by a Republican.

POST 3: They directly asked for $1.00 donations and removed the thousands of huge donations and create this number for ONE MONTH, also the line that rewrote headline is from mentions according to the campaign OH YEA seems legit.
REPLY 3: With Hillary you really have to remove all the $1 donations and re-calculate to un-crook the numbers.

target: 1,0,0

Comment. Nothing is actually gained by taking problems with binary solutions and combining them in groups of three, and evaluating whether a person or system gets the correct answer on all three. The overall score on the combined question (as a fraction between 0 and 1) is just the cube of the overall score on the binary question, plus a little additional noise, corresponding to the randomness of how questions are combined.
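The cube relationship can be checked with a small simulation. This is a sketch under stated assumptions, not a reproduction of how the dataset was built: the numbers are made up (30,000 binary judgments, exactly 80% of them correct), the binary questions are grouped into triples uniformly at random, and a triple counts as correct only if all three of its judgments are correct.

```python
import random


def triple_accuracy(binary_correct, rng):
    """Randomly group binary outcomes into triples and return the
    fraction of triples in which all three outcomes are correct."""
    shuffled = list(binary_correct)
    rng.shuffle(shuffled)
    usable = len(shuffled) - len(shuffled) % 3  # drop any leftover items
    triples = [shuffled[i:i + 3] for i in range(0, usable, 3)]
    return sum(all(t) for t in triples) / len(triples)


# Hypothetical numbers: 30,000 binary questions, 24,000 answered correctly.
n_total, n_correct = 30000, 24000
outcomes = [True] * n_correct + [False] * (n_total - n_correct)

p = n_correct / n_total  # accuracy on the binary question
observed = triple_accuracy(outcomes, random.Random(0))

print("binary accuracy:", p)
print("cube of binary accuracy:", p ** 3)
print("observed triple accuracy:", observed)
```

The observed triple accuracy should land close to the cube of the binary accuracy, differing only by the sampling noise of the random grouping.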