Winograd Schemas and Machine Translation: Some Examples

In (Davis, 2016) I discussed the use of Winograd schemas as challenges for machine translation, an idea that dates back at least to Terry Winograd's doctoral thesis (1970). Below, I report the results of testing 37 such examples on Google Translate (GT) and DeepL as of January 2020.

As proposed in (Davis, 2016), the examples below are modified from the online collection of Winograd schemas, rephrased so that a translator is required to choose between "il" (French masculine singular) and "elle" (feminine singular) as translations for "it", or between "ils" (masc. pl.) and "elles" (fem. pl.) as translations for "they".
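Each schema below is written as a single template with two alternatives, either bare ("small/large") or bracketed ("[sold/bought from]"), and stands for a pair of sentences. The following is a minimal sketch of how such a template could be expanded into its two sentences before being submitted to a translator. The function name `expand_schema` and the two regular expressions are illustrative assumptions, not part of the original experiments, which were carried out by hand.

```python
import re
from typing import Tuple

# Bracketed alternatives such as "[sold/bought from]" may span several words.
BRACKETED = re.compile(r"\[([^\[\]/]+)/([^\[\]/]+)\]")
# Bare alternatives such as "feared/advocated" are assumed to be single words.
BARE = re.compile(r"\b(\w+)/(\w+)\b")


def expand_schema(template: str) -> Tuple[str, str]:
    """Return the two sentences encoded by a schema template.

    Only the first alternation in the template is expanded.
    """
    match = BRACKETED.search(template) or BARE.search(template)
    if match is None:
        raise ValueError("no alternation found in template")
    start, end = match.span()
    first, second = match.group(1), match.group(2)
    return (
        template[:start] + first + template[end:],
        template[:start] + second + template[end:],
    )


if __name__ == "__main__":
    for sentence in expand_schema(
        "The trophy does not fit in the suitcase because it is too small/large."
    ):
        print(sentence)
```

Trying the bracketed pattern first keeps multi-word alternatives such as "[around it/it]" intact.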

Confronted with subtle choices of this kind, machine translation systems tend to be very sensitive to small changes in wording that seem inconsequential to a human reader. I have therefore often included two versions of what is essentially the same schema. Unlike in my Collection of Machine Translation Fails, here I have recorded the results on all the examples that I attempted in which the translations created by the programs included a gendered pronoun, both those in which the translation programs succeeded and those in which they failed. I have omitted some in which the translation avoided the use of a pronoun, or where the gender of the pronoun was indeterminate.
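Many of the second ("B") versions below are obtained by switching the two noun phrases in the first version, for example "Jim and Mark" with "Joan and Susan". A minimal sketch of that substitution follows. The helper name `switch_phrases` is hypothetical, and the sketch assumes the sentence does not contain the NUL character used as a temporary placeholder.

```python
def switch_phrases(sentence: str, first: str, second: str) -> str:
    """Swap every occurrence of `first` with `second` and vice versa."""
    placeholder = "\0"  # assumed not to occur in the sentence
    return (
        sentence.replace(first, placeholder)
        .replace(second, first)
        .replace(placeholder, second)
    )


if __name__ == "__main__":
    original = (
        "Joan and Susan made sure to thank Jim and Mark "
        "for all the help they had given."
    )
    print(switch_phrases(original, "Jim and Mark", "Joan and Susan"))
```

The placeholder step is needed because replacing one phrase directly with the other would leave two copies of the same phrase, which the second replacement could no longer tell apart.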

Tally

The examples below include 37 pairs corresponding to 23 different Winograd schemas. Of those 37 pairs:

Google Translate
There are 3 pairs (#15.B, #16, #19) where GT gets both sentences right.
There is 1 pair (#4.A) where GT gets one sentence right, and correctly translates the other sentence but without using a pronoun.
In the remaining 33 pairs, GT uses the same pronoun for both sentences (right for one, wrong for the other).

DeepL
There are 4 pairs (#4.A, #11.B, #16, #23) where DeepL gets both sentences right.
There is 1 pair (#8) where DeepL gets both sentences wrong.
In the remaining 32 pairs, DeepL uses the same pronoun for both sentences (right for one, wrong for the other).

Bottom line: As of January 2020, Winograd Schemas are very hard for machine translation programs, but the programs are beginning to make some inroads on them.
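The categories used in the tally above can be sketched as a small classifier. This is only an illustration: the category labels, the function name `classify_pair`, and the data layout are assumptions, and the four entries are a sample taken from the examples below rather than the full set of 37 pairs. Each entry gives the correct pronouns for the two sentences of a pair, followed by the pronouns a system produced, with `None` where the system avoided a pronoun.

```python
from collections import Counter
from typing import Optional, Tuple

Produced = Tuple[Optional[str], Optional[str]]


def classify_pair(expected: Tuple[str, str], produced: Produced) -> str:
    """Classify one sentence pair by how many pronouns the system got right."""
    right = sum(1 for e, p in zip(expected, produced) if p == e)
    avoided = sum(1 for p in produced if p is None)
    if right == 2:
        return "both right"
    if right == 1 and avoided == 1:
        return "one right, one without pronoun"
    if right == 1:
        return "one right, one wrong"
    if avoided == 0:
        return "both wrong"
    return "other"


# Illustrative sample only: (expected, produced) for a few pairs listed below.
results = {
    "#1 (GT)": (("ils", "elles"), ("elles", "elles")),
    "#4.A (GT)": (("elles", "ils"), (None, "ils")),
    "#8 (DeepL)": (("il", "elle"), ("elle", "il")),
    "#16 (GT)": (("le", "la"), ("le", "la")),
}

tally = Counter(classify_pair(exp, prod) for exp, prod in results.values())

if __name__ == "__main__":
    for category, count in tally.items():
        print(f"{category}: {count}")
    # Sanity check on the reported totals: each system's counts cover 37 pairs.
    assert 3 + 1 + 33 == 37  # Google Translate
    assert 4 + 1 + 32 == 37  # DeepL
```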

Examples

  1. The city councilmen refused the women demonstrators a permit because they feared/advocated violence.
    Google Translate and DeepL both use "elles" in both: right for "advocated", wrong with "feared".

  2. The trophy does not fit in the suitcase because it is too small/large.
    Google and DeepL both use "le trophée" and "la valise", and both use "il" for both sentences: right for "large" and wrong for "small".

  3. A. Joan and Susan made sure to thank Jim and Mark for all the help they had given/received.
    GT and DeepL both use "ils" for both: right with "given", wrong with "received".

    B. The same happens if you switch "Jim and Mark" with "Joan and Susan".

  4. A. Joan and Susan tried to call Jim and Mark on the phone, but they weren't successful/available.
    GT correctly uses "ils" for "available" and avoids the issue for "successful": Joan et Susan ont essayé d'appeler Jim et Mark au téléphone, mais sans succès.
    DeepL correctly uses "ils" for "available" and "elles" for "successful".

    B. However, if you switch "Jim and Mark" with "Joan and Susan" then both GT and DeepL use "ils" for both: right for "successful", wrong for "available".

  5. A. The stag raced past the lioness because it was going so fast/slow.
    Google Translate avoids the issue in both sentences: Le cerf a couru devant la lionne parce que ça allait si vite/lentement.
    DeepL uses "il" in both: right for "fast", wrong for "slow".

    B. The lioness could not catch up with the stag, because it was going too slow/fast.
    GT and DeepL both use "il" for both: right for "fast", wrong for "slow".

  6. A. Frank and Bill felt [vindicated/crushed] when their longtime rivals Joan and Susan revealed that they were the winners of the competition.
    GT uses "ils" for both: right for "vindicated", wrong for "crushed". DeepL uses "elles" for both: right for "crushed", wrong for "vindicated".

    B. If you switch "Frank and Bill" with "Joan and Susan", GT still uses "ils" for both, and DeepL now uses "ils" for both.

  7. A. The fathers couldn't lift their daughters because they were too weak/heavy.
    GT and DeepL use "elles" for both: right for "heavy", wrong for "weak".

    B. If you change "fathers" to "mothers" and "daughters" to "sons", then both programs use "ils" for both.

  8. The hammer crashed through the table because it was made of steel/styrofoam.
    GT uses "le marteau" for "hammer" and "la table" for "table" and uses "il" for both sentences: right for "steel", wrong for "styrofoam". DeepL uses the same nouns, and uses "il" with "styrofoam" and "elle" with "steel"; wrong in both cases.

  9. A. Jim and Mark couldn't see the stage with Susan and Joan sitting in front of them, because they are so short/tall.
    GT uses "ils" for both; right for "short", wrong for "tall".
    DeepL uses "elles" for both; right for "tall", wrong for "short".

    B. If you switch "Jim and Mark" with "Susan and Joan" then both programs use "ils" for both.

  10. The vase rolled off the shelf because it wasn't anchored/level.
    Both programs use "le vase" for "vase" and "étagère" (fem.) for "shelf", and they both use "il" for both sentences: right for "anchored", wrong for "level".

  11. A. Jim and Mark did a lot [better/worse] than their good friends Susan and Joan on the test because they had studied so hard.
    Both programs use "ils" for both sentences: right for "better", wrong for "worse".

    B. If you switch "Jim and Mark" with "Susan and Joan", then GT still uses "ils" for both sentences but DeepL correctly uses "ils" for "worse" and "elles" for "better".

  12. A. Susan and Joan were upset with Jim and Mark because the toasters they had [sold/bought from] them didn't work.
    Google uses "ils" for both sentences: right for "sold", wrong for "bought". DeepL uses "elles" for both sentences: right for "bought", wrong for "sold".

    B. If you switch "Susan and Joan" with "Jim and Mark" then both programs use "ils" for both sentences.

  13. A. Jim and Mark [yelled at/comforted] Susan and Joan because they were so upset.
    GT and DeepL both use "ils" in both sentences: right for "yelled at", wrong for "comforted".

    B. If you switch "Susan and Joan" with "Jim and Mark" then both programs still use "ils" for both sentences.

  14. A. The sack of potatoes had been placed [above/below] the box of flour, so it had to be moved first.
    Both programs translate "the sack" as "le sac" and "the box" as "la boite", and both programs use a masculine pronoun for both sentences.

    B. If you switch "sack of potatoes" and "box of flour", then both programs use a feminine pronoun for both sentences.

  15. A. Jim and Mark envy Susan and Joan [because/although] they are very successful.
    Both GT and DeepL use "ils" in both sentences.

    B. If you switch "Susan and Joan" with "Jim and Mark" then Google Translate correctly uses "ils" with "because" and "elles" with "although". DeepL uses "ils" with both.

  16. I spread the cloth on the table to [display/cover] it.
    GT and DeepL use "le tissu" for "the cloth" and "la table" for "the table", and correctly use the masculine pronoun "le" with "display" and the feminine pronoun "la" with "cover".

  17. A. Jim and Mark know all about Susan and Joan's problems because they are [indiscreet/nosy].
    Both programs use "ils" for both sentences: right for "nosy", wrong for "indiscreet".

    B. If you switch "Susan and Joan" with "Jim and Mark" then GT still uses "ils" for both, but now DeepL uses "elles" for both.

  18. A. Jim and Mark explained their theory to Susan and Joan, but they couldn't [understand/convince] them.
    GT and DeepL give "ils" for both: right for "convince", wrong for "understand".

    B. If you switch "Susan and Joan" with "Jim and Mark" then GT and DeepL still give "ils" for both.

  19. There is a pillar between me and the stage and I can't see [around it/it].
    GT uses "un pilier" and "la scène" and correctly uses the feminine with "see it" and the masculine with "see around it".
    DeepL incorrectly uses the masculine for "see it" and uses no pronoun at all for "see around it."

  20. A. Alice and Barbara tried frantically to stop their sons from [chatting/barking] at the party, leaving us to wonder why they were behaving so strangely.
    GT and DeepL both use "ils" for both sentences: right for "chatting", wrong for "barking".

    B. If you change "Alice and Barbara" to "Jim and Mark" and "sons" to "daughters", then both programs give "elles" for both versions.

  21. Sam pulled up a chair to the piano, but it was broken, so he had to [sing/stand] instead.
    GT and DeepL both use "la chaise" for "chair" and "le piano" for "piano". They both use "elle" in both sentences: right for "stand", wrong for "sing".

  22. I can't cut that tree down with that axe because it is too [small/thick].
    Both programs use "arbre" (masc.) for "tree" and "hache" (fem.) for "axe". Both use "il" in both sentences: right for "thick", wrong for "small".

  23. The piano won't fit through the doorway because it is too [wide/narrow].
    Both programs use "le piano" and "la porte". Google uses "il" for both sentences: right for "wide", wrong for "narrow". DeepL correctly uses "il" for "wide", and "elle" for "narrow".

References

Davis, E. 2016. "Winograd Schemas and Machine Translation". arXiv:1608.01884.

Winograd, T. 1970. Procedures as a Representation for Data in a Computer Program for Understanding Natural Language, Ph.D. thesis, Department of Mathematics, MIT, August 1970. Published as MIT AITR-235, January 1971.