Created August 10, 2023.
This web site presents the results of a test of the large language model GPT-4, using the Wolfram Alpha and Code Interpreter plug-ins, on 105 expert-crafted, original problems in science and math at the high school and college levels, carried out in June-August 2023.
All the materials linked above are licensed under a Creative Commons Attribution 4.0 International License.
A follow-up study testing the same problems on GPT4-o1-preview was carried out in September 2024.
Xiaoxuan Wang et al. (2023). SciBench: Evaluating college-level scientific problem-solving abilities of large language models. Wrote their own interfaces from ChatGPT to Python and to Wolfram Alpha. Created their own benchmark collection, "SciBench", of chemistry, physics, and math problems. Success rates ranged between 19% and 70% depending on area (classical mechanics was hardest, statistics was easiest).
Aojun Zhou et al. (2023). "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification". Supplemented GPT4+CI with additional techniques: self-verification and verification-guided weighted majority voting. Achieved 84.3% on the MATH data set (12,500 problems taken from high school competitions).