## Testing GPT-4 with Wolfram Alpha and Code Interpreter Plug-ins on math and science problems

Ernest Davis, New York University

and

Scott Aaronson, University of Texas at Austin (on leave at OpenAI 2022-24).

Created August 10, 2023.

This web site describes a test of the large language model GPT-4 with
the Wolfram Alpha and Code Interpreter plug-ins on 105 expert-crafted,
original problems in science and math, at the high school and college
levels, carried out in June-August 2023.

- Report
- The "Arbitrary Numerical" test set
- The "Calculation-Free" test set
- The "Motivated Numerical" test set

All the materials linked above are licensed under a
Creative Commons Attribution 4.0 International License.

### Other Studies

Tanuj Kumar and Mikhail Kats (2023).
ChatGPT-4 with Code Interpreter can be used to solve introductory college-level vector calculus and electromagnetism problems.
Tested ChatGPT4+CI on a set of 13 freshman-level vector calculus and
electromagnetism problems, running each problem 10 times and taking the
most frequent answer. 100% success rate over these problems.

Xiaoxuan Wang et al. (2023).
SciBench: Evaluating college-level scientific problem-solving abilities
of large language models.
Wrote their own interfaces from ChatGPT to Python and to Wolfram Alpha.
Created their own benchmark collection "SciBench" of chemistry, physics,
and math problems. Success rates between 19% and 70% depending on area (classical
mechanics was hardest, statistics was easiest).

Aojun Zhou et al. (2023).
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with
Code-based Self-Verification.
Supplemented GPT4+CI with additional techniques: self-verification and
verification-guided weighted majority voting. Achieved 84.3% on the MATH
data set (12,500 problems taken from high school competitions).
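
Two of the studies above aggregate repeated runs: Kumar and Kats take the most frequent answer, and Zhou et al. weight each vote by the outcome of a self-verification step. The Python sketch below illustrates the difference between the two schemes. It is not taken from either paper: the function names, the verdict labels, and the weight values are all hypothetical, chosen only to make the example concrete.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return the most frequent answer among repeated runs."""
    return Counter(answers).most_common(1)[0][0]


def weighted_majority_vote(answers, verdicts, weights):
    """Return the answer with the largest total vote weight.

    answers  -- final answer produced by each sampled solution
    verdicts -- self-verification outcome for each solution
    weights  -- hypothetical mapping from verdict label to vote weight
    """
    totals = defaultdict(float)
    for answer, verdict in zip(answers, verdicts):
        totals[answer] += weights[verdict]
    return max(totals, key=totals.get)


# Five hypothetical runs of the same problem.
runs = ["42", "41", "42", "42", "41"]
print(majority_vote(runs))  # plain majority picks "42" (3 votes to 2)

# Hypothetical verification outcomes and weights (assumptions, not from the paper).
verdicts = ["failed", "passed", "failed", "uncertain", "passed"]
weights = {"passed": 1.0, "uncertain": 0.5, "failed": 0.2}
print(weighted_majority_vote(runs, verdicts, weights))  # weighting picks "41"
```

In this made-up example the plain vote and the weighted vote disagree, because the less frequent answer is the one whose solutions passed verification.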