Created August 10, 2023.
This web site presents the results of a test of the large language model GPT-4, using the Wolfram Alpha and Code Interpreter plug-ins, on 105 expert-crafted, original problems in science and math at the high school and college levels, carried out in June-August 2023.
All the materials linked above are licensed under a Creative Commons Attribution 4.0 International License.
A follow-up study testing the same problems on GPT4-o1-preview was carried out in September 2024.
Xiaoxuan Wang et al. (2023). SciBench: Evaluating college-level scientific problem-solving abilities of large language models. Wrote their own interfaces from ChatGPT to Python and to Wolfram Alpha. Created their own benchmark collection, "SciBench", of chemistry, physics, and math problems. Success rates ranged between 19% and 70% depending on area (classical mechanics was hardest, statistics was easiest).
Aojun Zhou et al. (2023). "Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification". Supplemented GPT4+CI with additional techniques: self-verification and verification-guided weighted majority voting. Achieved 84.3% on the MATH data set (12,500 problems taken from high school competitions).