Colloquium Details
The Missing Science of AI Evaluation
Speaker: Sayash Kapoor, Princeton University
Location: 370 Jay St 1201
Date: February 25, 2026, 11 a.m.
Host: Julia Stoyanovich
Synopsis:
AI evaluations inform critical decisions, from the valuations of trillion-dollar companies to policies on regulating AI. Yet, evaluation methods have failed to keep pace with deployment, creating an evaluation crisis where performance in the lab fails to predict real-world utility. In this talk, I will discuss the evaluation crisis in a high-stakes domain: AI-based science. Across dozens of fields, from medicine to political science, I find that flawed evaluation practices have led to overoptimistic claims about AI's accuracy, affecting hundreds of published papers. To address these evaluation failures, I present a consensus-based checklist that identifies common pitfalls and consolidates best practices for researchers adopting AI, and a benchmark to foster the development of AI agents that can verify scientific reproducibility. AI evaluation failures affect several other applications. Beyond science, I examine how AI agent benchmarks miss many failure modes, and present systems to identify these errors. I examine inference scaling, a recent technique to boost AI capabilities, and show that claims of improvement fail to hold under realistic conditions. Finally, I discuss how better AI evaluation can inform policymaking, drawing on my work on open foundation models and my engagement with state and federal agencies. Why does the evaluation crisis persist? The AI community has poured enormous resources into building evaluations for models, but not into investigating how models impact the world. To address the crisis, we need to build a systematic science of AI evaluation to bridge the gap between benchmark performance and real-world impact.
Speaker Bio:
Sayash Kapoor is a computer science Ph.D. candidate and a Porter Ogden Jacobus Fellow at Princeton University. He is a co-author of AI Snake Oil, one of Nature's 10 best books of 2024. His newsletter is read by 70,000 AI enthusiasts, researchers, policymakers, and journalists. His work has been published in leading scientific journals such as Science and Nature Human Behaviour, as well as conferences such as ICLR, NeurIPS, and ICML. He has written for mainstream outlets including The Wall Street Journal and WIRED, and his work has been featured in The New York Times, The Atlantic, The Washington Post, Bloomberg, and many other outlets. Kapoor has been recognized with various awards, including a best paper award at ACM FAccT, an impact recognition award at ACM CSCW, and inclusion in TIME's inaugural list of the 100 most influential people in AI.
Notes:
In-person attendance is available only to those with active NYU ID cards.