EvaLearn: Quantifying the Learning Capability and Efficiency of LLMs via Sequential Problem Solving

📅 2025-06-03
🤖 AI Summary
This study addresses the lack of evaluation frameworks for assessing the dynamic learning capability and efficiency of large language models (LLMs). It introduces the first benchmark specifically designed for sequential learning, comprising 648 challenging problems organized into 182 task sequences, which models must solve in order so that experience from earlier problems can improve performance on later ones. A novel sequential evaluation paradigm provides five automated, multidimensional metrics to quantify learning ability, within an assessment framework that integrates teacher-model feedback, instance-level scoring, and learning-trajectory analysis. Experimental results reveal a critical dissociation: strong static performance does not imply strong learning capacity, and the study is the first to document phenomena such as negative transfer, where performance degrades as a sequence progresses. Across nine state-of-the-art models, Claude-3.7-Sonnet achieves significant learning gains, whereas most models fail to show consistent improvement across sequences, highlighting a fundamental limitation in the dynamic learning capabilities of current LLMs.
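To make the sequential paradigm concrete, below is a minimal sketch of an EvaLearn-style evaluation loop in which each problem is attempted with the full record of earlier attempts in context. The callables `ask_model` and `judge_is_correct`, and the prompt layout, are illustrative assumptions rather than the benchmark's actual interface.

```python
# Minimal sketch of sequential (not parallel) evaluation over one task sequence.
# `ask_model` and `judge_is_correct` are hypothetical stand-ins: the former
# queries the model under test, the latter could be an LLM judge with rubrics.

def solve_sequence(problems, ask_model, judge_is_correct):
    """Solve a task sequence in order, carrying prior attempts as context."""
    history = []   # accumulated (problem, solution, verdict) experience
    verdicts = []
    for problem in problems:
        # Unlike parallel benchmarks, every earlier attempt stays in the
        # prompt, so the model can learn from its own successes and failures.
        context = "\n\n".join(
            f"Problem: {p}\nYour solution: {s}\n"
            f"Judge verdict: {'correct' if v else 'incorrect'}"
            for p, s, v in history
        )
        solution = ask_model(context=context, problem=problem)
        correct = judge_is_correct(problem, solution)
        history.append((problem, solution, correct))
        verdicts.append(correct)
    return verdicts  # per-position correctness, the input to learning metrics
```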

📝 Abstract
We introduce EvaLearn, a pioneering benchmark designed to evaluate large language models (LLMs) on their learning capability and efficiency in challenging tasks, a critical, yet underexplored aspect of model potential. EvaLearn contains 648 challenging problems across six task types, grouped into 182 sequences, each sequence dedicated to one task type. Diverging from most existing benchmarks that evaluate models in parallel, EvaLearn requires models to solve problems sequentially, allowing them to leverage the experience gained from previous solutions. EvaLearn provides five comprehensive automated metrics to evaluate models and quantify their learning capability and efficiency. We extensively benchmark nine frontier models and observe varied performance profiles: some models, such as Claude-3.7-sonnet, start with moderate initial performance but exhibit strong learning ability, while some models struggle to benefit from experience and may even show negative transfer. Moreover, we investigate model performance under two learning settings and find that instance-level rubrics and teacher-model feedback further facilitate model learning. Importantly, we observe that current LLMs with stronger static abilities do not show a clear advantage in learning capability across all tasks, highlighting that EvaLearn evaluates a new dimension of model performance. We hope EvaLearn provides a novel evaluation perspective for assessing LLM potential and understanding the gap between models and human capabilities, promoting the development of deeper and more dynamic evaluation approaches. All datasets, the automatic evaluation framework, and the results studied in this paper are available at the GitHub repository.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM learning capability via sequential problem solving
Measuring learning efficiency across 648 challenging problems spanning six task types
Quantifying learning dynamics under different feedback settings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential problem-solving benchmark for LLMs
Automated metrics for learning capability and efficiency (see the sketch after this list)
Instance-level rubrics and teacher-model feedback enhance learning
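The sketch below shows the kind of statistics such metrics can compute from the per-position verdicts returned by the loop above. The three quantities here (overall accuracy, a least-squares learning slope, and the position of the first correct solution) are illustrative of the paper's approach, not its exact five metric definitions.

```python
# Illustrative learning statistics over per-position verdicts from many
# sequences of equal length k (k >= 2). These approximate the flavor of
# EvaLearn's metrics; the paper's exact definitions may differ.

def learning_stats(verdict_sequences):
    k = len(verdict_sequences[0])   # problems per sequence
    n = len(verdict_sequences)      # number of sequences
    # Mean accuracy at each sequence position, averaged over sequences.
    per_position = [sum(seq[i] for seq in verdict_sequences) / n for i in range(k)]
    overall = sum(per_position) / k
    # Least-squares slope of accuracy vs. position: > 0 suggests learning,
    # < 0 suggests negative transfer.
    x_mean = (k - 1) / 2
    slope = sum((x - x_mean) * (y - overall) for x, y in enumerate(per_position)) \
            / sum((x - x_mean) ** 2 for x in range(k))
    # 1-based position of the first correct solution per sequence (None if never).
    first_correct = [next((i + 1 for i, v in enumerate(seq) if v), None)
                     for seq in verdict_sequences]
    return {"overall_accuracy": overall,
            "learning_slope": slope,
            "first_correct_positions": first_correct}
```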
Authors
Shihan Dou, Fudan University (LLMs, Code LMs, RL, Alignment)
Lin Yan, ByteDance Seed; NLP Group, Fudan University
Tao Gui, Fudan University
Ming Zhang
Chenhao Huang, School of Computer Science, University of Sydney (Distributed data management, Distributed systems)
Jiayi Chen
Feng Chen
Shichun Liu, Fudan University (NLP)
Yan Liu
Chenxiao Liu, Peking University
Cheng Zhong
Zongzhang Zhang, LAMDA, Nanjing University (Artificial Intelligence, Reinforcement Learning, Probabilistic Planning, Multi-Agent Systems)
Chao Xin
Chengzhi Wei
Qi Zhang
Xuanjing Huang