How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capacity of large language models (LLMs) to learn *in real time* during inference—termed test-time learning—and systematically compares this capability to human learning. Method: We propose “Semantic Games,” a saturation-resistant, strategy-reasoning–intensive evaluation paradigm, and develop an objective benchmark framework covering four experience representations: trajectory replay, summarization, key-event extraction, and contextual abstraction. We formally define and quantify test-time learning ability in LLMs, introduce human participants as a behavioral baseline, and design a contrastive dynamic evaluation protocol with human-AI collaborative alignment. Contribution/Results: Our evaluation demonstrates that LLMs exhibit measurable test-time learning, yet their learning speed is significantly slower than humans’, and performance degrades or fluctuates under extended experience accumulation. This work establishes a novel evaluation paradigm, a human-grounded benchmark, and foundational insights into the mechanisms and limitations of LLM learning.
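To make the protocol concrete, here is a minimal sketch of how the contrastive evaluation described above could be organized: each game is played once with accumulated experience in context and once without, and the gap between the two conditions measures test-time learning. The game interface, the `build_experience`/`score` stubs, and the representation handling are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the test-time learning evaluation loop.
# The game API, experience builders, and scoring are assumptions, not the paper's code.
from statistics import mean

EXPERIENCE_FORMS = ["trajectory_replay", "summarization",
                    "key_event_extraction", "contextual_abstraction"]

def build_experience(history, form):
    """Condense past game trajectories into the chosen representation (stub)."""
    if form == "trajectory_replay":
        return "\n".join(history)  # replay raw trajectories verbatim
    return f"[{form} of {len(history)} past games]"  # placeholder for an LLM-produced digest

def score(trajectory):
    """Game outcome in [0, 1]; a real harness would score against an opponent (stub)."""
    return float(len(trajectory) > 0)

def play_game(llm, experience):
    """Play one semantic game with the given experience in context; return (score, trajectory)."""
    prompt = f"Experience so far:\n{experience}\n\nPlay the next game strategically."
    trajectory = llm(prompt)  # llm: any callable mapping a prompt string to a response string
    return score(trajectory), trajectory

def evaluate_test_time_learning(llm, n_games, form):
    """Compare cumulative-experience play against a limited-experience baseline."""
    history, cumulative_scores, limited_scores = [], [], []
    for _ in range(n_games):
        # Cumulative setting: all prior trajectories are available as experience.
        s_cum, traj = play_game(llm, build_experience(history, form))
        cumulative_scores.append(s_cum)
        history.append(traj)
        # Limited setting: no accumulated experience (fresh context each game).
        s_lim, _ = play_game(llm, "")
        limited_scores.append(s_lim)
    # Test-time learning gain: average improvement attributable to experience.
    return mean(cumulative_scores) - mean(limited_scores)
```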

📝 Abstract
As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
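As a toy illustration of the human-LLM comparison, one simple way to quantify learning speed is to fit a line to per-round scores and compare slopes between the human baseline and the model. The slope metric and the example numbers below are illustrative assumptions, not the paper's reported measure or data.

```python
# Illustrative learning-speed comparison: least-squares slope of score vs. round.
# The metric and the example curves are assumptions for illustration only.
import numpy as np

def learning_slope(scores_per_round):
    """Slope of a linear fit to scores over rounds (higher = faster improvement)."""
    rounds = np.arange(len(scores_per_round))
    slope, _intercept = np.polyfit(rounds, scores_per_round, 1)
    return slope

# Made-up curves: the human climbs steadily, the LLM improves slowly and fluctuates.
human_scores = [0.30, 0.45, 0.60, 0.70, 0.78]
llm_scores = [0.30, 0.34, 0.31, 0.38, 0.36]
print(learning_slope(human_scores), learning_slope(llm_scores))
```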
Problem

Research questions and friction points this paper is trying to address.

Evaluating test-time learning ability in LLMs
Comparing LLM and human learning performance
Assessing LLM improvement under varying experience conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic games as a saturation-resistant testbed for test-time learning
Evaluation framework with four forms of experience representation
Human participants as a comparative baseline for LLM assessment
Jiayin Wang
Tsinghua University
User Modeling, Personalization

Zhiquang Guo
Tsinghua University, Beijing, China

Weizhi Ma
Tsinghua University
LLM and Agents, Recommendation, AI for Healthcare

Min Zhang
Tsinghua University, Beijing, China