How Far Can LLMs Improve from Experience? Measuring Test-Time Learning Ability in LLMs with Human Comparison

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the capacity of large language models (LLMs) to learn *in real time* during inference—termed test-time learning—and systematically compares this capability to human learning. Method: We propose “Semantic Games,” a saturation-resistant, strategy-reasoning–intensive evaluation paradigm, and develop an objective benchmark framework covering four experience representations: trajectory replay, summarization, key-event extraction, and contextual abstraction. We formally define and quantify test-time learning ability in LLMs, introduce human participants as a behavioral baseline, and design a contrastive dynamic evaluation protocol with human-AI collaborative alignment. Contribution/Results: Our evaluation demonstrates that LLMs exhibit measurable test-time learning, yet their learning speed is significantly slower than humans’, and performance degrades or fluctuates under extended experience accumulation. This work establishes a novel evaluation paradigm, a human-grounded benchmark, and foundational insights into the mechanisms and limitations of LLM learning.
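To make the protocol concrete, here is a minimal sketch of how the contrastive evaluation described above could be organized: each game is played once with accumulated experience in context and once without, and the gap between the two conditions measures test-time learning. The game interface, the `build_experience`/`score` stubs, and the representation handling are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the test-time learning evaluation loop.
# The game API, experience builders, and scoring are assumptions, not the paper's code.
from statistics import mean

EXPERIENCE_FORMS = ["trajectory_replay", "summarization",
                    "key_event_extraction", "contextual_abstraction"]

def build_experience(history, form):
    """Condense past game trajectories into the chosen representation (stub)."""
    if form == "trajectory_replay":
        return "\n".join(history)  # replay raw trajectories verbatim
    return f"[{form} of {len(history)} past games]"  # placeholder for an LLM-produced digest

def score(trajectory):
    """Game outcome in [0, 1]; a real harness would score against an opponent (stub)."""
    return float(len(trajectory) > 0)

def play_game(llm, experience):
    """Play one semantic game with the given experience in context; return (score, trajectory)."""
    prompt = f"Experience so far:\n{experience}\n\nPlay the next game strategically."
    trajectory = llm(prompt)  # llm: any callable mapping a prompt string to a response string
    return score(trajectory), trajectory

def evaluate_test_time_learning(llm, n_games, form):
    """Compare cumulative-experience play against a limited-experience baseline."""
    history, cumulative_scores, limited_scores = [], [], []
    for _ in range(n_games):
        # Cumulative setting: all prior trajectories are available as experience.
        s_cum, traj = play_game(llm, build_experience(history, form))
        cumulative_scores.append(s_cum)
        history.append(traj)
        # Limited setting: no accumulated experience (fresh context each game).
        s_lim, _ = play_game(llm, "")
        limited_scores.append(s_lim)
    # Test-time learning gain: average improvement attributable to experience.
    return mean(cumulative_scores) - mean(limited_scores)
```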

📝 Abstract
As evaluation designs of large language models may shape our trajectory toward artificial general intelligence, comprehensive and forward-looking assessment is essential. Existing benchmarks primarily assess static knowledge, while intelligence also entails the ability to rapidly learn from experience. To this end, we advocate for the evaluation of Test-time Learning, the capacity to improve performance in experience-based, reasoning-intensive tasks during test time. In this work, we propose semantic games as effective testbeds for evaluating test-time learning, due to their resistance to saturation and inherent demand for strategic reasoning. We introduce an objective evaluation framework that compares model performance under both limited and cumulative experience settings, and contains four forms of experience representation. To provide a comparative baseline, we recruit eight human participants to complete the same task. Results show that LLMs exhibit measurable test-time learning capabilities; however, their improvements are less stable under cumulative experience and progress more slowly than those observed in humans. These findings underscore the potential of LLMs as general-purpose learning machines, while also revealing a substantial intellectual gap between models and humans, irrespective of how well LLMs perform on static benchmarks.
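As a toy illustration of the human-LLM comparison, one simple way to quantify learning speed is to fit a line to per-round scores and compare slopes between the human baseline and the model. The slope metric and the example numbers below are illustrative assumptions, not the paper's reported measure or data.

```python
# Illustrative learning-speed comparison: least-squares slope of score vs. round.
# The metric and the example curves are assumptions for illustration only.
import numpy as np

def learning_slope(scores_per_round):
    """Slope of a linear fit to scores over rounds (higher = faster improvement)."""
    rounds = np.arange(len(scores_per_round))
    slope, _intercept = np.polyfit(rounds, scores_per_round, 1)
    return slope

# Made-up curves: the human climbs steadily, the LLM improves slowly and fluctuates.
human_scores = [0.30, 0.45, 0.60, 0.70, 0.78]
llm_scores = [0.30, 0.34, 0.31, 0.38, 0.36]
print(learning_slope(human_scores), learning_slope(llm_scores))
```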
Problem

Research questions and friction points this paper is trying to address.

Evaluating test-time learning ability in LLMs
Comparing LLM and human learning performance
Assessing LLM improvement under varying experience conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic games as a saturation-resistant testbed for test-time learning
Evaluation framework with four forms of experience representation
Human participants as a comparative baseline for LLM assessment
Jiayin Wang
Tsinghua University
User Modeling, Personalization

Zhiquang Guo
Tsinghua University, Beijing, China

Weizhi Ma
Tsinghua University
LLM and Agents, Recommendation, AI for Healthcare

Min Zhang
Tsinghua University, Beijing, China