CL-bench: A Benchmark for Context Learning

📅 2026-02-03
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limited ability of current language models to dynamically acquire and apply new knowledge—such as domain-specific rules or empirical laws—from task contexts in complex real-world scenarios. To systematically define and evaluate this contextual learning capability, the authors introduce CL-bench, a novel benchmark comprising 500 expert-designed complex contexts, 1,899 tasks, and 31,607 validation rules, structured as context-task-validation triplets. Evaluation across ten state-of-the-art models reveals a significant performance bottleneck, with models completing only 17.2% of tasks on average; even the best-performing model, GPT-5.1, achieves just 23.7%. This benchmark fills a critical gap in assessing dynamic knowledge acquisition and application, highlighting a key limitation in contemporary language models’ contextual learning capacities.

📝 Abstract
Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
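The abstract describes CL-bench as context-task-rubric triplets, where a task counts as resolved only if the model's response satisfies its verification rubrics. A minimal evaluation harness along those lines can be sketched as follows; the class names, field names, and all-rubrics-must-pass scoring are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rubric:
    """One binary verification check applied to a model response."""
    description: str
    passed_fn: Callable[[str], bool]  # hypothetical predicate: response -> bool


@dataclass
class Task:
    prompt: str
    rubrics: List[Rubric]


@dataclass
class ContextItem:
    """A context bundles new knowledge with the tasks that depend on it."""
    context: str
    tasks: List[Task]


def task_solved(response: str, task: Task) -> bool:
    # A task counts as solved only if every rubric passes (an assumed policy).
    return all(r.passed_fn(response) for r in task.rubrics)


def solve_rate(items: List[ContextItem], model_fn: Callable[[str], str]) -> float:
    # Fraction of tasks solved, aggregated over all contexts.
    total = solved = 0
    for item in items:
        for task in item.tasks:
            response = model_fn(item.context + "\n\n" + task.prompt)
            total += 1
            solved += task_solved(response, task)
    return solved / total if total else 0.0
```

The key design point this sketch captures is that each task is graded against its own context: the model sees the context concatenated with the prompt, so any knowledge needed to pass the rubrics must come from that context rather than from pre-training.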
Problem

Research questions and friction points this paper is trying to address.

context learning
language models
real-world tasks
task-specific context
new knowledge acquisition
Innovation

Methods, ideas, or system contributions that make the work stand out.

context learning
CL-bench
real-world benchmark
in-context reasoning
knowledge acquisition
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Ming Zhang
Hunyuan Team, Tencent
Zhangyue Yin
Hunyuan Team, Tencent
Chenhao Huang
School of Computer Science, University of Sydney
Distributed data management, Distributed systems
Yujiong Shen
Hunyuan Team, Tencent
Junzhe Wang
Hunyuan Team, Tencent
Jiayi Chen
Unknown affiliation
LLM, VLM
Yuchen Ni
Fudan University
LLM
Junjie Ye
Hunyuan Team, Tencent
Cheng Zhang
Hunyuan Team, Tencent
Huaibing Xie
Hunyuan Team, Tencent
Jian-hua Hu
Fudan University
Shaolei Wang
Unknown affiliation
NLP, machine learning
Weichao Wang
UNC Charlotte
computer security
Yan Xiao
Fudan University
Yiting Liu
University of California San Diego
EDA, VLSI Physical Design, Machine Learning, Data Privacy Protection
Zenan Xu
Sun Yat-sen University
Zhen-Bo Guo
Hunyuan Team, Tencent
Pluto Zhou
Hunyuan Team, Tencent
Tao Gui
Fudan University
Zuxuan Wu
Fudan University
Xipeng Qiu
Fudan University
Qi Zhang
Fudan University
SAGIN, satellite routing
Xuanjing Huang
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI
Di Wang
Tencent Technology Co., Ltd
Artificial Intelligence, Large Language Model, Multimodal Model, Machine Learning
Shunyu Yao
Hunyuan Team, Tencent