CL-bench Life: Can Language Models Learn from Real-Life Context?

📅 2026-04-29

🤖 AI Summary
It remains unclear whether frontier language models can reliably learn from and reason over the messy, fragmented, experience-rich contexts of real life. To address this gap, the work introduces CL-bench Life, a fully human-curated benchmark of 405 context-task pairs and 5,348 verification rubrics that covers scenarios such as multi-party conversations, personal archives, and behavioral traces. An evaluation of ten frontier language models shows that even the best-performing model reaches a task-solving rate of only 19.3% (average across models: 13.8%), indicating substantial limits in current models' ability to learn from real-life contexts.
📝 Abstract
Today's AI assistants such as OpenClaw are designed to handle context effectively, making context learning an increasingly important capability for models. As these systems move beyond professional settings into everyday life, the nature of the contexts they must handle also shifts. Real-life contexts are often messy, fragmented, and deeply tied to personal and social experience, such as multi-party conversations, personal archives, and behavioral traces. Yet it remains unclear whether current frontier language models can reliably learn from such contexts and solve tasks grounded in them. To this end, we introduce CL-bench Life, a fully human-curated benchmark comprising 405 context-task pairs and 5,348 verification rubrics, covering common real-life scenarios. Solving tasks in CL-bench Life requires models to reason over complex, messy real-life contexts, calling for strong real-life context learning abilities that go far beyond those evaluated in existing benchmarks. We evaluate ten frontier LMs and find that real-life context learning remains highly challenging: even the best-performing model achieves a task-solving rate of only 19.3%, while the average across models is only 13.8%. Models still struggle to reason over contexts such as messy group chat histories and fragmented behavioral records from everyday life. CL-bench Life provides a crucial testbed for advancing real-life context learning, and progress on it can enable more intelligent and reliable AI assistants in everyday life.
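To make the quoted figures concrete, here is a minimal Python sketch of how rubric-based scoring of this kind could work. The abstract does not say how rubric verdicts are aggregated into a task solving rate, so the all-rubrics-must-pass rule below is an assumption, and the names TaskResult, avg_rubrics_per_task, and task_solving_rate are hypothetical. Only the counts 405 and 5,348 come from the abstract.

```python
from dataclasses import dataclass
from typing import List

# Figures quoted in the abstract.
NUM_TASKS = 405
NUM_RUBRICS = 5348


@dataclass
class TaskResult:
    """Hypothetical record: one task plus a pass/fail verdict per rubric."""
    task_id: str
    rubric_passed: List[bool]


def avg_rubrics_per_task(num_rubrics: int, num_tasks: int) -> float:
    """Mean number of verification rubrics attached to each task."""
    return num_rubrics / num_tasks


def task_solving_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks counted as solved.

    ASSUMPTION: a task is solved only if every one of its rubrics passes.
    The abstract does not confirm this aggregation rule.
    """
    if not results:
        return 0.0
    solved = sum(
        1 for r in results if r.rubric_passed and all(r.rubric_passed)
    )
    return solved / len(results)


if __name__ == "__main__":
    # About 13.2 rubrics per task on average, from the abstract's counts.
    print(round(avg_rubrics_per_task(NUM_RUBRICS, NUM_TASKS), 1))

    # Toy data (made up for illustration, not benchmark results).
    toy = [
        TaskResult("t1", [True, True, True]),
        TaskResult("t2", [True, False, True]),
        TaskResult("t3", [False, False]),
        TaskResult("t4", [True, True, False, True]),
    ]
    print(task_solving_rate(toy))  # 1 of 4 toy tasks passes every rubric
```

Under a strict rule like this, a task with roughly 13 rubrics fails as soon as a single rubric is missed, which is one plausible reason solving rates could be low even when many individual rubrics pass.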
Problem

Research questions and friction points this paper is trying to address.

real-life context
context learning
language models
everyday life
context understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

real-life context learning
context-aware reasoning
human-curated benchmark
messy contextual data
language model evaluation