SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

📅 2026-02-04
🤖 AI Summary
This work addresses the challenge of accurately evaluating an agent’s ability to internalize new knowledge and apply it to future tasks—a capability hindered by confounding factors such as prior knowledge interference and entangled reasoning complexity. To isolate and quantify knowledge internalization, the authors propose SE-Bench, a novel benchmark that obfuscates the NumPy library and its documentation into a pseudo-new package with randomized identifiers and prohibits access to external documentation, thereby requiring agents to complete coding tasks under closed-book conditions. The study reveals three key insights: the “open-book paradox,” highlighting the necessity of closed-book training for effective knowledge compression; the “RL gap,” exposing limitations of standard reinforcement learning in fostering internalization; and “self-play internalization,” demonstrating that self-generated tasks combined with supervised fine-tuning significantly enhance internalization. SE-Bench establishes the first reliable evaluation platform for knowledge internalization in lifelong learning agents.

📝 Abstract
True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API documentation into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API documentation but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
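To make the obfuscation idea concrete, here is a minimal, hypothetical sketch of how a library's public API names could be mapped to randomized identifiers and how code could be rewritten against the pseudo-package. The function names (`obfuscate_identifiers`, `rewrite_snippet`) and the naive string-substitution strategy are illustrative assumptions, not the authors' actual SE-Bench pipeline:

```python
import random
import string


def obfuscate_identifiers(api_names, seed=0):
    """Map each real API name to a unique random pseudo-identifier.

    Illustrative sketch of SE-Bench-style obfuscation: the real
    names become unrecognizable so a base model cannot rely on
    pre-training knowledge of the library.
    """
    rng = random.Random(seed)  # seeded for a reproducible mapping
    mapping = {}
    used = set()
    for name in api_names:
        pseudo = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        while pseudo in used:  # avoid collisions between pseudo-names
            pseudo = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        used.add(pseudo)
        mapping[name] = pseudo
    return mapping


def rewrite_snippet(code, mapping):
    """Rewrite a code snippet to use the pseudo-package's names.

    Naive textual substitution; a real pipeline would rewrite the
    AST to avoid clobbering substrings inside other identifiers.
    """
    for real, pseudo in mapping.items():
        code = code.replace(real, pseudo)
    return code


mapping = obfuscate_identifiers(["numpy", "array", "mean", "reshape"])
obfuscated = rewrite_snippet("import numpy\nx = numpy.array([1, 2, 3])", mapping)
```

Under this scheme the same task is trivial if the agent has internalized the pseudo-package's documentation, but unsolvable from pre-training priors alone, which is the clean separation the benchmark is after.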
Problem

Research questions and friction points this paper is trying to address.

self-evolution
knowledge internalization
lifelong learning
benchmarking
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution
knowledge internalization
closed-book training
reinforcement learning gap
self-play