SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

📅 2026-02-04

🤖 AI Summary
This work addresses the challenge of accurately evaluating an agent's ability to internalize new knowledge and apply it to future tasks, an evaluation confounded by interference from prior knowledge and by entangled reasoning complexity. To isolate and quantify knowledge internalization, the authors propose SE-Bench, a benchmark that obfuscates the NumPy library and its documentation into a pseudo-new package with randomized identifiers and withholds that documentation at evaluation time, so agents must complete coding tasks under closed-book conditions. The study reports three key insights: the "open-book paradox," which highlights the need for closed-book training to achieve effective knowledge compression; the "RL gap," which exposes the limits of standard reinforcement learning in fostering internalization; and "self-play internalization," which shows that self-generated tasks combined with supervised fine-tuning enable internalization. SE-Bench provides a rigorous diagnostic platform for evaluating knowledge internalization in lifelong learning agents.
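The obfuscation idea described above can be illustrated with a minimal sketch. It is not SE-Bench's actual pipeline: the helper `obfuscate_module`, the package name `pseudo_pkg`, and the identifier length are all hypothetical, and the standard-library `math` module stands in for NumPy so that the example has no dependencies.

```python
import math
import random
import string
import types


def obfuscate_module(module, seed=0, name_length=8):
    """Return (pseudo_module, mapping) with public callables renamed at random.

    Hypothetical helper for illustration only; not taken from SE-Bench.
    """
    rng = random.Random(seed)  # fixed seed keeps the mapping reproducible
    pseudo = types.ModuleType("pseudo_pkg")  # assumed name for the fake package
    mapping = {}
    used = set()
    for original_name in sorted(dir(module)):
        attr = getattr(module, original_name)
        # Skip private names and non-callables such as math.pi.
        if original_name.startswith("_") or not callable(attr):
            continue
        # Draw a fresh random identifier that has not been used yet.
        while True:
            new_name = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(name_length)
            )
            if new_name not in used:
                break
        used.add(new_name)
        mapping[original_name] = new_name
        setattr(pseudo, new_name, attr)
    return pseudo, mapping


pseudo, mapping = obfuscate_module(math, seed=42)
hidden_sqrt = getattr(pseudo, mapping["sqrt"])
# Same behaviour as math.sqrt, but reachable only through the random name.
print(hidden_sqrt(16.0))
```

With the mapping (and any documentation derived from it) withheld, a caller cannot guess which identifier computes a square root, which is the closed-book condition the summary describes.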

📝 Abstract
True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.
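For readers unfamiliar with the PPO clipping mentioned in insight (2), the sketch below shows the textbook per-sample clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), and how its slope with respect to the probability ratio r vanishes once r leaves the clip range in the direction the advantage favours. This is the standard formulation, not necessarily the exact variant the paper analyses; the function names and ε = 0.2 are assumptions, and the sketch does not reproduce the paper's findings on internalization.

```python
def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_slope(ratio, advantage, epsilon=0.2, step=1e-6):
    """Central finite-difference slope of the surrogate w.r.t. the ratio."""
    upper = ppo_clipped_term(ratio + step, advantage, epsilon)
    lower = ppo_clipped_term(ratio - step, advantage, epsilon)
    return (upper - lower) / (2.0 * step)


# Inside the clip range the sample still produces a learning signal.
print(surrogate_slope(1.0, advantage=1.0))
# Above 1 + epsilon with a positive advantage, the slope is zero.
print(surrogate_slope(1.5, advantage=1.0))
# Below 1 - epsilon with a negative advantage, the slope is zero as well.
print(surrogate_slope(0.5, advantage=-1.0))
```

A zero slope means that sample contributes no gradient to the policy update for that step, which is the mechanism usually meant by "clipping" in PPO.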
Problem

Research questions and friction points this paper is trying to address.

self-evolution
knowledge internalization
lifelong learning
benchmarking
agent evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-evolution
knowledge internalization
closed-book training
reinforcement learning gap
self-play