AI Summary
This study investigates how reinforcement learning (RL) post-training enhances compositional generalization in large language models (LLMs) on the Countdown task, i.e., synthesizing novel skills from known primitives.
Method: We formalize the task as expression trees and conduct fine-grained structural analysis, tracking subtree reuse, structural transfer, and learnability hierarchies. Leveraging RL post-training with explicit tree-structure tracing, we quantify success-rate evolution across varying depths, balance properties, and bias patterns (e.g., right-heavy trees).
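To make the tracked structural properties (depth, balance, left- vs. right-heaviness) concrete, here is a minimal sketch of expression-tree metrics. The tuple encoding and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical encoding (illustrative, not the paper's): a leaf is a number;
# an internal node is a tuple (op, left_subtree, right_subtree).
def depth(tree):
    """Number of operator levels below the root (a leaf has depth 0)."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    return 1 + max(depth(left), depth(right))

def skew(tree):
    """Right-subtree depth minus left-subtree depth at the root:
    positive => right-heavy, negative => left-heavy, 0 => balanced."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    return depth(right) - depth(left)

# A left-heavy chain ((1 + 2) + 3) + 4 vs. a right-heavy chain 1 + (2 + (3 + 4))
left_chain = ("+", ("+", ("+", 1, 2), 3), 4)
right_chain = ("+", 1, ("+", 2, ("+", 3, 4)))
```

Both chains have depth 3, but `skew(left_chain)` is -2 while `skew(right_chain)` is +2: the kind of same-depth, opposite-bias contrast along which the study reports asymmetric fragility.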
Contribution/Results: We identify a structure-dependent acquisition order: shallow, balanced trees are learned first; deep or unbalanced trees lag; right-heavy structures remain persistently fragile. Crucially, standard evaluation metrics (e.g., pass@k) fail to capture this structural learning dynamic. Our findings expose intrinsic architectural constraints on compositional generalization and establish a new, interpretable paradigm for modeling skill transfer grounded in syntactic structure.
Abstract
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To disentangle the two, we study what RL post-training teaches about skill composition and how the structure of a composition affects skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and can thus be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability: models master shallow, balanced trees (where workload is split evenly between subtasks) before deep, unbalanced ones, with persistent fragility on right-heavy structures (even at the same composition depth as left-heavy structures that are learned successfully). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
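For readers unfamiliar with Countdown, a brute-force reference solver makes the task definition concrete. This is a minimal sketch under one common convention (every given number is used exactly once); it is unrelated to the paper's models or training setup.

```python
# Binary operations allowed in Countdown; division guards against zero.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def solve(nums, target):
    """Brute-force Countdown: `nums` is a list of (value, expr_string)
    pairs; returns an expression over all of them that evaluates to
    `target`, or None if no such expression exists."""
    if len(nums) == 1:
        value, expr = nums[0]
        return expr if value == target else None
    # Pick an ordered pair of operands, combine them with each operation,
    # and recurse on the shorter list that contains the combined node.
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            (a, ea), (b, eb) = nums[i], nums[j]
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            for op, fn in OPS.items():
                value = fn(a, b)
                if value is None:
                    continue
                found = solve(rest + [(value, f"({ea} {op} {eb})")], target)
                if found:
                    return found
    return None

expr = solve([(n, str(n)) for n in (3, 5, 7)], 26)  # e.g. 5 + (3 * 7)
```

Every solution returned by `solve` is itself an expression tree, so its subtrees are exactly the reusable subtasks ("skills") the analysis tracks.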