AI Summary
This study investigates how reinforcement learning (RL) post-training enhances compositional generalization in large language models (LLMs) on the Countdown task, i.e., synthesizing novel skills from known primitives.
Method: We formalize the task as expression trees and conduct fine-grained structural analysis, tracking subtree reuse, structural transfer, and learnability hierarchies. Leveraging RL post-training with explicit tree-structure tracing, we quantify success-rate evolution across varying depths, balance properties, and bias patterns (e.g., right-heavy trees).
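To make the tracked structural properties (depth, balance, left- vs. right-heaviness) concrete, here is a minimal sketch of expression-tree metrics. The tuple encoding and function names are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical encoding (illustrative, not the paper's): a leaf is a number;
# an internal node is a tuple (op, left_subtree, right_subtree).
def depth(tree):
    """Number of operator levels below the root (a leaf has depth 0)."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    return 1 + max(depth(left), depth(right))

def skew(tree):
    """Right-subtree depth minus left-subtree depth at the root:
    positive => right-heavy, negative => left-heavy, 0 => balanced."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    return depth(right) - depth(left)

# A left-heavy chain ((1 + 2) + 3) + 4 vs. a right-heavy chain 1 + (2 + (3 + 4))
left_chain = ("+", ("+", ("+", 1, 2), 3), 4)
right_chain = ("+", 1, ("+", 2, ("+", 3, 4)))
```

Both chains have depth 3, but `skew(left_chain)` is -2 while `skew(right_chain)` is +2: the kind of same-depth, opposite-bias contrast along which the study reports asymmetric fragility.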
Contribution/Results: We identify a structure-dependent acquisition order: shallow, balanced trees are learned first; deep or unbalanced trees lag; right-heavy structures remain persistently fragile. Crucially, standard evaluation metrics (e.g., pass@k) fail to capture this structural learning dynamic. Our findings expose intrinsic architectural constraints on compositional generalization and establish a new, interpretable paradigm for modeling skill transfer grounded in syntactic structure.
Abstract
While reinforcement learning (RL) successfully enhances reasoning in large language models, its role in fostering compositional generalization (the ability to synthesize novel skills from known components) is often conflated with mere length generalization. To disentangle the two, we study what RL post-training teaches about skill composition and how the structure of a composition affects skill transfer. We focus on the Countdown task (given n numbers and a target, form an expression that evaluates to the target) and analyze model solutions as expression trees, where each subtree corresponds to a reusable subtask and can thus be viewed as a "skill." Tracking tree shapes and their success rates over training, we find: (i) out-of-distribution (OOD) generalization to larger n and to unseen tree shapes, indicating compositional reuse of subtasks; (ii) a structure-dependent hierarchy of learnability: models master shallow, balanced trees (where workload is split evenly between subtasks) before deep, unbalanced ones, with persistent fragility on right-heavy structures (even at the same composition depth as left-heavy structures that are learned successfully). Our diagnostic reveals what is learned, in what order, and where generalization fails, clarifying how RL-only post-training induces OOD generalization beyond what standard metrics such as pass@k reveal.
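For readers unfamiliar with Countdown, a brute-force reference solver makes the task definition concrete. This is a minimal sketch under one common convention (every given number is used exactly once); it is unrelated to the paper's models or training setup.

```python
# Binary operations allowed in Countdown; division guards against zero.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}

def solve(nums, target):
    """Brute-force Countdown: `nums` is a list of (value, expr_string)
    pairs; returns an expression over all of them that evaluates to
    `target`, or None if no such expression exists."""
    if len(nums) == 1:
        value, expr = nums[0]
        return expr if value == target else None
    # Pick an ordered pair of operands, combine them with each operation,
    # and recurse on the shorter list that contains the combined node.
    for i in range(len(nums)):
        for j in range(len(nums)):
            if i == j:
                continue
            (a, ea), (b, eb) = nums[i], nums[j]
            rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
            for op, fn in OPS.items():
                value = fn(a, b)
                if value is None:
                    continue
                found = solve(rest + [(value, f"({ea} {op} {eb})")], target)
                if found:
                    return found
    return None

expr = solve([(n, str(n)) for n in (3, 5, 7)], 26)  # e.g. 5 + (3 * 7)
```

Every solution returned by `solve` is itself an expression tree, so its subtrees are exactly the reusable subtasks ("skills") the analysis tracks.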