🤖 AI Summary
It remains unclear whether reinforcement learning (RL) fundamentally enables the synthesis of novel reasoning skills or merely amplifies pre-existing behaviors. To probe this, we introduce “complementary reasoning”—a task requiring the joint integration of parametric knowledge and external context—and decouple it into two atomic skills: parametric reasoning and contextual reasoning.
Method: We systematically evaluate models across in-distribution, compositional generalization, and zero-shot settings, comparing supervised fine-tuning (SFT) and RL under controlled skill acquisition conditions.
Contribution/Results: We identify an SFT generalization paradox: while effective in-distribution, SFT fails on unseen relational compositions. Crucially, RL synthesizes composite reasoning strategies only when the atomic skills are already mastered. Building on this, we propose a *two-stage training paradigm*: first, solidify atomic skills via SFT; then, leverage RL for strategic composition. Validated on a synthetic biographical dataset, this approach significantly improves out-of-distribution generalization. Our findings delineate the mechanistic boundaries of RL as a “reasoning synthesizer” and establish a viable pathway for scalable reasoning skill composition.
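The intuition behind the two-stage recipe can be caricatured with a toy simulation (a sketch only: the function names, skill dictionary, and learning dynamics below are illustrative assumptions, not the paper's implementation). A composite rollout is rewarded only when *both* atomic steps succeed, so RL reward is sparse unless SFT has first made the atomic skills reliable:

```python
import random

def sft_atomic(model, skill, steps=50, lr=0.02):
    """Stage 1 (sketch): SFT pushes one atomic skill toward mastery.
    `model` is a toy dict mapping skill names to success probabilities."""
    for _ in range(steps):
        model[skill] = min(1.0, model[skill] + lr * (1.0 - model[skill]))
    return model

def rl_compose(model, steps=200, lr=0.1, seed=0):
    """Stage 2 (sketch): RL on the composite task. A rollout earns reward
    only if BOTH atomic calls succeed, so the signal is sparse unless the
    atomic skills are already strong."""
    rng = random.Random(seed)
    for _ in range(steps):
        # composite rollout = parametric step AND contextual step
        success = (rng.random() < model["parametric"]
                   and rng.random() < model["contextual"])
        if success:  # positive reward reinforces the composite strategy
            model["composite"] = min(
                1.0, model["composite"] + lr * (1.0 - model["composite"]))
    return model

base = {"parametric": 0.1, "contextual": 0.1, "composite": 0.0}

# RL alone on a weak base model: reward is almost never observed.
rl_only = rl_compose(dict(base))

# Two-stage: SFT on each atomic skill first, then RL to compose them.
staged = dict(base)
for skill in ("parametric", "contextual"):
    sft_atomic(staged, skill)
staged = rl_compose(staged)

print(rl_only["composite"], staged["composite"])
```

Under these toy dynamics the staged model ends with far higher composite success than RL-from-scratch, mirroring the paper's "atomic prerequisite" claim in miniature.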
📝 Abstract
The mechanism by which reinforcement learning (RL) contributes to reasoning capabilities, whether it incentivizes the synthesis of new skills or merely amplifies existing behaviors, remains a subject of intense debate. In this work, we investigate this question through the lens of Complementary Reasoning, a complex task that requires integrating internal parametric knowledge with external contextual information. Using a controlled synthetic dataset of human biographies, we strictly decouple this ability into two atomic skills: Parametric Reasoning (relying on internal knowledge) and Contextual Reasoning (depending on external information). To rigorously assess capability boundaries, we evaluate generalization across three distinct levels of difficulty: I.I.D., Composition, and Zero-shot settings. We find that while supervised fine-tuning (SFT) is sufficient for in-distribution performance, it struggles with O.O.D. generalization, particularly in Zero-shot settings where relational combinations are novel. Crucially, we identify the SFT Generalization Paradox: models supervised solely on the composite task achieve near-perfect in-distribution accuracy but collapse on out-of-distribution generalization, indicating a reliance on rote memorization of path shortcuts. In contrast, we find that RL acts as a reasoning synthesizer rather than a probability amplifier. However, we uncover a strict atomic prerequisite: RL can only synthesize these complex strategies if the base model has first mastered the independent atomic skills (Parametric and Contextual) via SFT. These findings challenge the view of RL as a mere amplifier, suggesting that, given sufficient atomic foundations, RL can actively synthesize complex reasoning strategies from learned primitives without explicit supervision on those strategies. This indicates that decoupled atomic training followed by RL offers a scalable path to generalization on complex reasoning tasks.