When Is Compositional Reasoning Learnable from Verifiable Rewards?

📅 2026-02-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether autoregressive language models can effectively learn compositional reasoning tasks within a reinforcement learning with verifiable rewards (RLVR) framework that relies solely on outcome-level feedback. Through theoretical analysis, the work introduces the "task-advantage ratio," a formal quantity that characterizes learnability conditions jointly determined by task structure and base-model capabilities. The findings demonstrate that RLVR can efficiently learn target tasks when intermediate reasoning steps yield a clear advantage signal; otherwise, optimization tends to converge to suboptimal compositions, a failure mode that depends strongly on the quality of the base model. The work thus reveals the critical interplay between task design and model capacity in determining RLVR success and provides a formal account of learnability in compositional reasoning under sparse, outcome-only reward settings.
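
The paper's formal definition of the task-advantage ratio is not reproduced in this summary. As a purely illustrative sketch of the underlying idea, the Monte-Carlo simulation below shows how outcome-only feedback attaches a group-relative (GRPO-style) advantage to intermediate steps: when a correct intermediate step makes the verified outcome much more likely, correct-step rollouts receive a clearly higher advantage; when it does not, the signal washes out. The two-step task and all parameters (P_STEP1, P_FINAL_IF_OK, P_FINAL_IF_BAD) are hypothetical, not taken from the paper.

```python
import random
import statistics

random.seed(0)

# Hypothetical two-step compositional task: a correct intermediate step
# raises the probability that the verifier accepts the final answer.
# Only the final outcome is rewarded (outcome-level feedback), as in RLVR.
P_STEP1 = 0.5         # chance the base model gets the intermediate step right
P_FINAL_IF_OK = 0.9   # P(verified outcome | correct intermediate step)
P_FINAL_IF_BAD = 0.1  # P(verified outcome | wrong intermediate step)

def sample_rollout():
    """One rollout: (intermediate step correct?, binary outcome reward)."""
    step_ok = random.random() < P_STEP1
    p_final = P_FINAL_IF_OK if step_ok else P_FINAL_IF_BAD
    reward = 1.0 if random.random() < p_final else 0.0
    return step_ok, reward

def group_advantages(rewards):
    """Group-relative advantage (GRPO-style): reward minus the group mean."""
    mean = statistics.mean(rewards)
    return [r - mean for r in rewards]

rollouts = [sample_rollout() for _ in range(10_000)]
advs = group_advantages([r for _, r in rollouts])
adv_ok = statistics.mean(a for (ok, _), a in zip(rollouts, advs) if ok)
adv_bad = statistics.mean(a for (ok, _), a in zip(rollouts, advs) if not ok)
print(f"mean advantage | intermediate step correct: {adv_ok:+.3f}")
print(f"mean advantage | intermediate step wrong:   {adv_bad:+.3f}")
```

Shrinking the gap between P_FINAL_IF_OK and P_FINAL_IF_BAD drives both conditional advantages toward zero, which is the regime where, per the summary above, outcome-only RLVR loses its learning signal for the intermediate step.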

📝 Abstract
The emergence of compositional reasoning in large language models through reinforcement learning with verifiable rewards (RLVR) has been a key driver of recent empirical successes. Despite this progress, it remains unclear which compositional problems are learnable in this setting using outcome-level feedback alone. In this work, we theoretically study the learnability of compositional problems in autoregressive models under RLVR training. We identify a quantity that we call the task-advantage ratio, a joint property of the compositional problem and the base model, that characterizes which tasks and compositions are learnable from outcome-level feedback. On the positive side, using this characterization, we show that compositional problems where correct intermediate steps provide a clear advantage are efficiently learnable with RLVR. We also analyze how such an advantage naturally arises in different problems. On the negative side, when the structural advantage is not present, RLVR may converge to suboptimal compositions. We prove that, in some cases, the quality of the base model determines whether such an advantage exists and whether RLVR will converge to a suboptimal solution. We hope our analysis can provide a principled theoretical understanding of when and why RLVR succeeds and when it does not.
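
To make the negative result concrete in spirit, here is a hedged toy sketch, not the paper's construction: a REINFORCE-style learner choosing between a target composition and a shortcut under outcome-only reward. The success probabilities p_target and p_shortcut are hypothetical stand-ins for base-model quality; when the base model executes the target composition poorly, the shortcut dominates the outcome signal and the policy settles on the suboptimal composition.

```python
import math
import random

random.seed(1)

def simulate(p_target, p_shortcut, steps=5_000, lr=0.5):
    """REINFORCE on a Bernoulli choice between two compositions.

    p_target / p_shortcut are hypothetical outcome-success rates for the
    target composition and a shortcut; only the verified outcome is rewarded.
    """
    logit = 0.0  # log-odds of sampling the target composition
    for _ in range(steps):
        p_choose = 1.0 / (1.0 + math.exp(-logit))
        choose_target = random.random() < p_choose
        p_success = p_target if choose_target else p_shortcut
        reward = 1.0 if random.random() < p_success else 0.0
        # Oracle baseline (expected reward) for simplicity; in practice it
        # would be estimated from a group of sampled rollouts.
        baseline = p_choose * p_target + (1.0 - p_choose) * p_shortcut
        advantage = reward - baseline
        # d/d(logit) log pi(action) = action - p_choose for a Bernoulli policy
        grad = (1.0 - p_choose) if choose_target else -p_choose
        logit += lr * advantage * grad
    return 1.0 / (1.0 + math.exp(-logit))

# Strong base model: the target composition already succeeds often enough
# for outcome feedback to favor it. Weak base model: the shortcut wins and
# RLVR-style training converges to the suboptimal composition.
print("strong base model -> P(target):", round(simulate(0.8, 0.4), 3))
print("weak base model   -> P(target):", round(simulate(0.2, 0.4), 3))
```
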
Problem

Research questions and friction points this paper is trying to address.

compositional reasoning
reinforcement learning
verifiable rewards
learnability
autoregressive models
Innovation

Methods, ideas, or system contributions that make the work stand out.

compositional reasoning
reinforcement learning with verifiable rewards
task-advantage ratio
learnability
autoregressive models
🔎 Similar Papers
No similar papers found.