SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models exhibit limited performance on general reasoning tasks such as causal inference and temporal understanding, primarily due to the scarcity of high-quality, verifiable, and diverse training data. This work proposes the SUPERNOVA framework, which systematically reformulates expert-annotated natural instructions into training signals suitable for Reinforcement Learning with Verifiable Rewards (RLVR). The study further investigates the impact of source task selection, task mixing strategies, and synthetic interventions on reasoning performance. Experimental results demonstrate that this approach substantially outperforms strong baselines—including Qwen3.5—on benchmarks such as BBEH, Zebralogic, and MMLU-Pro, achieving relative improvements of up to 52.8% on BBEH. These findings underscore the critical role of carefully engineered training data in enhancing general reasoning capabilities.
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
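The abstract's key insight is that instruction-tuning examples with expert-annotated answers can be reformulated into RLVR training signals, where the reward is computed by programmatically checking the model's answer against the annotated ground truth. A minimal sketch of such a verifiable reward, assuming a hypothetical answer format ("Answer: ...") and binary exact-match scoring — function names and the normalization scheme are illustrative, not taken from the paper:

```python
# Hypothetical sketch: a rule-based verifiable reward that checks a model's
# final answer against an expert-annotated ground truth. The format
# ("Answer: <...>") and normalization are assumptions for illustration.
import re

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so superficial formatting
    differences do not change the reward."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Binary exact-match reward over the model's final answer line."""
    # Assume the model is prompted to end its response with "Answer: <...>";
    # fall back to the full output if that line is absent.
    match = re.search(r"answer:\s*(.+)", model_output, flags=re.IGNORECASE)
    prediction = match.group(1) if match else model_output
    return 1.0 if normalize(prediction) == normalize(ground_truth) else 0.0

# Example: a causal-inference style instruction with an annotated answer.
print(verifiable_reward("The crowing does not cause sunrise. Answer: No", "no"))  # 1.0
```

Because the reward is a deterministic check rather than a learned model, any instruction-tuning task with a verifiable gold answer can, in principle, supply RLVR signal of this kind.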
Problem

Research questions and friction points this paper is trying to address.

general reasoning
reinforcement learning
verifiable rewards
causal inference
temporal understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

SUPERNOVA
Reinforcement Learning with Verifiable Rewards
general reasoning
data curation
instruction-tuning
Ashima Suvarna
University of California, Los Angeles
Natural Language Processing, Machine Learning, LLM Alignment
Kendrick Phan
University of California, Los Angeles
Mehrab Beikzadeh
University of California, Los Angeles
Hritik Bansal
University of California, Los Angeles | Indian Institute of Technology Delhi
Multimodal Learning, Language Modeling
Saadia Gabriel
University of California, Los Angeles