SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

📅 2026-04-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models exhibit limited performance on general reasoning tasks such as causal inference and temporal understanding, primarily due to the scarcity of high-quality, verifiable, and diverse training data. This work proposes the SUPERNOVA framework, which systematically reformulates expert-annotated natural instructions into training signals suitable for Reinforcement Learning with Verifiable Rewards (RLVR). The study further investigates the impact of source task selection, task mixing strategies, and synthetic interventions on reasoning performance. Experimental results show that models trained on SUPERNOVA outperform strong baselines, including Qwen3.5, on benchmarks such as BBEH, Zebralogic, and MMLU-Pro, with relative improvements of up to 52.8% on BBEH. These findings underscore the critical role of carefully engineered training data in enhancing general reasoning capabilities.
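To make the idea of a "verifiable" training signal concrete, here is a minimal Python sketch of a binary reward that compares a model response against an annotated ground-truth answer. It is an illustration only: the `Answer:` marker, the normalization rules, and the function names are assumptions made for this example and are not taken from the SUPERNOVA code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker (hypothetical format).

    Falls back to the whole response when the marker is absent.
    """
    marker = "answer:"
    idx = response.lower().rfind(marker)
    return response[idx + len(marker):] if idx != -1 else response


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 on a normalized exact match, else 0.0."""
    predicted = normalize(extract_final_answer(response))
    return 1.0 if predicted == normalize(ground_truth) else 0.0


if __name__ == "__main__":
    reply = "The alarm rang after the meeting started. Answer: Before."
    print(verifiable_reward(reply, "before"))
```

Exact match after normalization is the simplest possible verifier; it only works when the reformulated task has a short, unambiguous answer, which is one reason the choice of source tasks matters.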

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
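The contrast between selecting source tasks by overall average performance and selecting them per target task can be sketched in a few lines of Python. This is a toy illustration, not the paper's procedure: the score table, task names, and function names are hypothetical, and the numbers are invented solely to show how the two rankings can disagree.

```python
from statistics import mean

# Hypothetical scores: scores[source_task][target_task] is the downstream
# score obtained after training on that source task. Values are made up.
SCORES = {
    "task_a": {"causal": 0.30, "temporal": 0.10},
    "task_b": {"causal": 0.10, "temporal": 0.32},
    "task_c": {"causal": 0.22, "temporal": 0.22},
}


def select_by_average(scores, k):
    """Top-k source tasks ranked by mean score across all targets."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]


def select_per_target(scores, k):
    """For each target, the top-k source tasks ranked on that target alone."""
    targets = sorted({t for per_target in scores.values() for t in per_target})
    return {
        t: sorted(scores, key=lambda s: scores[s][t], reverse=True)[:k]
        for t in targets
    }


if __name__ == "__main__":
    print(select_by_average(SCORES, k=1))
    print(select_per_target(SCORES, k=1))
```

In this toy table the average-based ranking picks the generalist `task_c`, while the per-target ranking picks a different specialist for each target, which is the kind of divergence the abstract's comparison is about.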

Problem

Research questions and friction points this paper is trying to address.

general reasoning

reinforcement learning

verifiable rewards

causal inference

temporal understanding

Innovation

Methods, ideas, or system contributions that make the work stand out.

SUPERNOVA

Reinforcement Learning with Verifiable Rewards

general reasoning