🤖 AI Summary
Large language models exhibit limited performance on general reasoning tasks such as causal inference and temporal understanding, primarily due to the scarcity of high-quality, verifiable, and diverse training data. This work proposes SUPERNOVA, a data curation framework that systematically adapts instruction-tuning datasets with expert-annotated ground truth into training signals suitable for Reinforcement Learning with Verifiable Rewards (RLVR). The study further investigates the impact of source task selection, task mixing strategies, and synthetic interventions on reasoning performance. Experimental results show that models trained on SUPERNOVA outperform strong baselines, including Qwen3.5, on benchmarks such as BBEH, Zebralogic, and MMLU-Pro, with relative improvements of up to 52.8% on BBEH across model sizes. These findings underscore the critical role of carefully engineered training data in enhancing general reasoning capabilities.
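To make the idea of a verifiable training signal concrete, the following is a minimal sketch of how an expert-annotated answer could be turned into a binary reward. It is an illustration under stated assumptions, not the SUPERNOVA implementation: the function names, the "Answer:" marker convention, and the exact-match criterion are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences do not change the reward."""
    return re.sub(r"\s+", " ", text.strip().lower())


def extract_answer(response: str) -> str:
    """Assumed convention (hypothetical): the final answer follows the last
    'Answer:' marker. If no marker is present, use the whole response."""
    lowered = response.lower()
    marker = "answer:"
    idx = lowered.rfind(marker)
    return lowered[idx + len(marker):] if idx != -1 else lowered


def verifiable_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted answer exactly matches the annotated
    ground truth after normalization, else 0.0."""
    return 1.0 if normalize(extract_answer(response)) == normalize(gold) else 0.0
```

A reward of this kind can be checked automatically, which is the property RLVR relies on; tasks with free-form answers would likely need a different matching rule.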
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance on individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.
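The contrast between per-target and average-based source task selection can be sketched in a few lines. This is a toy illustration only: the task names and scores below are invented for the example and are not results from the paper, and the helper functions are hypothetical rather than part of the released code.

```python
from typing import Dict, List

# Hypothetical scores: scores[source_task][target_task] is the downstream
# accuracy measured after RL training on that source task alone.
# All numbers are made up for illustration.
scores: Dict[str, Dict[str, float]] = {
    "source_a": {"causal": 0.60, "temporal": 0.20},
    "source_b": {"causal": 0.45, "temporal": 0.45},
    "source_c": {"causal": 0.15, "temporal": 0.65},
}


def select_by_average(scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Rank source tasks by their mean score over all target tasks."""
    avg = {s: sum(t.values()) / len(t) for s, t in scores.items()}
    return sorted(avg, key=lambda s: avg[s], reverse=True)[:k]


def select_per_target(scores: Dict[str, Dict[str, float]], k: int) -> Dict[str, List[str]]:
    """For each target task, rank source tasks by their score on that target."""
    targets = list(next(iter(scores.values())).keys())
    return {
        t: sorted(scores, key=lambda s: scores[s][t], reverse=True)[:k]
        for t in targets
    }


print(select_by_average(scores, k=1))
print(select_per_target(scores, k=1))
```

In this toy setting, average-based selection picks the source task that is moderately useful everywhere, while per-target selection picks a different specialist for each target, which is the distinction the abstract draws.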