🤖 AI Summary
This work addresses the data inefficiency of Reinforcement Learning with Verifiable Rewards (RLVR), a key technique for enhancing the reasoning of large language models. Existing data selection methods each lack at least one of subset-level coverage, use of verifier signals, or interpretability. The authors propose IRDS (Interpretable RLVR Data Selection), which selects training instances on a basis of Sparse Autoencoder (SAE) clusters, so that each selection can be audited in terms of recognizable problem motifs. They introduce a verifier-coupled coverage objective on this basis and solve it with greedy log-determinant maximization, targeting instances the model currently fails on but can still learn from. Across three instruction-tuned models and six mathematical reasoning benchmarks, IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 percentage points on the two Qwen models and by +0.5 on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing LLM reasoning, yet its data inefficiency remains a major bottleneck. Existing methods address this problem only partially, each missing at least one of subset-level coverage, verifier signal use, or interpretability. To address this gap, we present IRDS (Interpretable RLVR Data Selection), which selects RLVR training instances on a sparse autoencoder (SAE) cluster basis so the selection itself is auditable on recognizable problem motifs. To select instances the model both fails on and can still learn from, we introduce a verifier-coupled coverage objective on the SAE basis and solve it by greedy log-determinant maximization. Experiments on three instruction-tuned models and six math reasoning benchmarks show that IRDS achieves the highest overall accuracy, exceeding the strongest baseline by +3.9/+4.0 pp on the two Qwen models and by +0.5 pp on Llama-3.1-8B, while running an order of magnitude cheaper than the trajectory-based baseline.
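
The abstract names greedy log-determinant maximization but does not spell out the objective, so the following is only a generic sketch of that technique, not the IRDS implementation. It assumes each training instance is described by a vector of SAE cluster activations (`features`) and a verifier-derived weight such as the model's failure rate on that instance (`weights`); both names, the `ridge` regularizer, and the way the weight enters the objective are illustrative assumptions.

```python
import numpy as np

def greedy_logdet_select(features, weights, k, ridge=1e-3):
    """Greedily pick k rows that maximize
    log det(ridge * I + sum_i w_i * x_i x_i^T) over the selected rows.

    features: (n, d) array, hypothetical SAE cluster activations per instance.
    weights:  (n,) non-negative array, hypothetical verifier signal
              (e.g. failure rate); assumed, not taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n, d = features.shape
    # Scaling a row by sqrt(w) makes w multiply its outer product x x^T.
    X = features * np.sqrt(weights)[:, None]
    A_inv = np.eye(d) / ridge  # inverse of the current matrix, starts at (ridge * I)^-1
    available = np.ones(n, dtype=bool)
    selected = []
    for _ in range(min(k, n)):
        # Matrix determinant lemma: adding x raises log det by log(1 + x^T A^-1 x).
        gains = np.log1p(np.einsum("ij,jk,ik->i", X, A_inv, X))
        gains[~available] = -np.inf
        best = int(np.argmax(gains))
        selected.append(best)
        available[best] = False
        # Sherman-Morrison rank-one update of the inverse.
        u = A_inv @ X[best]
        A_inv -= np.outer(u, u) / (1.0 + X[best] @ u)
    return selected

if __name__ == "__main__":
    # Four toy instances over three clusters; instances 0 and 1 share a cluster.
    feats = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]])
    fail_rate = np.ones(4)
    print(greedy_logdet_select(feats, fail_rate, k=3))
```

In this sketch, a second instance from an already covered cluster yields a much smaller gain than the first instance from an uncovered one, which is the coverage behavior a log-determinant objective is usually chosen for. An instance with weight zero contributes no gain, so it is picked only after every positively weighted instance.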