GRACE: Gradient-aligned Reasoning Data Curation for Efficient Post-training

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing reasoning-data filtering methods assign a single score to each sample, overlooking the uneven contributions of individual steps within a reasoning chain. This work models reasoning traces as sequences of optimization events and introduces a fine-grained step-level scoring mechanism that requires no external reward models or human annotations. Each step is scored by its alignment with the answer-oriented gradient direction and its consistency with the preceding reasoning steps. A representation-level gradient proxy keeps these scores cheap to compute, and they are aggregated into a sample-level value for subset selection. Post-training Qwen3-VL-2B-Instruct on only 5% of MMathCoT-1M achieves 100.2% of full-data performance, and 20% of the data reaches 108.8%. The selected subsets also transfer well across model backbones.

📝 Abstract

Existing reasoning data curation pipelines score whole samples, treating every intermediate step as equally valuable. In reality, steps within a trace contribute very unevenly, and selecting reasoning data well requires assessing them individually. We present GRACE, a gradient-aligned curation method that views each reasoning trace as a sequence of optimization events and scores every step by two complementary signals: its alignment with the answer-oriented gradient direction, and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into a sample-level value for subset selection, using only the model's internal optimization signals and no external reward models or step annotations. To make this scalable, GRACE introduces a representation-level gradient proxy that estimates step-level alignment from token-level upstream signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reaches 108.8% of the full-data performance with 20% of the data and retains 100.2% with only 5%, with subsets that transfer effectively across model backbones.
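
The scoring-and-selection flow described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formulas, so the cosine-based alignment, the running sum used for trajectory consistency, the mixing weight `lam`, the mean aggregation, and all function names are assumptions made for illustration. The input vectors stand in for whatever per-step gradient proxies the method actually computes.

```python
import numpy as np


def cosine(u, v, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def step_scores(step_grads, answer_grad, lam=0.5):
    """Score each step of one trace (hypothetical formulation).

    alignment:   cosine between the step vector and the answer direction
    consistency: cosine between the step vector and the sum of earlier steps
    lam:         assumed mixing weight between the two signals
    """
    scores = []
    running = np.zeros_like(answer_grad, dtype=float)
    for t, g in enumerate(step_grads):
        g = np.asarray(g, dtype=float)
        align = cosine(g, answer_grad)
        # The first step has no preceding trajectory, so consistency is 0.
        consist = cosine(g, running) if t > 0 else 0.0
        scores.append(lam * align + (1.0 - lam) * consist)
        running = running + g
    return np.array(scores)


def sample_value(step_grads, answer_grad, lam=0.5):
    """Aggregate step scores into one sample-level value (assumed: mean)."""
    return float(step_scores(step_grads, answer_grad, lam).mean())


def select_top_fraction(values, fraction):
    """Return sorted indices of the highest-valued samples within the budget."""
    k = max(1, int(round(fraction * len(values))))
    order = np.argsort(-np.asarray(values, dtype=float), kind="stable")
    return sorted(order[:k].tolist())


if __name__ == "__main__":
    answer = np.array([1.0, 0.0])
    trace_a = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]   # steps point toward the answer
    trace_b = [np.array([-1.0, 0.0]), np.array([0.0, 1.0])]  # steps drift away from it
    values = [sample_value(trace_a, answer), sample_value(trace_b, answer)]
    print(values, select_top_fraction(values, 0.5))
```

In this toy setup, a trace whose steps point toward the answer direction and agree with one another receives a higher value than a trace whose steps oppose or wander from it, so it is kept first under a fixed data budget.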

Problem

Research questions and friction points this paper is trying to address.

reasoning data curation

step-level evaluation

gradient alignment

post-training

data efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

gradient-aligned curation

step-level scoring

reasoning data selection