🤖 AI Summary
Existing domain adaptation methods fail to fully unlock the native reasoning capabilities of large reasoning models (LRMs), and direct fine-tuning on non-reflective data yields limited gains. To address this, the authors propose a lightweight corrective-adaptation paradigm: expert-provided hints guide the model to synthesize high-quality training data *within its own reasoning traces*, improving native reasoning chains while modifying fewer than 2.6% of generated tokens. The framework combines supervised fine-tuning with reinforcement learning: an expert intervener identifies reasoning flaws and supplies concise corrective signals, enabling progressive self-improvement. Built on this framework, the 4B-parameter STORM model achieves an average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B-parameter model and significantly improving both adaptation efficiency and generalization for small-scale LRMs.
📝 Abstract
Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs; in particular, we show that direct fine-tuning on traditional *non-reflective* datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose **CALM** (*Corrective Adaptation with Lightweight Modification*), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, yet yield high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop **STORM** (*Smart Thinking Optimization Reasoning Model*), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B-parameter LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path toward expert-level performance on challenging optimization modeling tasks.
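The abstract quantifies interventions by the fraction of generated tokens they modify (fewer than 2.6%). The paper's exact metric definition is not given here; the sketch below is one plausible way to measure such a budget, using a standard sequence alignment between the original and hint-corrected reasoning traces. The traces and the `token_modification_fraction` helper are illustrative assumptions, not the authors' implementation.

```python
import difflib

def token_modification_fraction(original_tokens, corrected_tokens):
    """Fraction of the original trace's tokens that do not survive
    into the corrected trace, measured via longest-matching-block
    alignment (an assumed metric, not the paper's definition)."""
    sm = difflib.SequenceMatcher(a=original_tokens, b=corrected_tokens)
    matched = sum(size for _, _, size in sm.get_matching_blocks())
    return 1.0 - matched / max(len(original_tokens), 1)

# Toy traces: the "correction" inserts a missing constraint and
# swaps one wrong word, leaving most of the trace untouched.
original = "let x be the number of units produced then maximize profit".split()
corrected = ("let x be the number of units produced "
             "subject to capacity then maximize revenue").split()

frac = token_modification_fraction(original, corrected)
print(f"modified fraction: {frac:.3f}")
```

On these toy traces only 1 of 11 original tokens is replaced (insertions cost nothing under this metric), so the reported fraction stays small, which is the spirit of the paper's lightweight-modification claim.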