🤖 AI Summary
Existing code agents struggle to balance reasoning over long-horizon repository state with disciplined tool usage. This work proposes a training-free parameter-editing approach that injects the reasoning capabilities of a Thinking model into an Instruct model via null-space editing, while preserving the latter’s tool-following proficiency. The method builds a pool of candidate reasoning-edit directions from the weight differences between the two models, then combines magnitude-threshold denoising, a Conservative Taylor Gate, and Graduated Sigmoidal Projection to fuse the two capabilities efficiently and controllably. Evaluated on Roo-Eval, SWE-bench-Verified, and Terminal-Bench v2, the approach outperforms both individual models, boosting Roo-Eval pass@1 by up to 19.5%, and consistently surpasses alternative fusion strategies.
📝 Abstract
Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency. On Roo-Eval it achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B. On SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500). On Terminal-Bench v2 it improves pass@1/pass@5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively. CRANE consistently outperforms alternative merging strategies across all three benchmarks.
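The three-stage pipeline described above can be illustrated with a minimal NumPy sketch for a single weight matrix. The abstract does not give CRANE's formulas, so everything here is an assumption made for illustration: the function names, the keep-fraction rule, the first-order sign test used as a gate, and the sigmoid parameterization of projection strength are all hypothetical and should not be read as the authors' implementation.

```python
import numpy as np


def magnitude_threshold(delta, keep_fraction=0.2):
    """Keep only the largest-magnitude entries of delta (assumed denoising rule)."""
    flat = np.abs(delta).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    idx = flat.size - k
    # Value at sorted position idx is the k-th largest magnitude.
    cutoff = np.partition(flat, idx)[idx]
    return np.where(np.abs(delta) >= cutoff, delta, 0.0)


def taylor_gate(delta, grad_reason, grad_tool):
    """Hypothetical conservative gate based on a first-order Taylor estimate.

    An entry is kept only if it is predicted to lower the reasoning loss
    (grad_reason * delta < 0) without raising the tool-use loss
    (grad_tool * delta <= 0).
    """
    keep = (grad_reason * delta < 0) & (grad_tool * delta <= 0)
    return np.where(keep, delta, 0.0)


def sigmoid_strengths(scores, midpoint=0.0, sharpness=1.0):
    """Map per-direction importance scores to suppression strengths in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-sharpness * (scores - midpoint)))


def soft_project_out(delta, protected_dirs, strengths):
    """Attenuate delta along protected (format-critical) input directions.

    protected_dirs: (r, d_in) array with orthonormal rows.
    strengths: (r,) array; 1.0 removes a direction entirely, 0.0 leaves it alone.
    """
    coeffs = delta @ protected_dirs.T  # (d_out, r) components along each direction
    return delta - (coeffs * strengths) @ protected_dirs


def inject(w_instruct, w_thinking, grad_reason, grad_tool,
           protected_dirs, scores, keep_fraction=0.2):
    """Apply threshold, gate, and soft projection, then add the edit to Instruct."""
    delta = w_thinking - w_instruct
    delta = magnitude_threshold(delta, keep_fraction)
    delta = taylor_gate(delta, grad_reason, grad_tool)
    delta = soft_project_out(delta, protected_dirs, sigmoid_strengths(scores))
    return w_instruct + delta
```

In this sketch the gradients and the protected directions are inputs; how they would be estimated (for example, from calibration data on reasoning and tool-use tasks) is left open, since the abstract does not specify it.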