🤖 AI Summary
Large reasoning models (LRMs) suffer from low accuracy and poor computational efficiency in complex mathematical reasoning, primarily because their natural-language chain-of-thought reasoning cannot efficiently invoke external computational tools. To address this, we propose the first post-training framework tailored for LRM–code interpreter (CI) collaboration. Our method comprises: (1) Hint-Engineering, a novel data synthesis technique that strategically inserts hints at appropriate positions in reasoning traces to optimize LRM–CI interaction; and (2) a multi-stage post-training paradigm, comprising supervised fine-tuning, rejection fine-tuning, and reinforcement learning, adapted to models of varying scales. Evaluated on five challenging mathematical reasoning benchmarks, our approach improves accuracy by 4% absolute for a 32B-parameter model and by 8% absolute for a 1.5B-parameter model, while reducing generated tokens by about 30% and 50%, respectively. These results demonstrate substantial gains in both reasoning accuracy and inference efficiency.
📝 Abstract
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations with computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: a Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, so naively combining the two is inefficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters with supervised fine-tuning, rejection fine-tuning, and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4% and 8% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30% fewer tokens for the 32B model and 50% fewer tokens for the 1.5B model compared with the natural-language-only models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
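To make the Hint-Engineering idea concrete, the sketch below illustrates the two moving parts the abstract describes: inserting a hint at a chosen position in a reasoning trace, and routing a computation through a code interpreter whose output is fed back into the trace. All function names, the hint text, and the anchoring heuristic here are hypothetical illustrations, not the paper's actual implementation.

```python
import io
import contextlib

# Hypothetical hint text; CoRT inserts hints like this at strategically
# chosen positions in the chain-of-thought to nudge the model toward code.
HINT = "\nWait, using Python here is a better choice.\n"

def insert_hint(trace: str, anchor: str) -> str:
    """Insert the hint right after the first occurrence of `anchor`
    (e.g., the start of a heavy manual computation) in the trace."""
    pos = trace.find(anchor)
    if pos == -1:
        return trace  # no suitable position found; leave the trace unchanged
    cut = pos + len(anchor)
    return trace[:cut] + HINT + trace[cut:]

def run_code_block(code: str) -> str:
    """Minimal stand-in for a code interpreter: execute the snippet and
    capture whatever it prints, to be appended as CI feedback."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

trace = "We need 17 * 23 + 5. Let me compute this step by step."
hinted = insert_hint(trace, "step by step.")
feedback = run_code_block("print(17 * 23 + 5)")
print(hinted + "[CI output] " + feedback)
```

In the actual framework the hinted traces are produced once as training data (the 30 manually created samples), and post-training then teaches the model to call the interpreter on its own without inserted hints at inference time.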