START: Self-Taught Reasoner with Tools

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models (e.g., o1, R1) rely solely on internal chain-of-thought reasoning and consequently suffer from hallucination and low computational efficiency. To address this, the paper proposes a long-chain tool-augmented reasoning framework that integrates code execution for self-validation, multi-path exploration, and self-debugging. It introduces Hint-infer, a prompting technique that triggers tool invocation without in-context examples, and Hint-RFT, a rejection-sampling fine-tuning method that also enables test-time scaling of reasoning steps. The implementation builds on QwQ-32B, combining human-designed hint insertion, trajectory scoring and filtering, tool-call supervised fine-tuning, and a sandboxed Python execution environment. Evaluated on GPQA, AMC23, AIME24, AIME25, and LiveCodeBench, the method achieves 63.6%, 95.0%, 66.7%, 47.1%, and 47.3% accuracy, respectively, substantially outperforming strong baselines and matching state-of-the-art open- and closed-source models.

📝 Abstract
Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., ``Wait, maybe using Python here is a good idea.'') during the inference process of an LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by an LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
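The Hint-infer loop described in the abstract can be sketched in a few lines: when the model's continuation contains no tool call, a hint string is injected into the trace to nudge it toward emitting Python, and any emitted code block is executed with its output fed back. This is a minimal, hypothetical sketch; `generate` stands in for the LRM, and the hint text, round limit, and code-block convention are assumptions rather than the paper's exact implementation.

```python
import re
import io
import contextlib

HINT = "Wait, maybe using Python here is a good idea."

def run_python_block(code: str) -> str:
    """Execute a code block and capture stdout (a stand-in for the
    paper's sandboxed Python execution environment)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def hint_infer(generate, prompt: str, max_rounds: int = 3) -> str:
    """generate(text) -> continuation. If a round produces no tool
    call, inject the hint and continue; if it produces a ```python
    block, execute it and append the tool output to the trace."""
    trace = prompt
    for _ in range(max_rounds):
        trace += generate(trace)
        match = re.search(r"```python\n(.*?)```", trace, re.DOTALL)
        if match:
            result = run_python_block(match.group(1))
            trace += f"\n[Tool output] {result}\n"
            break
        trace += "\n" + HINT + "\n"  # no tool call yet: inject the hint
    return trace
```

The same injection point can be reused for sequential test-time scaling: appending further hints before the final answer buys the model additional reasoning-plus-tool rounds.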
Problem

Research questions and friction points this paper is trying to address.

LRMs rely solely on internal chain-of-thought reasoning, leading to hallucinations
Purely internal long CoT is computationally inefficient for complex calculations
Eliciting tool use from an LRM normally requires demonstration data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Python code execution into long CoT for computation, self-checking, and self-debugging
Hint-infer: inserted hints trigger tool use without demonstrations and enable sequential test-time scaling
Hint-RFT: rejection-sampling fine-tuning on scored and filtered tool-invocation trajectories