🤖 AI Summary
Problem: Large language models (LLMs) remain unreliable at precise computation, symbolic manipulation, and algorithmic reasoning, because textual reasoning lacks the rigor of code execution and the models cannot reliably decide on their own when to invoke executable code.
Method: We propose a multi-stage training framework that teaches a text-only LLM to use a Code Interpreter (CI). It comprises multi-turn supervised fine-tuning (SFT) on a highly diverse task set, followed by GRPO- or PPO-based reinforcement learning (RL) with optional masking of code outputs.
Contribution/Results: Our work systematically demonstrates SFT’s critical role in acquiring robust code-calling capability and elicits emergent, code-driven self-verification behavior. Applied to the Qwen-2.5 series (3B/7B/14B), the resulting R1-CI-14B raises average accuracy on the 37 test tasks from 44.0% to 64.1%, surpassing text-only GPT-4o (58.6%) and approaching GPT-4o with Code Interpreter (70.9%).
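As a rough illustration of the GRPO side of the RL stage, the sketch below computes group-relative advantages in the commonly described GRPO form: each sampled answer's reward is normalized by the mean and standard deviation of its own sampling group. The function name, the binary rewards, and the `eps` value are illustrative assumptions, not details taken from this work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar per sampled answer to the same question.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers: 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every answer gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative advantages, without a separately trained value model; PPO, by contrast, typically relies on a learned critic for this baseline.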
📝 Abstract
Despite advances in the reasoning and planning abilities of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. In contrast to prior RL work on narrow domains, we find that Code Interpreter training is significantly harder because of high task diversity and expensive code execution, which highlights the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), and it exhibits emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
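To make the "masked vs. unmasked code outputs" comparison concrete, here is a minimal sketch of one plausible reading: tokens returned by the interpreter are excluded from the training loss, so the model is only optimized on text and code it generated itself. The `Segment` type, the `"model"`/`"interpreter"` labels, and the `<code>`/`<output>` tags are hypothetical and chosen for illustration; the released repository may structure trajectories differently.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str   # "model" or "interpreter" (hypothetical labels)
    tokens: list  # tokens belonging to this part of the trajectory


def build_loss_mask(segments):
    """Return 1 for model-generated tokens and 0 for interpreter output."""
    mask = []
    for seg in segments:
        keep = 1 if seg.source == "model" else 0
        mask.extend([keep] * len(seg.tokens))
    return mask


def masked_mean(per_token_losses, mask):
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# One multi-turn trajectory: reasoning + code, execution output, final answer.
trajectory = [
    Segment("model", ["Let", "me", "compute", "<code>", "print(17*23)", "</code>"]),
    Segment("interpreter", ["<output>", "391", "</output>"]),
    Segment("model", ["The", "answer", "is", "391"]),
]
mask = build_loss_mask(trajectory)

# Made-up per-token losses: interpreter tokens get a large loss on purpose
# to show that masking removes their influence on the average.
losses = [1.0] * 6 + [5.0] * 3 + [1.0] * 4
masked_loss = masked_mean(losses, mask)
unmasked_loss = sum(losses) / len(losses)
```

In the unmasked variant every token contributes to the loss, including interpreter output that the model did not produce; the masked variant restricts the training signal to the model's own reasoning and code.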