🤖 AI Summary
Reasoning models (RMs) deliver strong inference capabilities but are expensive to train and run. CodeAdapt is a lightweight, fine-tuning-free recipe that elicits strong reasoning from standard instruction-tuned language models (LMs) by combining the CodeAct framework, in which the model interleaves natural-language reasoning with executable code over multiple steps, with few-shot bootstrap in-context learning from as few as five training problems. Across eight reasoning-intensive tasks and four matched LM/RM pairs, CodeAdapt enables three of the four LMs to outperform their corresponding RMs on average (by up to 22.9%) while being 10-81% more token efficient, and yields superior performance on six of the eight tasks when averaged over the four models (by up to 35.7%). The resulting code-augmented reasoning traces display rich and varied problem-solving strategies, pointing to a low-cost, token-efficient reasoning paradigm and a potential foundation for in-weight reinforcement learning.
📝 Abstract
Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find that CodeAdapt enables three LMs to outperform the corresponding RMs on average over eight tasks (up to 22.9%) while being 10-81% more token efficient, and delivers superior performance on six tasks when averaged over the four models (up to 35.7%). Furthermore, the code-augmented reasoning traces display rich and varied problem-solving strategies. Our findings suggest that (1) CodeAdapt-style learning and reasoning may be robust and domain-general and (2) code-enabled LMs are cognitively grounded and powerful systems, potentially providing a strong foundation for in-weight reinforcement learning.
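The multi-step interleaving of natural-language reasoning and code execution described above can be sketched as a simple agent loop. The sketch below is illustrative only, not the paper's implementation: `codeact_loop`, `run_code`, and the `stub_model` standing in for an actual LM call are all hypothetical names introduced here for exposition.

```python
import contextlib
import io


def run_code(snippet, env):
    """Execute a code snippet, capturing stdout as the observation."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(snippet, env)  # illustrative; a real system would sandbox this
    return buf.getvalue().strip()


def codeact_loop(model_step, max_steps=5):
    """Alternate model actions (reasoning + code) with execution feedback
    until the model emits a final answer or the step budget runs out."""
    env, history = {}, []
    for _ in range(max_steps):
        action = model_step(history)  # stand-in for an LM call on the trace
        if action["type"] == "answer":
            return action["content"], history
        observation = run_code(action["code"], env)
        history.append((action["code"], observation))  # fed back as context
    return None, history


# Toy "model": first step emits code for the sum of squares of 1..10,
# second step reads the execution result back as its final answer.
def stub_model(history):
    if not history:
        return {"type": "code",
                "code": "print(sum(i * i for i in range(1, 11)))"}
    return {"type": "answer", "content": history[-1][1]}


answer, trace = codeact_loop(stub_model)
print(answer)  # 385
```

The key design point the loop illustrates is that each execution result is appended to the interaction history, so later reasoning steps can condition on concrete intermediate computations rather than on the model's own unverified arithmetic.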