🤖 AI Summary
Existing direct preference learning (DPL) methods are designed for single-turn dialogue modeling and thus struggle to handle multi-turn tool-augmented reasoning and mathematical problem solving. This work introduces the first trajectory-level DPL framework tailored for multi-turn tool-integrated reasoning. It proposes multi-turn variants of Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) that explicitly model preferences over complete reasoning trajectories. The approach integrates code interpreter feedback into tool-augmented chain-of-thought reasoning and trains on an augmented prompt set built from GSM8K and MATH with trajectory-level preference annotation. Experiments demonstrate substantial improvements: Gemma-1.1-it-7B achieves +6.4% on GSM8K and +5.1% on MATH; Gemma-2-it-9B attains +2.2% and +3.5%, respectively, both significantly outperforming supervised fine-tuning baselines. These results validate the efficacy of multi-turn trajectory-level preference modeling for complex reasoning tasks.
📝 Abstract
Recent studies have shown that large language models' (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms were originally designed for the single-turn chat setting and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated by training various language models on an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model's performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.
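To make the trajectory-level objective concrete, the sketch below shows a multi-turn DPO-style loss in plain Python. It is a minimal illustration, not the paper's implementation: function names, the `beta` value, and the input format (precomputed per-turn sequence log-probabilities, with a mask excluding tool/interpreter observation turns from the policy's log-probability, since the model does not generate those tokens) are all assumptions for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically plain logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def trajectory_logprob(turn_logps: list[float], is_model_turn: list[bool]) -> float:
    """Sum log-probabilities over model-generated turns only.

    Tool/interpreter observation turns are masked out: the policy does not
    produce them, so they should not contribute to the trajectory likelihood.
    """
    return sum(lp for lp, m in zip(turn_logps, is_model_turn) if m)

def multi_turn_dpo_loss(policy_w, policy_l, ref_w, ref_l,
                        mask_w, mask_l, beta: float = 0.1) -> float:
    """Trajectory-level DPO loss for one (preferred, dispreferred) pair.

    L = -log sigma(beta * [log pi(tau_w)/pi_ref(tau_w)
                           - log pi(tau_l)/pi_ref(tau_l)]),
    where each trajectory log-ratio sums over model turns only.
    """
    logratio_w = trajectory_logprob(policy_w, mask_w) - trajectory_logprob(ref_w, mask_w)
    logratio_l = trajectory_logprob(policy_l, mask_l) - trajectory_logprob(ref_l, mask_l)
    return -math.log(sigmoid(beta * (logratio_w - logratio_l)))
```

In a real training loop these per-turn log-probabilities would come from scoring the same trajectories under the policy and a frozen reference model; the loss pushes the policy to raise the likelihood of the preferred trajectory relative to the dispreferred one, with `beta` controlling how far it may drift from the reference.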