🤖 AI Summary
Large language models (LLMs) struggle to reflect on and correct their reasoning across multi-turn interactions, often falling into repetitive responses because they under-use contextual feedback. Method: We propose a lightweight multi-turn reinforcement learning paradigm centered on the “Unary Feedback as Observation” (UFO) mechanism, which leverages minimal unary feedback signals (e.g., “try again”) to guide reasoning-path adjustments without requiring explicit error annotations or architectural modifications. The approach integrates directly with existing single-turn RL training setups and combines a multi-turn dialogue structure with fine-grained reward shaping to encourage more careful and diverse intermediate reasoning steps. Contribution/Results: Experiments show that the method improves multi-turn reasoning accuracy by up to 14% without degrading single-turn performance, making LLMs significantly more responsive to sparse feedback and better at self-correction.
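As a rough illustration of the loop described above, here is a minimal sketch of a UFO-style multi-turn rollout. The `model.generate` and `verifier` interfaces, the feedback string, and the turn budget are illustrative assumptions, not the released implementation:

```python
# Minimal sketch of a UFO-style multi-turn rollout.
# `model.generate` and `verifier` are hypothetical interfaces for illustration.

UNARY_FEEDBACK = "Let's try again."
MAX_TURNS = 5

def ufo_rollout(model, problem, verifier, max_turns=MAX_TURNS):
    """Query the model repeatedly; after each wrong answer, append only a
    fixed unary feedback message (no error details) and ask again."""
    messages = [{"role": "user", "content": problem}]
    for turn in range(max_turns):
        answer = model.generate(messages)          # model's attempt this turn
        messages.append({"role": "assistant", "content": answer})
        if verifier(problem, answer):              # verifiable-reward check
            return messages, turn + 1, True        # solved after `turn + 1` turns
        # The only observation added before the next turn is the unary signal.
        messages.append({"role": "user", "content": UNARY_FEEDBACK})
    return messages, max_turns, False              # unsolved within the budget
```

The same rollout format can feed a standard single-turn RL trainer, since each episode is just a longer conversation ending in a verifiable reward.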
📝 Abstract
Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs), which must reflect on their reasoning and revise their answers from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models in a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose the ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect on their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO preserves single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to react better to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
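The reward structures mentioned at the end of the abstract are not spelled out here; the sketch below assumes one plausible shaping scheme in which the episode reward decays with the number of turns used and is reduced when the model repeats an earlier answer verbatim. The decay factor, penalty weight, and function name are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of a turn-aware shaped reward (constants are illustrative).

def shaped_reward(solved, turns_used, answers, decay=0.9, repeat_penalty=0.1):
    """Reward an episode more when it succeeds in fewer turns and when its
    intermediate attempts are diverse rather than repeated."""
    base = 1.0 if solved else 0.0
    # Fewer turns -> larger reward, so the model prefers careful early answers.
    reward = base * (decay ** (turns_used - 1))
    # Penalize verbatim repeats so wrong attempts stay diverse across turns.
    repeats = len(answers) - len(set(answers))
    return reward - repeat_penalty * repeats
```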