ReasonIF: Large Reasoning Models Fail to Follow Instructions During Reasoning

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a critical challenge: large reasoning models frequently fail to adhere to user instructions (such as multilingual reasoning, output formatting, and length constraints) within their chain-of-thought reasoning. To address this, the authors introduce ReasonIF, the first systematic benchmark for evaluating instruction following during reasoning, and explore two mitigation strategies: multi-turn reasoning and Reasoning Instruction Finetuning (RIF). Experiments on open-source models including GPT-OSS, Qwen3, and DeepSeek-R1 reveal substantial failures: across six instruction categories, the highest instruction following score (IFS) remains below 0.25. Applying RIF raises the IFS of GPT-OSS-20B from 0.11 to 0.27, a measurable improvement that still leaves ample room for progress. The result is a foundational benchmark and actionable technical pathways toward more controllable, transparent, and trustworthy reasoning systems.

📝 Abstract
The ability of large language models (LLMs) to follow user instructions is central to their reliability, safety, and usefulness. While prior studies assess instruction adherence in the model's main responses, we argue that it is also critical for large reasoning models (LRMs) to follow user instructions throughout their reasoning process. Reasoning instruction following makes LRMs more controllable and transparent, while reducing risks of undesirable shortcuts, hallucinations, or reward hacking within reasoning traces. To evaluate this dimension, we introduce ReasonIF, a systematic benchmark for assessing reasoning instruction following. ReasonIF includes six categories of instruction prompts, spanning multilingual reasoning, formatting and length control. Across many open-source LRMs including GPT-OSS, Qwen3, and DeepSeek-R1, we find substantial failures in reasoning instruction adherence: the highest instruction following score (IFS) remains below 0.25, meaning that fewer than 25% of reasoning traces comply with the given instructions. Notably, as task difficulty increases, reasoning instruction following degrades further. We also explore two strategies to enhance reasoning instruction fidelity: (1) multi-turn reasoning and (2) Reasoning Instruction Finetuning (RIF) using synthetic data. RIF improves the IFS of GPT-OSS-20B from 0.11 to 0.27, indicating measurable progress but leaving ample room for improvement.
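The IFS described in the abstract is the fraction of reasoning traces that comply with their paired instruction. A minimal sketch of how such a score could be computed, assuming simple programmatic checkers (the checker functions and toy traces below are illustrative, not taken from the paper):

```python
# Sketch: an instruction-following score (IFS) as the fraction of
# reasoning traces whose paired instruction checker passes.
# All checkers and data here are hypothetical examples.

def check_max_words(trace: str, limit: int = 50) -> bool:
    """Length-control instruction: reasoning must stay within a word budget."""
    return len(trace.split()) <= limit

def check_uppercase(trace: str) -> bool:
    """Formatting instruction: reasoning must be entirely uppercase."""
    return trace == trace.upper()

def instruction_following_score(traces, checkers) -> float:
    """Fraction of traces (0.0 to 1.0) whose paired checker passes."""
    passed = sum(1 for trace, check in zip(traces, checkers) if check(trace))
    return passed / len(traces)

# Toy example: one compliant trace, one non-compliant trace.
traces = [
    "STEP 1: ADD THE NUMBERS. STEP 2: REPORT THE SUM.",          # uppercase: passes
    "First I will think at great length about this problem...",  # uppercase: fails
]
checkers = [check_uppercase, check_uppercase]
print(instruction_following_score(traces, checkers))  # 0.5
```

An IFS below 0.25, as reported for the benchmarked models, would mean fewer than one in four traces passes its checker.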
Problem

Research questions and friction points this paper is trying to address.

Evaluating large reasoning models' ability to follow instructions during reasoning processes
Addressing substantial failures in reasoning instruction adherence across various LRMs
Developing methods to enhance reasoning instruction fidelity and controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the ReasonIF benchmark for evaluating instruction adherence during reasoning
Uses multi-turn reasoning to enhance instruction fidelity
Applies Reasoning Instruction Finetuning with synthetic data
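The multi-turn strategy above can be sketched as a retry loop: when a trace violates its instruction, the model is re-prompted with a corrective reminder. This is a minimal illustration, assuming a generic model callable (`generate` is a stand-in for any LRM API, stubbed here for demonstration):

```python
# Sketch of multi-turn reasoning: re-prompt with a reminder whenever the
# reasoning trace fails its instruction check. `generate` is a hypothetical
# stand-in for a model call; the stub below is for illustration only.

def multi_turn_reasoning(generate, prompt, check, max_turns=3):
    """Retry generation with a corrective reminder until `check` passes."""
    messages = [prompt]
    trace = ""
    for _ in range(max_turns):
        trace = generate(messages)
        if check(trace):
            return trace, True
        messages.append(
            "Reminder: your reasoning violated the instruction. "
            "Redo it while following the instruction exactly."
        )
    return trace, False

# Stub model: ignores the instruction on turn 1, complies after a reminder.
def stub_model(messages):
    return "UPPERCASE REASONING TRACE" if len(messages) > 1 else "lowercase reasoning"

trace, ok = multi_turn_reasoning(stub_model, "Reason in uppercase.", str.isupper)
print(ok)  # True
```

The same loop structure works with any programmatic checker (language, formatting, or length), which is what makes the strategy cheap to apply at inference time, at the cost of extra model calls.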