How Does Prefix Matter in Reasoning Model Tuning?

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the impact of instructional prefixes in supervised fine-tuning (SFT) on model reasoning, safety, and factuality. By systematically varying the retention ratio of such prefixes (0%–100%) in SFT data for the R1 model series and evaluating performance across adversarial safety benchmarks (WildJailbreak, StrongReject), mathematical reasoning (GSM8K), and factual consistency, the work reveals, for the first time, that specific prefixes can serve as implicit alignment anchors. These prefixes significantly enhance reasoning safety and structured reasoning capabilities without requiring additional reward mechanisms. Experimental results show that prefix retention improves Safe@1 accuracy by up to 6% and GSM8K performance by up to 7%, though the benefits are limited, or even detrimental, for factuality and programming tasks.

๐Ÿ“ Abstract
Recent alignment studies commonly remove introductory boilerplate phrases from supervised fine-tuning (SFT) datasets. This work challenges that assumption. We hypothesize that safety- and reasoning-oriented prefix sentences serve as lightweight alignment signals that can guide model decoding toward safer and more coherent responses. To examine this, we fine-tune three R1 series models across three core model capabilities: reasoning (mathematics, coding), safety, and factuality, systematically varying prefix inclusion from 0% to 100%. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy on adversarial benchmarks (WildJailbreak, StrongReject) and a +7% improvement on GSM8K reasoning. However, factuality and coding tasks show marginal or negative effects, indicating that prefix-induced narrowing of the search space primarily benefits structured reasoning. Token-level loss analysis further reveals that prefix tokens such as "revised" and "logically" incur higher gradient magnitudes, acting as alignment anchors that stabilize reasoning trajectories. Our findings suggest that prefix conditioning offers a scalable and interpretable mechanism for improving reasoning safety, serving as an implicit form of alignment that complements traditional reward-based methods.
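The core data-preparation step the abstract describes, retaining each example's instructional prefix at a controlled ratio from 0% to 100%, could be sketched as follows. This is a minimal illustration, not the paper's actual pipeline; the `prefix`/`body` field names and the `apply_prefix_retention` helper are hypothetical.

```python
import random

def apply_prefix_retention(examples, retention_ratio, seed=0):
    """Build SFT training targets, keeping each example's instructional
    prefix with probability `retention_ratio` (0.0 = strip all prefixes,
    1.0 = keep all). Hypothetical sketch of the paper's ablation setup.

    Each example is a dict with 'prefix' and 'body' strings.
    """
    rng = random.Random(seed)  # fixed seed for reproducible splits
    targets = []
    for ex in examples:
        if rng.random() < retention_ratio:
            targets.append(ex["prefix"] + " " + ex["body"])
        else:
            targets.append(ex["body"])
    return targets
```

Sweeping `retention_ratio` over {0.0, 0.25, 0.5, 0.75, 1.0} and fine-tuning on each variant would reproduce the kind of 0%–100% ablation the abstract reports.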
Problem

Research questions and friction points this paper is trying to address.

prefix conditioning
reasoning models
supervised fine-tuning
alignment
safety
Innovation

Methods, ideas, or system contributions that make the work stand out.

prefix conditioning
reasoning alignment
supervised fine-tuning
gradient anchors
implicit alignment