🤖 AI Summary
This work addresses the challenge of efficiently leveraging fine-grained human feedback for aligning large language models (LLMs). We propose a text-span-level fine-tuning method wherein annotators provide binary “like/dislike” judgments—along with rationales—on local spans of generated text. This drives left-to-right, iterative segment-wise rewriting, yielding a traceable, incremental revision chain. Adjacent revision steps are automatically paired to construct local preference pairs, which are optimized via Direct Preference Optimization (DPO) for structured alignment. Unlike conventional A/B ranking or full-sentence rewrites, our approach decomposes global preference learning into localized, stepwise, and interpretable alignment subtasks. Empirical results demonstrate significant improvements in alignment accuracy and training efficiency across multiple metrics of generation quality and user-preference matching. To our knowledge, this is the first method enabling fine-grained, traceable, feedback-driven generative preference modeling.
📝 Abstract
We present a method and dataset for fine-tuning language models with preference supervision using feedback-driven improvement chains. Given a model response, an annotator provides fine-grained feedback by marking “liked” and “disliked” spans and specifying what they liked or disliked about them. The base model then rewrites the disliked spans accordingly, proceeding from left to right, forming a sequence of incremental improvements. We construct preference pairs for direct alignment from each adjacent step in the chain, enabling the model to learn from localized, targeted edits. We find that our approach outperforms direct alignment methods based on standard A/B preference ranking or full contrastive rewrites, demonstrating that structured, revision-based supervision leads to more efficient and effective preference tuning.
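The pairing step described above can be illustrated with a minimal sketch. The function name, the dictionary schema (`prompt`/`chosen`/`rejected`, the format commonly used by DPO training libraries), and the example revision chain are all assumptions for illustration, not the paper's actual implementation:

```python
from typing import Dict, List

def chain_to_preference_pairs(prompt: str, chain: List[str]) -> List[Dict[str, str]]:
    """Turn an improvement chain into local preference pairs.

    For each adjacent pair of revisions, the later (revised) step is
    treated as the preferred ("chosen") response and its predecessor
    as the dispreferred ("rejected") one.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for worse, better in zip(chain, chain[1:])
    ]

# Hypothetical chain: base response, then two targeted span rewrites,
# each fixing one disliked span while leaving the rest intact.
chain = [
    "The mitochondria is the powerhouse of the cell, it makes energy.",
    "The mitochondrion is the powerhouse of the cell, it makes energy.",
    "The mitochondrion is the powerhouse of the cell; it produces ATP.",
]
pairs = chain_to_preference_pairs("Describe the mitochondrion.", chain)
```

A chain of *n* revisions yields *n − 1* preference pairs, each differing only in the edited span, which is what localizes the preference signal compared with a single pair of full contrastive rewrites.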