Preference-grounded Token-level Guidance for Language Model Fine-tuning

📅 2023-06-01
🏛️ Neural Information Processing Systems
📈 Citations: 16
Influential: 1
🤖 AI Summary
This work addresses the granularity mismatch between sequence-level human preferences and token-level training in language models. Methodologically, it introduces an alternating framework that iterates between grounding sequence-level preferences into token-level training guidance and improving the LM with that learned guidance: (i) it extends pairwise preference learning from imitation learning to variable-length LM generation and to preferences among multiple candidate generations; and (ii) it presents two minimalist LM training objectives that use the learned guidance, selected according to the amount of supervised data available. Evaluated on discrete-prompt generation and text summarization, the approach performs competitively on both tasks. The core contribution lies in bridging the granularity gap via preference-grounded token-level supervision.
📝 Abstract
Aligning language models (LMs) with preferences is an important problem in natural language generation. A key challenge is that preferences are typically provided at the sequence level while LM training and generation both occur at the token level. There is, therefore, a granularity mismatch between the preference and the LM training losses, which may complicate the learning problem. In this paper, we address this issue by developing an alternate training process, where we iterate between grounding the sequence-level preference into token-level training guidance, and improving the LM with the learned guidance. For guidance learning, we design a framework that extends the pairwise-preference learning in imitation learning to both variable-length LM generation and the utilization of the preference among multiple generations. For LM training, based on the amount of supervised data, we present two minimalist learning objectives that utilize the learned guidance. In experiments, our method performs competitively on two distinct representative LM tasks -- discrete-prompt generation and text summarization.
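The alternating process the abstract describes can be sketched in a toy form. The following is an illustrative sketch, not the paper's implementation: the "guidance model" here is just a per-token score table over a tiny vocabulary, fit with a pairwise (Bradley-Terry-style) objective on one preference pair of variable-length sequences, and the learned scores are then reused as per-token weights on stand-in LM losses. All names (`token_reward`, `seq_score`, the vocabulary size) are hypothetical.

```python
import math
import random

# Toy sketch of preference-grounded token-level guidance (assumed setup,
# not the paper's actual models): per-token scores over a 5-token vocabulary.
VOCAB_SIZE = 5
token_reward = [0.0] * VOCAB_SIZE  # learned token-level guidance

def seq_score(tokens):
    # Sequence-level score = sum of token-level rewards;
    # summation handles variable-length generations naturally.
    return sum(token_reward[t] for t in tokens)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One toy preference pair: `preferred` is judged better than `rejected`.
preferred = [1, 2, 1]
rejected = [3, 4]

# Step 1 (guidance learning): gradient ascent on the pairwise
# log-likelihood log sigmoid(score(preferred) - score(rejected)).
lr = 0.5
for _ in range(200):
    g = 1.0 - sigmoid(seq_score(preferred) - seq_score(rejected))
    for t in preferred:
        token_reward[t] += lr * g  # push preferred tokens' scores up
    for t in rejected:
        token_reward[t] -= lr * g  # push rejected tokens' scores down

# Step 2 (LM training): reuse the guidance as per-token weights on the
# LM loss, so tokens credited by the preference are emphasized.
random.seed(0)
nll = [random.random() for _ in preferred]  # stand-in per-token NLL values
raw = [math.exp(token_reward[t]) for t in preferred]
total = sum(raw)
weights = [r / total for r in raw]  # normalized token-level weights
weighted_loss = sum(w * l for w, l in zip(weights, nll))
```

In the actual method these two steps alternate: the guidance is refit as the LM's generations change, and the LM is then improved under the refreshed guidance.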
Problem

Research questions and friction points this paper addresses.

Natural Language Generation
Human Preferences
Sequential Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Preference-Guided Learning
Adaptive Text Generation
Flexible Training Strategy