Learning from Natural Language Feedback for Personalized Question Answering

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized question-answering methods rely on scalar rewards in reinforcement learning, yielding weak, poorly actionable feedback that limits optimization efficiency and personalization quality. This paper proposes VAC, the first framework to integrate Natural Language Feedback (NLF) into personalized LLM training. VAC generates semantically rich, actionable feedback conditioned on user profiles and question context, replacing scalar rewards. It employs a RAG-augmented training scheme that alternates between a policy model and a feedback model, enabling autonomous NLF generation via conditional language modeling. Evaluated on the multi-domain LaMP-QA benchmark, VAC significantly outperforms state-of-the-art methods, and human evaluation confirms substantial improvements in both response relevance and personalization fidelity. The core contribution is the principled use of structured natural language feedback to internalize personalization capabilities, establishing a novel paradigm for personalized LLM training.

📝 Abstract
Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
Problem

Research questions and friction points this paper is trying to address.

Enhancing personalization in language models for question answering
Replacing scalar rewards with natural language feedback for better learning
Improving response quality through iterative feedback and model refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses natural language feedback for personalization
Replaces scalar rewards with NLF for training
Alternates feedback and policy model optimization
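
The alternating feedback/policy optimization described above can be sketched as follows. This is a toy illustration only: `StubModel`, `alternating_nlf_training`, and the `retrieve` callback are hypothetical placeholders standing in for LLMs and a RAG retriever, not the authors' actual implementation.

```python
class StubModel:
    """Minimal stand-in for an LLM; real training would fine-tune weights."""

    def __init__(self, name):
        self.name = name
        self.updates = 0  # counts how many times fit() was called

    def generate(self, question, profile):
        # Draft a personalized answer from the question and retrieved profile.
        return f"answer({question}|{profile})"

    def critique(self, draft, profile, question):
        # Produce natural language feedback (NLF) instead of a scalar reward.
        return f"feedback on {draft} given {profile}"

    def revise(self, draft, nlf):
        # Refine the draft using the feedback text.
        return f"{draft} [revised per: {nlf}]"

    def fit(self, examples):
        # Placeholder for an optimization step on (question, profile, response) triples.
        self.updates += 1


def alternating_nlf_training(policy, feedback_model, dataset, retrieve, rounds=2):
    """Sketch of VAC-style training: each round, the feedback model critiques
    the policy's drafts, the policy revises them, and both models are updated
    on the improved responses. The returned policy needs no feedback at inference."""
    for _ in range(rounds):
        refined = []
        for user, question in dataset:
            profile = retrieve(user, question)            # RAG over the user profile
            draft = policy.generate(question, profile)
            nlf = feedback_model.critique(draft, profile, question)
            refined.append((question, profile, policy.revise(draft, nlf)))
        feedback_model.fit(refined)   # optimize the feedback model
        policy.fit(refined)           # fine-tune the policy on refined responses
    return policy
```

The key design point the sketch captures is that the feedback signal is a text string rather than a number, and that after training only the policy model is kept.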