APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
Large language models struggle to capture users’ implicit stylistic preferences, such as tone, level of detail, and formality, that are not explicitly stated, and existing evaluation methods are often confounded by semantic biases. To address this, the paper proposes the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes from response style via a hidden random mapping, so that models must infer preferences from dialogue history alone rather than from stereotypical associations. Using this framework, the authors evaluate three personalization approaches (retrieval-augmented generation (RAG), soft prompt optimization, and routing) on Llama-3.1-8B and Qwen-3.5-27B. Routing is the most reliable approach, RAG improves only with the stronger base model, and soft prompt optimization yields no significant improvement over a non-personalized baseline, which indicates that personalizing to implicit stylistic preferences remains challenging.
📝 Abstract
Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.
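The idea of a hidden, randomized mapping can be illustrated with a minimal sketch. The abstract does not specify the form of $\mathbf{C}$, so everything below is an assumption made for illustration: the attribute and principle names, the choice of entries in {-1, +1} (dislike or like), the additive scoring rule, and the function names `sample_mapping` and `preference_score` are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical attribute and principle vocabularies (not from the paper).
USER_ATTRIBUTES = ["enthusiastic", "analytical", "reserved"]
RESPONSE_PRINCIPLES = ["persuasive", "concise", "formal", "humorous"]


def sample_mapping(attributes, principles, seed):
    """Sample a random mapping C from user attributes to principle preferences.

    Assumption: C[a][p] is +1 if a user with attribute `a` prefers responses
    that follow principle `p`, and -1 otherwise. Each entry is drawn
    independently of the words' meanings, so the mapping carries no
    semantic content.
    """
    rng = random.Random(seed)
    return {a: {p: rng.choice((-1, 1)) for p in principles} for a in attributes}


def preference_score(mapping, user_attributes, response_principles):
    """Score a response by summing C over the user's attributes and the
    principles the response exhibits (an assumed, simple scoring rule)."""
    return sum(mapping[a][p] for a in user_attributes for p in response_principles)


if __name__ == "__main__":
    # Resampling with a different seed per run gives a different hidden mapping,
    # so a stereotype such as "enthusiastic users like persuasive answers"
    # is not reliably rewarded.
    for seed in range(3):
        C = sample_mapping(USER_ATTRIBUTES, RESPONSE_PRINCIPLES, seed)
        score = preference_score(C, ["enthusiastic"], ["persuasive", "formal"])
        print(f"run {seed}: score for an enthusiastic user = {score}")
```

In this sketch the model under evaluation would never see `C` directly. It would only see past conversations whose preferred responses are consistent with `C`, which is what forces preference inference from history.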
Problem

Research questions and friction points this paper is trying to address.

personalization
implicit preferences
style adaptation
LLM evaluation
preference mapping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Arbitrary Preference Mapping
LLM personalization
preference inference
unbiased evaluation
style adaptation