APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models struggle to capture users’ implicit stylistic preferences, such as tone, level of detail, and formality, that are not explicitly stated, and existing evaluation methods are often confounded by semantic biases. To address this, this work proposes the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes from response style via a hidden random mapping, so that models must infer preferences from dialogue history rather than from stereotypical associations. Using this framework, the authors evaluate retrieval-augmented generation (RAG), soft prompt optimization, and routing as personalization approaches on Llama-3.1-8B and Qwen-3.5-27B. Results show that routing is the most reliable approach, RAG improves only with the stronger base model, and soft prompt optimization offers no significant improvement over a non-personalized baseline, highlighting the persistent challenge of personalizing to implicit stylistic preferences.

📝 Abstract

Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.
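The hidden mapping $\mathbf{C}$ described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attribute and principle lists (beyond "enthusiastic" and "persuasive"), the $\pm 1$ preference encoding, the summation over a user's attributes, and the function names `sample_mapping` and `user_preferences` are all assumptions made for illustration. The sketch only shows the core idea that preferences are assigned at random, so attribute names carry no usable semantic signal.

```python
import random

# Hypothetical vocabularies: the abstract names only "enthusiastic" and
# "persuasive" as examples; the remaining entries are placeholders.
USER_ATTRIBUTES = ["enthusiastic", "cautious", "curious"]
RESPONSE_PRINCIPLES = ["persuasive", "concise", "formal", "detailed"]


def sample_mapping(attributes, principles, seed):
    """Sample a hidden mapping C (assumed encoding, not from the paper).

    For every (attribute, principle) pair, draw +1 (prefers) or -1
    (disprefers) uniformly at random, independent of what the words mean.
    """
    rng = random.Random(seed)  # a new seed per run resamples C
    return {
        attr: {p: rng.choice([-1, 1]) for p in principles}
        for attr in attributes
    }


def user_preferences(user_attrs, mapping):
    """Sum the rows of C for a user's attributes into per-principle scores."""
    principles = next(iter(mapping.values())).keys()
    return {p: sum(mapping[a][p] for a in user_attrs) for p in principles}


if __name__ == "__main__":
    C = sample_mapping(USER_ATTRIBUTES, RESPONSE_PRINCIPLES, seed=0)
    # The model under evaluation never sees C; it would have to infer these
    # scores from the user's conversation history alone.
    print(user_preferences(["enthusiastic", "curious"], C))
```

Because the mapping is drawn from a seeded generator, a fixed seed reproduces the same $\mathbf{C}$, while changing the seed across runs resamples it, which is the property the abstract relies on to rule out stereotypical associations.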

Problem

Research questions and friction points this paper is trying to address.

personalization

implicit preferences

style adaptation

LLM evaluation

preference mapping

Innovation

Methods, ideas, or system contributions that make the work stand out.

Arbitrary Preference Mapping

LLM personalization

preference inference