Teaching Language Models to Evolve with Users: Dynamic Profile Modeling for Personalized Alignment

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the cold-start problem and weak long-term personalization of LLMs in user dialogue, this paper proposes RLPA (Reinforcement Learning for Personalized Alignment). RLPA infers user profiles implicitly from dialogue and trains with a dual-level reward: a Profile Reward for accurate, dynamically updated profile inference, and a Response Reward for responses consistent with the inferred profile. Together these let the profile keep evolving over the course of a conversation. Using a user simulator and PPO optimization, RLPA is trained end to end on Qwen-2.5-3B-Instruct. Notably, it is the first method to jointly train dynamic user profiling and generative modeling, which mitigates preference conflict and drift. Experiments show that the resulting model, Qwen-RLPA, consistently outperforms prompting and offline fine-tuning baselines as well as commercial models, including Claude-3.5 and GPT-4o, on personalized dialogue tasks. It also shows gains in long-horizon consistency and robustness to conflicting preferences, and runs inference more efficiently than recent reasoning-focused LLMs.

📝 Abstract

Personalized alignment is essential for enabling large language models (LLMs) to engage effectively in user-centric dialogue. While recent prompt-based and offline optimization methods offer preliminary solutions, they fall short in cold-start scenarios and long-term personalization due to their inherently static and shallow designs. In this work, we introduce the Reinforcement Learning for Personalized Alignment (RLPA) framework, in which an LLM interacts with a simulated user model to iteratively infer and refine user profiles through dialogue. The training process is guided by a dual-level reward structure: the Profile Reward encourages accurate construction of user representations, while the Response Reward incentivizes generation of responses consistent with the inferred profile. We instantiate RLPA by fine-tuning Qwen-2.5-3B-Instruct, resulting in Qwen-RLPA, which achieves state-of-the-art performance in personalized dialogue. Empirical evaluations demonstrate that Qwen-RLPA consistently outperforms prompting and offline fine-tuning baselines, and even surpasses advanced commercial models such as Claude-3.5 and GPT-4o. Further analysis highlights Qwen-RLPA's robustness in reconciling conflicting user preferences, sustaining long-term personalization and delivering more efficient inference compared to recent reasoning-focused LLMs. These results emphasize the potential of dynamic profile inference as a more effective paradigm for building personalized dialogue systems.
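The dual-level reward described above can be illustrated with a small sketch. This is not the paper's reward definition: the scoring rules, the equal weights, and the names `profile_reward`, `response_reward`, and `combined_reward` are all assumptions chosen for illustration. The sketch only shows how one signal can score the inferred profile against a simulated user's true profile while a second signal scores the response against the inferred profile.

```python
from typing import Dict

Profile = Dict[str, str]  # attribute name -> value, e.g. {"diet": "vegan"}


def profile_reward(inferred: Profile, true: Profile) -> float:
    """Toy proxy (assumption): fraction of true attributes inferred correctly."""
    if not true:
        return 0.0
    correct = sum(1 for key, value in true.items() if inferred.get(key) == value)
    return correct / len(true)


def response_reward(response: str, inferred: Profile) -> float:
    """Toy proxy (assumption): share of inferred values the response mentions."""
    if not inferred:
        return 0.0
    text = response.lower()
    hits = sum(1 for value in inferred.values() if value.lower() in text)
    return hits / len(inferred)


def combined_reward(
    inferred: Profile,
    true: Profile,
    response: str,
    w_profile: float = 0.5,   # hypothetical weight, not from the paper
    w_response: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """Weighted sum of the two reward levels (the combination rule is assumed)."""
    return (
        w_profile * profile_reward(inferred, true)
        + w_response * response_reward(response, inferred)
    )


true_profile = {"hobby": "hiking", "diet": "vegan"}
inferred_profile = {"hobby": "hiking", "diet": "keto"}
reply = "Try a keto snack after hiking."
score = combined_reward(inferred_profile, true_profile, reply)
```

In this toy case the reply is fully consistent with the inferred profile, yet the profile itself is only half right, so the profile term pulls the combined score down. Separating the two signals in this way is what lets training penalize a wrong profile even when the response matches it.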

Problem

Research questions and friction points this paper is trying to address.

Enabling LLMs to adapt dynamically to user preferences

Overcoming cold-start and long-term personalization limitations

Improving personalized dialogue through dynamic profile modeling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning for dynamic user profiles

Dual-level reward structure for personalization

Fine-tuning Qwen-2.5-3B-Instruct for dialogue
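As a rough illustration of dynamic profile modeling, the sketch below replays a scripted simulated user and updates a profile turn by turn. It contains no LLM and no reinforcement learning; `SimulatedUser`, `run_dialogue`, and the overwrite-on-conflict rule are hypothetical stand-ins, not the paper's components. It only shows the shape of the loop: the user reveals preferences over time, and the profile is revised when a later statement conflicts with an earlier one.

```python
from typing import Dict, List, Optional, Tuple

Clue = Tuple[str, str]  # (attribute, value) revealed in one user turn


class SimulatedUser:
    """Hypothetical scripted user that reveals one preference per turn."""

    def __init__(self, script: List[Clue]) -> None:
        self._pending = list(script)

    def speak(self) -> Optional[Clue]:
        # Return the next scripted clue, or None when the script is exhausted.
        return self._pending.pop(0) if self._pending else None


def run_dialogue(user: SimulatedUser, max_turns: int) -> List[Dict[str, str]]:
    """Return a snapshot of the inferred profile after each turn."""
    profile: Dict[str, str] = {}
    snapshots: List[Dict[str, str]] = []
    for _ in range(max_turns):
        clue = user.speak()
        if clue is None:
            break
        attribute, value = clue
        # Assumed conflict rule: the most recent statement wins.
        profile[attribute] = value
        snapshots.append(dict(profile))
    return snapshots


script = [("diet", "vegan"), ("hobby", "hiking"), ("diet", "pescatarian")]
history = run_dialogue(SimulatedUser(script), max_turns=5)
```

Here the profile starts empty (the cold-start case), grows as the user reveals more, and changes its `diet` entry on the third turn when the user's stated preference drifts.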
