🤖 AI Summary
Large language models (LLMs) struggle to model users' implicit preferences in personalized interactions, especially in long-context settings, where both preference understanding and persistent memory maintenance become inefficient.
Method: We propose an Implicit Personality Learning framework coupled with a scalable agent memory system that achieves superior performance using only 2K tokens of structured memory, outperforming full-context baselines that consume 32K tokens. Leveraging a large-scale, real-world dialogue dataset we curated, we apply reinforcement fine-tuning to Qwen3-4B to develop a human-readable, incrementally growing memory module.
Contribution/Results: Our approach achieves 55% accuracy on implicit personalization tasks, surpassing GPT-5, while reducing the input token count by 16×. This significantly improves both long-horizon reasoning efficiency and personalization fidelity. Crucially, we empirically validate that lightweight, structured memory is highly effective for modeling implicit user preferences, establishing a novel paradigm for efficient, personalized AI systems.
📝 Abstract
Personalization is one of the next milestones in advancing AI capability and alignment. We introduce PersonaMem-v2, the state-of-the-art dataset for LLM personalization that simulates 1,000 realistic user-chatbot interactions on 300+ scenarios, 20,000+ user preferences, and 128k-token context windows, where most user preferences are implicitly revealed to reflect real-world interactions. Using this data, we investigate how reinforcement fine-tuning enables a model to improve its long-context reasoning capabilities for user understanding and personalization. We also develop a framework for training an agentic memory system, which maintains a single, human-readable memory that grows with each user over time.
In our experiments, frontier LLMs still struggle with implicit personalization, achieving only 37-48% accuracy. While they support long context windows, reasoning remains the bottleneck for implicit personalization tasks. Using reinforcement fine-tuning, we successfully train Qwen3-4B to outperform GPT-5, reaching 53% accuracy on implicit personalization. Moreover, our agentic memory framework achieves state-of-the-art 55% accuracy while using 16x fewer input tokens, relying on a 2k-token memory instead of full 32k conversation histories. These results underscore the impact of our dataset and demonstrate agentic memory as a scalable path toward real-world personalized intelligence.
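To make the idea of a single, human-readable memory that grows with each user concrete, here is a minimal sketch of such a memory module. All names and details (the `AgentMemory` class, the whitespace token budget, the oldest-first eviction policy) are illustrative assumptions, not the paper's actual implementation:

```python
class AgentMemory:
    """Hypothetical sketch: a compact, human-readable user memory kept under
    a fixed token budget (~2K tokens) instead of the full conversation history."""

    def __init__(self, max_tokens=2000):
        self.max_tokens = max_tokens
        self.entries = []  # human-readable preference notes, oldest first

    def _token_count(self):
        # Crude whitespace count as a stand-in for a real tokenizer.
        return sum(len(e.split()) for e in self.entries)

    def update(self, note):
        """Append a new preference note; evict oldest notes if over budget."""
        self.entries.append(note)
        while self._token_count() > self.max_tokens and len(self.entries) > 1:
            self.entries.pop(0)

    def render(self):
        """Serialize the memory into a compact prompt prefix."""
        return "\n".join(f"- {e}" for e in self.entries)


memory = AgentMemory(max_tokens=2000)
memory.update("Prefers concise answers with worked code examples")
memory.update("Works mostly in Python; avoids heavy dependencies")
print(memory.render())
```

In practice, the update step would itself be performed by the fine-tuned model (deciding what to write, merge, or discard) rather than by a fixed eviction rule; the sketch only shows how a small structured memory can replace a 32k-token history as the model's input.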