AI Summary
Current task-oriented dialogue (TOD) systems generate generic responses lacking personalization with respect to user attributes such as age and emotional state. To address this, we introduce the first personalized TOD dataset incorporating user portrait images as persona representations, and propose Pictor, a novel multimodal framework that pioneers the use of user portrait images as explicit persona inputs. Pictor integrates dialogue-policy-guided multimodal prompting with external knowledge retrieval to mitigate hallucination and enhance cross-domain generalization, while jointly optimizing image understanding and text generation. Human evaluation demonstrates significant improvements in interaction naturalness and user engagement. Moreover, the model exhibits strong robustness on unseen domains, achieving a 32% gain in personalized response quality over baselines. This work establishes a foundational paradigm for image-grounded persona modeling in TOD systems.
Abstract
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonous responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses but also demonstrates robust performance across unseen domains (https://github.com/JihyunLee1/PicPersona).