USER-VLM 360: Personalized Vision Language Models with User-aware Tuning for Social Human-Robot Interactions

πŸ“… 2025-02-15
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current vision-language models (VLMs) lack user-specific modeling for social human–robot interaction: they fail to adapt to individual behaviors, affective states, and social relationships, and conventional personalization approaches exacerbate demographic and affective biases. To address this, we propose a user-aware personalized VLM framework featuring: (i) a real-time, user-state-driven dynamic adaptation mechanism; (ii) preference-guided fairness constraints that explicitly mitigate bias across demographic and affective dimensions; and (iii) the first 360° socio-affective interaction dataset annotated with demographic, emotional, and relational metadata. The method integrates multimodal user modeling, parameter-efficient fine-tuning, and real-time cross-modal inference. Across eight benchmarks it achieves state-of-the-art results: +35.3% F1 on personalized visual question answering, +47.5% F1 on facial attribute understanding, a 15% reduction in bias on fairness metrics, and 30× faster inference. The framework is deployed in real time across diverse users on a Pepper robot.
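The summary names parameter-efficient fine-tuning driven by a real-time user state. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' implementation: a frozen linear layer from the VLM backbone receives a LoRA-style low-rank update whose strength is gated by a user-state embedding. All names here (`UserAwareLoRA`, `user_dim`, `rank`) are hypothetical.

```python
# Minimal sketch (not the authors' code) of user-aware parameter-efficient
# tuning: a LoRA-style low-rank update on a frozen VLM projection layer,
# gated by a real-time user-state embedding. All names here (UserAwareLoRA,
# user_dim, rank) are hypothetical illustrations of the summary's idea.
import torch
import torch.nn as nn

class UserAwareLoRA(nn.Module):
    def __init__(self, base: nn.Linear, user_dim: int = 32, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # backbone weights stay frozen
            p.requires_grad = False
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)       # adapter starts as a no-op
        self.gate = nn.Linear(user_dim, 1)  # user state -> adaptation strength

    def forward(self, x: torch.Tensor, user_state: torch.Tensor) -> torch.Tensor:
        scale = torch.sigmoid(self.gate(user_state))  # per-user scalar in (0, 1)
        return self.base(x) + scale * self.B(self.A(x))

layer = UserAwareLoRA(nn.Linear(768, 768))
tokens = torch.randn(2, 16, 768)  # a batch of visual-linguistic tokens
user = torch.randn(2, 1, 32)      # per-user state embeddings
print(layer(tokens, user).shape)  # torch.Size([2, 16, 768])
```

Because the base weights stay frozen and only the low-rank factors and gate are trained, the trainable parameter count stays small, which is what makes per-user adaptation tractable at interaction time.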

πŸ“ Abstract
The integration of vision-language models into robotic systems constitutes a significant advancement in enabling machines to interact with their surroundings in a more intuitive manner. While VLMs offer rich multimodal reasoning, existing approaches lack user-specific adaptability, often relying on generic interaction paradigms that fail to account for individual behavioral, contextual, or socio-emotional nuances. When customization is attempted, ethical concerns arise from unmitigated biases in user data, risking exclusion or unfair treatment. To address these dual challenges, we propose User-VLM 360°, a holistic framework integrating multimodal user modeling with bias-aware optimization. Our approach features: (1) user-aware tuning that adapts interactions in real time using visual-linguistic signals; (2) bias mitigation via preference optimization; and (3) curated 360° socio-emotive interaction datasets annotated with demographic, emotion, and relational metadata. Evaluations across eight benchmarks demonstrate state-of-the-art results: +35.3% F1 in personalized VQA, +47.5% F1 in facial feature understanding, 15% bias reduction, and 30× speedup over baselines. Ablation studies confirm component efficacy, and deployment on the Pepper robot validates real-time adaptability across diverse users. We open-source parameter-efficient 3B/10B models and an ethical verification framework for responsible adaptation.
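The abstract's "bias mitigation via preference optimization" most plausibly denotes a DPO-style objective trained on preference pairs in which the bias-mitigated response is preferred; that mapping is an assumption, and the snippet below sketches only the standard DPO loss, not the paper's exact fairness construction.

```python
# Hedged sketch of a DPO-style preference objective; the paper's exact
# fairness-preference construction is not specified here, so treating the
# bias-mitigated response as the preferred one (y_w) is an assumption.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss over summed per-sequence log-probs:
    -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])."""
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# Toy usage with random stand-in log-probabilities for 4 preference pairs.
logp = lambda: torch.randn(4)
print(dpo_loss(logp(), logp(), logp(), logp()).item())
```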
Problem

Research questions and friction points this paper is trying to address.

Personalized Vision Language Models
User-aware Tuning
Social Human-Robot Interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

User-aware tuning for real-time interactions
Bias mitigation via preference optimization
360Β° socio-emotive interaction datasets
πŸ”Ž Similar Papers
No similar papers found.