🤖 AI Summary
This study addresses the limited accuracy of visual-language demographic modeling in personalized social robot interactions. We propose a multimodal pretraining framework specifically designed for fine-grained demographic recognition. Our method introduces two key innovations: (1) a demographic-aware masked image modeling (MIM) strategy that enhances model sensitivity to subtle visual cues indicative of age, gender, and related attributes; and (2) the integration of CLIP’s contrastive learning paradigm with domain-adaptive fine-tuning, jointly optimized on a newly constructed face–demographic label bimodal dataset. Experiments demonstrate that our approach significantly improves CLIP’s generalization capability for fine-grained demographic attribute recognition. It achieves state-of-the-art performance on two novel, rigorously curated benchmarks. By delivering more robust and interpretable user profiling, the method advances the feasibility of context-aware, empathetic human–robot interaction.
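To make the masking idea concrete, here is a minimal sketch of the plain random patch masking that masked image modeling usually builds on. It is not the demographic-aware strategy described above, whose details this summary does not give. The function name `random_patch_mask`, the patch size of 16, and the mask ratio are all illustrative assumptions.

```python
import numpy as np


def random_patch_mask(image, patch_size=16, mask_ratio=0.6, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    Hypothetical helper for illustration only: plain random masking,
    not the demographic-aware variant proposed in the paper.

    image: array of shape (H, W, C); H and W must be divisible by patch_size.
    Returns (masked_image, mask), where mask has shape
    (H // patch_size, W // patch_size) and True marks a masked patch.
    """
    h, w = image.shape[:2]
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")

    grid_h, grid_w = h // patch_size, w // patch_size
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))

    # Pick which patches to hide, without replacement.
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(n_patches, size=n_masked, replace=False)
    mask = np.zeros(n_patches, dtype=bool)
    mask[masked_idx] = True
    mask = mask.reshape(grid_h, grid_w)

    # Expand the patch-level mask to pixel resolution and blank those pixels.
    pixel_mask = np.repeat(np.repeat(mask, patch_size, axis=0), patch_size, axis=1)
    masked_image = image.copy()
    masked_image[pixel_mask] = 0
    return masked_image, mask
```

In a full pretraining loop, a model would be trained to reconstruct the hidden patches from the visible ones. A demographic-aware variant would presumably bias which patches get masked, but that selection rule is not specified here.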
📝 Abstract
This paper investigates the performance of multimodal pre-trained models in user profiling tasks based on visual-linguistic demographic data. These models are critical for adapting to the needs and preferences of human users in social robotics, thereby providing personalized responses and enhancing interaction quality. First, we introduce two datasets specifically curated to represent demographic characteristics derived from user facial images. Next, we evaluate the performance of a prominent contrastive multimodal pre-trained model, CLIP, on these datasets, both in its out-of-the-box state and after fine-tuning. Initial results indicate that, without fine-tuning, CLIP performs suboptimally in matching images to demographic descriptions. Although fine-tuning significantly enhances its predictive capacity, the model still struggles to generalize to subtle demographic nuances. To address this, we propose adopting a masked image modeling strategy to improve generalization and better capture subtle demographic attributes. This approach offers a pathway for enhancing demographic sensitivity in multimodal user modeling tasks.
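The out-of-the-box evaluation described above amounts to CLIP-style matching: each image embedding is compared against the embeddings of candidate demographic descriptions, and the most similar description is selected. The sketch below shows only that scoring step on precomputed vectors. The arrays stand in for CLIP encoder outputs, and the function name `match_images_to_prompts` and the temperature value are illustrative assumptions rather than details from the paper.

```python
import numpy as np


def match_images_to_prompts(image_emb, text_emb, temperature=0.01):
    """Score images against text prompts by cosine similarity.

    Hypothetical helper: the inputs are placeholders for the outputs of
    CLIP's image and text encoders; no real model is loaded here.

    image_emb: array of shape (n_images, d)
    text_emb:  array of shape (n_prompts, d)
    Returns (predicted_prompt_index, probabilities), with probabilities
    of shape (n_images, n_prompts).
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Temperature-scaled similarity logits (temperature value is assumed).
    logits = image_emb @ text_emb.T / temperature

    # Numerically stable softmax over the prompts for each image.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs
```

With a real model, `text_emb` would come from encoding one description per candidate label, and `image_emb` from encoding the face images. Fine-tuning changes the encoders that produce these vectors, not the matching rule itself.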