🤖 AI Summary
Existing Social-IQ models over-rely on the language modality while neglecting visual context, and are constrained by closed-ended multiple-choice question (MCQ) formats that hinder rigorous validation of reasoning pathways. To address these limitations, we propose VEGAS, a visually grounded, visually interpretable, generative multimodal model. Methodologically, VEGAS introduces: (1) a novel vision-relevant frame sampling strategy to enhance utilization of salient visual cues; (2) a Generalist Instruction Fine-Tuning (GIFT) framework enabling cross-modal representation learning and joint reasoning over emotions and social traits; and (3) open-ended answer generation, replacing MCQs, to explicitly expose intermediate reasoning steps and thereby improve interpretability and trustworthiness. Experiments demonstrate that VEGAS significantly improves visual information utilization on Social-IQ, with open-response evaluation confirming more principled reasoning. Moreover, it maintains state-of-the-art MCQ accuracy, empirically validating the efficacy of the vision-grounded explanatory paradigm.
📝 Abstract
Social Intelligence Queries (Social-IQ) serve as the primary multimodal benchmark for evaluating a model's social intelligence. While current solutions achieve impressive multiple-choice question (MCQ) accuracy, increasing evidence shows that they are largely, and in some cases entirely, dependent on the language modality, overlooking visual context. Additionally, the closed-set nature of MCQs prevents exploring whether, and to what extent, the reasoning path behind a selection is correct. To address these limitations, we propose the Visually Explainable and Grounded Artificial Social Intelligence (VEGAS) model. As a generative multimodal model, VEGAS leverages open-ended answering to provide explainable responses, which enhances the clarity and evaluation of reasoning paths. To enable visually grounded answering, we propose a novel sampling strategy that provides the model with more relevant visual frames. We then enhance the model's interpretation of these frames through Generalist Instruction Fine-Tuning (GIFT), which aims to: i) learn multimodal-language transformations for fundamental emotional social traits, and ii) establish multimodal joint reasoning capabilities. Extensive experiments, comprising modality ablations, open-ended assessments, and supervised MCQ evaluations, consistently show that VEGAS effectively utilizes visual information in reasoning to produce correct and credible answers. We expect this work to offer a new perspective on Social-IQ and advance the development of human-like social AI.