SocialFusion: Addressing Social Degradation in Pre-trained Vision-Language Models

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Pretrained vision-language models (VLMs) often suffer from "social degradation," a phenomenon in which generic vision-language pretraining systematically impairs the visual encoder's capacity to represent subtle social interaction cues, leading to negative transfer in multi-task social perception. This work proposes SocialFusion, the first framework to quantitatively uncover the mechanism behind this decay: via linear probing and gradient conflict analysis, it demonstrates a systematic decline in the social information decodability of visual backbones during pretraining. To mitigate this, SocialFusion freezes the visual encoder and introduces a lightweight fusion module that bridges the frozen visual features to the language model, enabling positive transfer without updating any visual parameters. Evaluated on five major social understanding benchmarks, including SocialIQ and VQAv2-Social, SocialFusion achieves performance that is state-of-the-art or competitive with task-specific models, significantly outperforming existing unified approaches.
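The paper's probing setup is not released here; a minimal sketch of linear probing for decodability, using a least-squares linear probe on frozen features (the feature dimensions, label setup, and helper name are illustrative assumptions, not the paper's protocol), might look like:

```python
import numpy as np

def linear_probe_accuracy(features, labels, test_frac=0.2, seed=0):
    """Fit a linear probe on frozen features via least squares and report
    held-out accuracy; higher accuracy = attribute is more linearly decodable."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)
    n_te = int(n * test_frac)
    te, tr = idx[:n_te], idx[n_te:]
    X = np.hstack([features, np.ones((n, 1))])  # append a bias column
    y = labels * 2.0 - 1.0                      # map {0, 1} -> {-1, +1}
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    preds = (X[te] @ w > 0).astype(int)
    return (preds == labels[te]).mean()

# Toy check: features where one direction carries the label are highly
# decodable; pure noise features sit near chance level.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400)
noise = rng.normal(size=(400, 64))
cue = noise.copy()
cue[:, 0] += 3.0 * y  # inject a linearly decodable "social cue"
print(linear_probe_accuracy(cue, y))    # well above chance
print(linear_probe_accuracy(noise, y))  # near 0.5
```

A drop in probe accuracy over the course of pretraining, with the probe retrained at each checkpoint, is the kind of signal the summary describes as declining decodability.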

📝 Abstract
Understanding social interactions from visual cues is a fundamental challenge for socially competent AI. While powerful pre-trained vision-language models (VLMs) have shown remarkable general capabilities, they surprisingly struggle to learn multiple social perception tasks in a unified way, often exhibiting negative transfer. We identify that this negative transfer stems from a critical issue we term "social degradation," whereby the general visual-linguistic pre-training of VLMs impairs the visual encoder's ability to represent nuanced social information. We investigate this behavior through two lenses: decodability, via linear representation probing, and compatibility, via gradient conflict analysis. Both contribute to the degradation, especially the former, which is significantly compromised during VLM pre-training. To address these issues, we propose SocialFusion, a unified framework that learns a minimal connection between a frozen visual encoder and a language model. Unlike existing VLMs, it exhibits positive transfer across all five social tasks, leveraging synergies between them to improve overall performance, and it achieves performance comparable to task-specific state-of-the-art models on various benchmarks. Our findings suggest that current VLM pre-training strategies may be detrimental to acquiring general social competence and highlight the need for more socially aware training paradigms.
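The abstract's gradient conflict analysis is not specified in detail here; a common proxy (an assumption on our part, not necessarily the paper's exact method) is the cosine similarity between per-task gradients on shared parameters, with negative values indicating conflicting update directions:

```python
import numpy as np

def gradient_conflict(grad_a, grad_b):
    """Cosine similarity between two tasks' gradient vectors on shared
    parameters; values below zero indicate conflicting update directions."""
    a, b = np.ravel(grad_a), np.ravel(grad_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: exactly opposed gradients conflict; identical ones align.
g_task1 = np.array([1.0, -2.0, 0.5])
g_task2 = np.array([-1.0, 2.0, -0.5])
print(gradient_conflict(g_task1, g_task2))  # ~ -1.0 (fully conflicting)
print(gradient_conflict(g_task1, g_task1))  # ~ +1.0 (fully aligned)
```

In a multi-task setting, frequent negative similarities between social-task gradients would be evidence of the compatibility problem the abstract describes.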
Problem

Research questions and friction points this paper is trying to address.

Addresses social degradation in vision-language models
Investigates negative transfer in social perception tasks
Proposes a unified framework for improved social competence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework connecting frozen visual encoder and language model
Addresses social degradation via minimal connection learning
Enhances synergy across multiple social perception tasks
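The fusion module's architecture is not detailed in this summary; a minimal sketch of the overall idea, projecting frozen visual features into the language model's embedding space through a small trainable bridge (all class names, shapes, and dimensions below are illustrative assumptions), could look like:

```python
import numpy as np

class FusionBridge:
    """Toy stand-in for a SocialFusion-style connector: only this projection
    would be trained; the visual encoder and language model stay frozen."""

    def __init__(self, vis_dim, lm_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.02, size=(vis_dim, lm_dim))
        self.b = np.zeros(lm_dim)

    def __call__(self, frozen_visual_features):
        # Map frozen patch features into the LM's token-embedding space.
        return frozen_visual_features @ self.W + self.b

# Illustrative shapes: 196 patch tokens of dim 768 -> LM embedding dim 1024.
bridge = FusionBridge(vis_dim=768, lm_dim=1024)
visual_tokens = np.zeros((196, 768))  # would come from the frozen encoder
lm_inputs = bridge(visual_tokens)
print(lm_inputs.shape)  # (196, 1024)
```

Keeping the visual encoder frozen sidesteps the decodability loss identified in the analysis, since the social information already present in the backbone is never overwritten by multi-task updates.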