Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses modality-role interference (MRI) in multimodal role-playing agents, where generic, character-agnostic visual features conflict with character-specific traits. It presents what the summary describes as the first systematic identification and mitigation of MRI through CAVI (Character-Aware Visual Intervention), a training-free framework that enforces role consistency via three complementary mechanisms: Character-Guided Token Pruning (CTP) at the macro level, Orthogonal Feature Modulation (OFM) at the micro level, and Modality-Adaptive Role Steering (MARS) during decoding. By aligning visual and textual inputs with the target persona without additional training, CAVI is reported to alleviate MRI and to improve both role fidelity and interaction quality.

📝 Abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.
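The three mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the paper's implementation: the abstract gives no formulas, so the cosine-similarity scoring in `prune_tokens`, the Gram-Schmidt construction of the character-context subspace, and the linear schedule in `steering_strength` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

def prune_tokens(tokens, character_vec, keep_ratio=0.5):
    """CTP-style step (assumed scoring): keep the visual tokens most
    similar to a character embedding, preserving their original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cosine(tokens[i], character_vec),
                    reverse=True)
    return [tokens[i] for i in sorted(ranked[:k])]

def orthonormal_basis(vectors, eps=1e-9):
    """Gram-Schmidt over character-context vectors; near-dependent
    vectors are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > eps:
            basis.append([wi / n for wi in w])
    return basis

def project_onto_subspace(token, basis):
    """OFM-style step (assumed form): orthogonal projection of one token
    onto the span of an orthonormal basis."""
    out = [0.0] * len(token)
    for b in basis:
        c = dot(token, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out

def steering_strength(visual_attention_mass, base=1.0):
    """MARS-style step (assumed schedule): scale steering linearly with
    the share of attention placed on visual tokens, clamped to [0, 1]."""
    reliance = min(1.0, max(0.0, visual_attention_mass))
    return base * reliance

# Toy usage with 3-dimensional "embeddings".
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
kept = prune_tokens(tokens, character_vec=[1.0, 0.0, 0.0], keep_ratio=0.5)
basis = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
projected = project_onto_subspace([2.0, 3.0, 4.0], basis)
```

In this toy run, pruning keeps the two tokens closest to the character vector, and the projection discards the component of the token that lies outside the two-dimensional character-context subspace. Whether the actual method steers more or less strongly as visual reliance grows is not stated in the abstract, so the direction of the schedule above is a guess.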

Problem

Research questions and friction points this paper is trying to address.

Modality-Role Interference

Multimodal Role-Playing Agent

Character Consistency

Visual Grounding

Multimodal Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Role Interference

Character-Aware Visual Intervention

Token Pruning