Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation

📅 2025-10-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: In training-free video grounding, the raw attention maps of multimodal large language models (MLLMs) are noisy and poorly aligned with target objects. Method: The authors propose DecAF (Decomposed Attention Fusion), which recasts video reasoning segmentation as a video question-answering task and extracts MLLM attention maps via attention rollout. Two fusion mechanisms, contrastive object-background fusion and complementary video-frame fusion, refine these maps so that they can be converted directly into coarse segmentation masks; attention-guided prompting of SAM2 then produces fine-grained masks. No fine-tuning or retraining of the MLLM is involved. Contribution/Results: On both referring and reasoning video object segmentation (VOS) benchmarks, DecAF outperforms training-free methods and achieves performance comparable to training-based methods.

📝 Abstract

Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via an attention rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
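
The abstract mentions extracting attention maps via rollout but does not spell out the exact variant used. As a rough illustration only, the sketch below shows the commonly used form of attention rollout: average the heads in each layer, mix in the identity to account for residual connections, renormalize the rows, and multiply the per-layer matrices together. The function name `attention_rollout`, the 0.5 residual weight, and the input layout are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def attention_rollout(attentions, residual_weight=0.5):
    """Generic attention rollout (illustrative; not necessarily DecAF's variant).

    attentions: list of arrays, one per layer, each shaped
                (heads, tokens, tokens) with rows summing to 1.
    Returns a (tokens, tokens) matrix whose row i estimates how much
    each input token contributes to token i after all layers.
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attention in attentions:
        # Fuse heads by simple averaging (an assumption of this sketch).
        fused = layer_attention.mean(axis=0)
        # Mix in the identity to model the residual connection.
        fused = (1.0 - residual_weight) * fused + residual_weight * identity
        # Renormalize so every row remains a probability distribution.
        fused = fused / fused.sum(axis=-1, keepdims=True)
        # Compose with the rollout accumulated from earlier layers.
        rollout = fused @ rollout
    return rollout
```

In a video QA setting, one would presumably take the rollout row of a text or answer token, keep only the columns that correspond to visual tokens, and reshape them to the frame's patch grid to obtain a raw attention map; the token indices needed for that depend on the specific MLLM and are not given here.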

Problem

Research questions and friction points this paper is trying to address.

Refining noisy attention maps for video object segmentation

Enabling training-free adaptation of MLLMs for localization tasks

Generating precise segmentation masks without model retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposed Attention Fusion refines noisy attention maps

Contrastive fusion suppresses irrelevant background activations

Training-free method converts attention to segmentation masks
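
The points above can be sketched in a few lines. The summary does not give the fusion formula, so this example assumes one plausible form of contrastive fusion: normalize an object-query attention map and a background-query attention map, subtract the second from the first, and clamp negatives to zero. The helper names (`minmax`, `contrastive_fusion`, `coarse_mask`, `peak_point`), the 0.5 threshold, and the use of the attention peak as a point prompt are all hypothetical choices for illustration; the actual SAM2 call is omitted.

```python
import numpy as np


def minmax(attention, eps=1e-8):
    """Rescale a map to approximately [0, 1]."""
    shifted = attention - attention.min()
    return shifted / (shifted.max() + eps)


def contrastive_fusion(object_attention, background_attention):
    """Assumed form: suppress regions that the background query also attends to."""
    difference = minmax(object_attention) - minmax(background_attention)
    return minmax(np.clip(difference, 0.0, None))


def coarse_mask(attention, threshold=0.5):
    """Binarize a refined attention map into a coarse segmentation mask."""
    return attention >= threshold


def peak_point(attention):
    """Return the (x, y) location of the strongest activation.

    A point like this could serve as a positive prompt for a promptable
    segmenter such as SAM2 (the prompting call itself is not shown).
    """
    row, col = np.unravel_index(np.argmax(attention), attention.shape)
    return int(col), int(row)
```

Subtraction is only one way to contrast the two maps; a ratio or a learned-free gating would fit the same description, and the paper should be consulted for the formulation DecAF actually uses.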

🔎 Similar Papers

ViLLa: Video Reasoning Segmentation with Large Language Model