F-LMM: Grounding Frozen Large Multimodal Models

📅 2024-06-09
🏛️ arXiv.org
📈 Citations: 10
Influential: 1
🤖 AI Summary
How can large multimodal models (LMMs) be endowed with precise visual grounding without compromising their native conversational capabilities? This paper proposes a lightweight visual grounding method that requires no fine-tuning of the LMM itself, instead leveraging the word-pixel correlations already encoded in the frozen LMM's self-attention maps. A trainable multi-scale CNN maps attention weights to coarse segmentation masks, which are subsequently refined using the Segment Anything Model (SAM). Crucially, the LMM's parameters remain entirely frozen, preserving its original question-answering, instruction-following, and chain-of-thought reasoning abilities. The approach achieves efficient, high-quality referring expression segmentation and panoptic narrative grounding, without introducing segmentation tokens or relying on costly, high-quality grounding annotations. According to the authors, this is the first method to retain full LMM conversational functionality while attaining competitive performance on both referring expression segmentation and panoptic narrative grounding benchmarks.
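The pipeline summarised above (word-to-patch attention weights → coarse mask → SAM-based refinement) can be sketched roughly as follows. All names, array shapes, and the simple thresholding step are illustrative assumptions, not the paper's actual implementation, which uses a trainable multi-scale CNN to predict mask logits and SAM to refine them:

```python
import numpy as np

def attention_to_mask(attn_maps, phrase_token_ids, threshold=0.5):
    """Hypothetical sketch: turn a frozen LMM's attention into a coarse mask.

    attn_maps: array of shape (num_layers, num_text_tokens, H, W) holding
        attention weights from text tokens to image patches (shape and
        layout are illustrative assumptions, not F-LMM's actual tensors).
    phrase_token_ids: indices of the tokens in the referring phrase.
    Returns a boolean (H, W) mask for the referred object.
    """
    # Aggregate over layers and over the phrase's tokens -> (H, W) saliency.
    phrase_attn = attn_maps[:, phrase_token_ids].mean(axis=(0, 1))

    # Normalise the saliency map to [0, 1].
    lo, hi = phrase_attn.min(), phrase_attn.max()
    saliency = (phrase_attn - lo) / (hi - lo + 1e-8)

    # F-LMM feeds such maps through a few trainable CNN layers to get mask
    # logits, then refines with SAM; here we merely threshold to illustrate
    # that the attention already carries word-pixel correspondence.
    return saliency > threshold

# Usage with random stand-in attention (4 layers, 6 text tokens, 16x16 patches):
rng = np.random.default_rng(0)
attn = rng.random((4, 6, 16, 16))
mask = attention_to_mask(attn, [1, 2])
```

The key design point the sketch mirrors is that the LMM contributes only frozen attention weights; everything trainable sits downstream, which is why the LMM's conversational behaviour is untouched.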

📝 Abstract
Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AIs' understanding of the visual world and their interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit grounding and segmentation datasets. Such a design would inevitably cause a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanishing general knowledge comprehension and weakened instruction following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanism of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits, which a SAM-based mask refiner can further optimise. Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, but achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, F-LMM can be directly applied to complex tasks like reasoning segmentation, grounded conversation generation and visual chain-of-thought reasoning. Our code can be found at https://github.com/wusize/F-LMM.
Problem

Research questions and friction points this paper is trying to address.

Enhancing visual grounding without fine-tuning LMMs
Preserving conversational ability while adding segmentation
Achieving competitive performance without specialized tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen LMMs for visual grounding
Translates word-pixel attention to masks
Preserves conversational ability with CNNs