🤖 AI Summary
How can large multimodal models (LMMs) be endowed with precise visual grounding without compromising their native conversational capabilities? This paper proposes a lightweight, fine-tuning-free visual grounding method that leverages the intrinsic word-pixel correlations encoded in a frozen LMM's self-attention maps. A trainable multi-scale CNN translates attention weights into coarse segmentation masks, which are then refined with the Segment Anything Model (SAM). Crucially, the LMM's parameters remain entirely frozen, preserving its original question-answering, instruction-following, and chain-of-thought reasoning abilities. The approach achieves efficient, high-quality referring expression segmentation and panoptic narrative grounding without introducing segmentation tokens or relying on costly, high-quality grounding annotations. According to the authors, this is the first method to retain full LMM functionality while achieving competitive performance on both referring expression segmentation and panoptic narrative grounding benchmarks.
📝 Abstract
Endowing Large Multimodal Models (LMMs) with visual grounding capability can significantly enhance AI's understanding of the visual world and its interaction with humans. However, existing methods typically fine-tune the parameters of LMMs to learn additional segmentation tokens and overfit to grounding and segmentation datasets. Such a design inevitably causes a catastrophic diminution in the indispensable conversational capability of general AI assistants. In this paper, we comprehensively evaluate state-of-the-art grounding LMMs across a suite of multimodal question-answering benchmarks, observing drastic performance drops that indicate vanished general knowledge comprehension and weakened instruction-following ability. To address this issue, we present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations -- a straightforward yet effective design based on the fact that word-pixel correspondences conducive to visual grounding inherently exist in the attention mechanisms of well-trained LMMs. Using only a few trainable CNN layers, we can translate word-pixel attention weights into mask logits, which a SAM-based mask refiner can further optimise. F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data, yet achieves competitive performance on referring expression segmentation and panoptic narrative grounding benchmarks while completely preserving LMMs' original conversational ability. Additionally, with instruction-following ability preserved and grounding ability obtained, F-LMM can be directly applied to complex tasks such as reasoning segmentation, grounded conversation generation and visual chain-of-thought reasoning. Our code can be found at https://github.com/wusize/F-LMM.
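The core attention-to-mask idea can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration rather than the authors' implementation: it averages multi-head word-to-patch attention and applies a fixed 3×3 box smoothing as a crude stand-in for the paper's trainable CNN layers, and it omits the SAM-based refinement stage entirely; all shapes and the function name are assumptions.

```python
import numpy as np

def attention_to_mask(attn: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """attn: (num_heads, H, W) word-to-patch attention weights taken from a
    frozen LMM and reshaped onto the vision-token grid (shapes assumed).
    Returns a coarse binary mask of shape (H, W)."""
    # Fuse attention heads by simple averaging (F-LMM instead learns
    # this fusion with a small trainable CNN).
    m = attn.mean(axis=0)
    # 3x3 box smoothing via edge padding -- a fixed stand-in for the
    # learned convolutional layers that produce mask logits.
    p = np.pad(m, 1, mode="edge")
    h, w = m.shape
    smooth = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    # Normalise to [0, 1] and threshold into a coarse mask; in the paper
    # this coarse mask would then be refined by SAM.
    smooth = (smooth - smooth.min()) / (np.ptp(smooth) + 1e-8)
    return smooth > threshold

# Toy usage: 4 heads attending to a 3x3 region of an 8x8 token grid.
attn = np.zeros((4, 8, 8))
attn[:, 2:5, 2:5] = 1.0
mask = attention_to_mask(attn)
print(mask.shape)  # (8, 8) boolean mask covering the attended region
```

The point of the sketch is only that attention maps already localise the referred object; the trainable components in the actual method replace the hand-set averaging and smoothing with learned weights while the LMM itself stays frozen.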