LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

📅 2025-07-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large multi-modal models (LMMs) suffer from inaccurate segmentation and hallucinated comprehension in segmentation-and-understanding tasks, primarily due to weak visual understanding and insufficient fine-grained perception. To address this, the paper proposes LIRA, a framework that exploits the complementary relationship between visual comprehension and segmentation via two components: (1) a Semantic-Enhanced Feature Extractor (SEFE) that fuses semantic and pixel-level features to improve object attribute inference and segmentation accuracy; (2) an Interleaved Local Visual Coupling (ILVC) mechanism that autoregressively generates local region descriptions from features extracted with segmentation masks, providing fine-grained supervision that mitigates hallucination. The authors further find that the precision of object segmentation correlates positively with the latent related semantics of the `<seg>` token, and introduce the Attributes Evaluation (AttrEval) dataset to quantify this relationship and the model's semantic inference ability. Experiments show state-of-the-art performance on both segmentation and comprehension benchmarks.

📝 Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
Problem

Research questions and friction points this paper is trying to address.

Inaccurate segmentation in large multi-modal models
Hallucinated comprehension due to weak visual understanding
Lack of fine-grained perception in segmentation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Enhanced Feature Extractor for accurate segmentation
Interleaved Local Visual Coupling for fine-grained supervision
Attributes Evaluation dataset to quantify semantic inference
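The two innovations above can be illustrated with a small numpy sketch: `fuse_semantic_pixel` stands in for SEFE-style fusion of high-level semantic features with pixel-level features, and `mask_pool` stands in for the ILVC step of extracting a local region feature from a segmentation mask. All function names, tensor shapes, and the fusion-by-addition choice are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def fuse_semantic_pixel(sem_feat, pix_feat, w_proj):
    """SEFE-style fusion (hypothetical sketch): project a high-level
    semantic vector into the pixel-feature space and add it to every
    spatial location of the pixel feature map."""
    # sem_feat: (D_sem,), pix_feat: (H, W, D_pix), w_proj: (D_sem, D_pix)
    projected = sem_feat @ w_proj      # (D_pix,)
    return pix_feat + projected        # broadcasts over H and W

def mask_pool(feat_map, mask):
    """ILVC-style local feature extraction (hypothetical sketch):
    average the features inside a segmentation mask to obtain a
    region descriptor for local description generation."""
    # feat_map: (H, W, D), mask: (H, W) boolean
    if not mask.any():
        return np.zeros(feat_map.shape[-1])
    return feat_map[mask].mean(axis=0)

# Toy example with made-up dimensions.
rng = np.random.default_rng(0)
pix = rng.normal(size=(4, 4, 8))       # pixel-level features
sem = rng.normal(size=(16,))           # semantic feature vector
W = rng.normal(size=(16, 8))           # projection matrix
fused = fuse_semantic_pixel(sem, pix, W)

mask = np.zeros((4, 4), dtype=bool)    # a segmentation mask
mask[:2, :2] = True
region = mask_pool(fused, mask)        # one descriptor per region
```

In the actual model these operations would act on learned transformer features, and the region descriptor would condition autoregressive generation of the local description; the sketch only shows the data flow.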
👥 Authors
Zhang Li
Huazhong University of Science and Technology
Biao Yang
Shanghai Jiao Tong University, Antai College of Economics and Management
Qiang Liu
Kingsoft Office
Shuo Zhang
Huazhong University of Science and Technology
Zhiyin Ma
Huazhong University of Science and Technology
Liang Yin
Huazhong University of Science and Technology
Linger Deng
Huazhong University of Science and Technology
Yabo Sun
Kingsoft Office
Yuliang Liu
Huazhong University of Science and Technology
Xiang Bai
Huazhong University of Science and Technology (HUST)