Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In referring image segmentation, insufficient cross-modal feature interaction arises when the vision and language encoders are not pre-aligned. To address this, we propose DETRIS, a framework that strengthens alignment and interaction between the vision and language modalities. Methodologically, DETRIS introduces (1) a dense cross-layer connection architecture that propagates low-rank visual features from all preceding layers into deeper Transformer layers, and (2) a text adapter that is optimized jointly with the vision adapter for cross-modal alignment. Leveraging low-rank decomposition, this parameter-efficient fine-tuning updates only 0.9%–1.8% of the backbone parameters. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate consistent and significant improvements over state-of-the-art methods, with DETRIS showing especially strong robustness and adaptability when the encoders are misaligned. The work establishes a scalable, alignment-aware paradigm for multimodal parameter-efficient tuning (PET), advancing efficient and effective cross-modal representation learning.

📝 Abstract
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it reduces transfer learning costs and makes better use of hardware. However, current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at https://github.com/jiaqihuang01/DETRIS.
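The densely connected low-rank tuning idea in the abstract can be sketched in a few lines. The NumPy mock-up below is purely illustrative and is not the authors' implementation: the dimensions, the ReLU nonlinearity, the zero-initialized up-projections, and the mean aggregation over preceding layers are all assumptions chosen to show the general pattern of "each layer's adapter sees features from all earlier layers."

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_layers = 64, 8, 4  # hidden dim, low rank, number of adapted layers

def make_adapter():
    """One low-rank adapter: down-project d -> r, up-project r -> d.

    Zero-initializing the up-projection (a common LoRA-style choice, assumed
    here) makes the adapter a no-op at the start of fine-tuning.
    """
    A = rng.normal(scale=0.02, size=(d, r))  # trainable down-projection
    B = np.zeros((r, d))                     # trainable up-projection, zero-init
    return A, B

adapters = [make_adapter() for _ in range(n_layers)]

def adapter_out(x, A, B):
    # Low-rank bottleneck with a ReLU in between (assumed nonlinearity).
    return np.maximum(x @ A, 0.0) @ B

def dense_pet_forward(x):
    """Densely connected tuning: the adapter at each layer consumes an
    aggregate (here: the mean) of the features from ALL preceding layers,
    rather than only the immediately previous one, and adds a residual."""
    states = [x]
    for A, B in adapters:
        fused = np.mean(states, axis=0)                       # dense connection
        states.append(states[-1] + adapter_out(fused, A, B))  # residual update
    return states[-1]

x = rng.normal(size=(2, d))
y = dense_pet_forward(x)

# Only the adapters would be trained: n_layers * 2 * d * r parameters,
# a small fraction of a typical backbone, in the spirit of the paper's
# reported 0.9%-1.8% of backbone parameters.
adapter_params = sum(A.size + B.size for A, B in adapters)
```

With the zero-initialized up-projections, `y` equals `x` exactly, which is the usual starting point before the adapter weights are trained.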
Problem

Research questions and friction points this paper is trying to address.

Parameter Efficient Tuning
Mismatched Modalities
Alignment Difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

DETRIS
Parameter-Efficient Tuning
Visual-Textual Information Flow
Jiaqi Huang
University of Central Missouri
Cybersecurity, IoV
Zunnan Xu
Tsinghua University
Computer Vision, Machine Learning
Ting Liu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yong Liu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Haonan Han
PhD Candidate, Tsinghua University
Computer Vision, Multimodal Generation
Kehong Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University
Xiu Li
Bytedance Seed
Computer Vision, Computer Graphics, 3D Vision