Densely Connected Parameter-Efficient Tuning for Referring Image Segmentation

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
In referring image segmentation, insufficient cross-modal feature interaction arises when the vision and language encoders are not pre-aligned. To address this, we propose DETRIS, a framework that strengthens alignment and interaction between the vision and language modalities. Methodologically, DETRIS introduces (1) a dense cross-layer connection architecture that propagates low-rank visual features from all preceding layers into deeper Transformer layers, and (2) a text adapter that is optimized jointly with the vision adapter for cross-modal alignment. Leveraging low-rank decomposition, this parameter-efficient fine-tuning updates only 0.9%–1.8% of the backbone parameters. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate consistent and significant improvements over state-of-the-art methods, with DETRIS showing especially strong robustness and adaptability when the encoders are misaligned. The work establishes a scalable, alignment-aware paradigm for multimodal parameter-efficient tuning (PET), advancing efficient and effective cross-modal representation learning.

📝 Abstract
In the domain of computer vision, Parameter-Efficient Tuning (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly favored for its effectiveness in large foundation models, as it reduces transfer learning costs and makes better use of hardware. However, current PET methods are mainly designed for single-modal optimization. While some pioneering studies have undertaken preliminary explorations, they remain at the level of aligned encoders (e.g., CLIP) and lack exploration of misaligned encoders. These methods show sub-optimal performance with misaligned encoders, as they fail to effectively align the multimodal features during fine-tuning. In this paper, we introduce DETRIS, a parameter-efficient tuning framework designed to enhance low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, which enables effective cross-modal feature interaction and adaptation to misaligned encoders. We also suggest using text adapters to improve textual features. Our simple yet efficient approach greatly surpasses state-of-the-art methods with 0.9% to 1.8% backbone parameter updates, evaluated on challenging benchmarks. Our project is available at https://github.com/jiaqihuang01/DETRIS.
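The densely connected low-rank tuning idea in the abstract can be sketched in a few lines. The NumPy mock-up below is purely illustrative and is not the authors' implementation: the dimensions, the ReLU nonlinearity, the zero-initialized up-projections, and the mean aggregation over preceding layers are all assumptions chosen to show the general pattern of "each layer's adapter sees features from all earlier layers."

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_layers = 64, 8, 4  # hidden dim, low rank, number of adapted layers

def make_adapter():
    """One low-rank adapter: down-project d -> r, up-project r -> d.

    Zero-initializing the up-projection (a common LoRA-style choice, assumed
    here) makes the adapter a no-op at the start of fine-tuning.
    """
    A = rng.normal(scale=0.02, size=(d, r))  # trainable down-projection
    B = np.zeros((r, d))                     # trainable up-projection, zero-init
    return A, B

adapters = [make_adapter() for _ in range(n_layers)]

def adapter_out(x, A, B):
    # Low-rank bottleneck with a ReLU in between (assumed nonlinearity).
    return np.maximum(x @ A, 0.0) @ B

def dense_pet_forward(x):
    """Densely connected tuning: the adapter at each layer consumes an
    aggregate (here: the mean) of the features from ALL preceding layers,
    rather than only the immediately previous one, and adds a residual."""
    states = [x]
    for A, B in adapters:
        fused = np.mean(states, axis=0)                       # dense connection
        states.append(states[-1] + adapter_out(fused, A, B))  # residual update
    return states[-1]

x = rng.normal(size=(2, d))
y = dense_pet_forward(x)

# Only the adapters would be trained: n_layers * 2 * d * r parameters,
# a small fraction of a typical backbone, in the spirit of the paper's
# reported 0.9%-1.8% of backbone parameters.
adapter_params = sum(A.size + B.size for A, B in adapters)
```

With the zero-initialized up-projections, `y` equals `x` exactly, which is the usual starting point before the adapter weights are trained.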
Problem

Research questions and friction points this paper is trying to address.

Parameter Efficient Tuning
Mismatched Modalities
Alignment Difficulty
Innovation

Methods, ideas, or system contributions that make the work stand out.

DETRIS
Parameter-Efficient Tuning
Visual-Textual Information Flow
Jiaqi Huang
University of Central Missouri
Cybersecurity, IoV
Zunnan Xu
Tsinghua University
Computer Vision, Machine Learning
Ting Liu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Yong Liu
Tsinghua Shenzhen International Graduate School, Tsinghua University
Haonan Han
PhD Candidate, Tsinghua University
Computer Vision, Multimodal Generation
Kehong Yuan
Tsinghua Shenzhen International Graduate School, Tsinghua University
Xiu Li
Bytedance Seed
Computer Vision, Computer Graphics, 3D Vision