DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Few-shot semantic segmentation (FSS) suffers from poor generalization to novel categories, primarily because support images rarely cover the target class's full appearance diversity, leading to biased feature representations. To address this, we propose a multimodal prompt fusion framework. First, we leverage a multimodal large language model (MLLM) to generate class-level semantic descriptions and introduce learnable semantic tokens for query-adaptive prompting. Second, we design a dual-path prompt mechanism that jointly models linguistic priors and dense pixel-wise visual matching between query and support images. Finally, a prompt-driven decoder enables precise segmentation prediction. Our method achieves state-of-the-art performance on the Pascal-5ⁱ and COCO-20ⁱ benchmarks, demonstrating significant improvements in generalization to unseen categories and robustness across diverse scenes.

📝 Abstract
Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes large language models (LLMs) to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-5ⁱ and COCO-20ⁱ demonstrate that our framework achieves state-of-the-art performance by a significant margin, demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at https://github.com/aminpdik/DSV-LFS
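The "visual prompt" described above comes from dense pixel-wise matching between query and support features. A minimal sketch of that idea, assuming flattened feature maps and cosine similarity (the paper's exact matching module and feature extractor are not specified here), might look like:

```python
import numpy as np

def visual_prior(query_feats, support_feats, support_mask):
    """Dense pixel-wise matching between query and support features.

    query_feats:   (Hq*Wq, C) query feature vectors
    support_feats: (Hs*Ws, C) support feature vectors
    support_mask:  (Hs*Ws,)   binary foreground mask for the support image

    Returns a (Hq*Wq,) prior map: each query pixel's best cosine
    similarity to any foreground support pixel.
    """
    # L2-normalize so dot products become cosine similarities
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    s = support_feats / np.linalg.norm(support_feats, axis=1, keepdims=True)
    sim = q @ s.T                      # (Hq*Wq, Hs*Ws) similarity matrix
    sim[:, support_mask == 0] = -1.0   # ignore background support pixels
    return sim.max(axis=1)             # best foreground match per query pixel
```

Query pixels that resemble the annotated support foreground score near 1, giving the decoder a coarse localization cue alongside the semantic prompt.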
Problem

Research questions and friction points this paper is trying to address.

FSS models generalize poorly to novel classes when support images fail to capture the target class's full appearance variability.
A handful of labeled examples yields incomplete, biased feature representations.
Existing pipelines leave class-level semantic knowledge from language models largely untapped.
Innovation

Methods, ideas, or system contributions that make the work stand out.

A multimodal LLM, via an added vocabulary token, adapts class-level descriptions into a query-specific "semantic prompt"
A dense pixel-wise matching module between query and support images produces a complementary "visual prompt"
Both prompts jointly guide a prompt-based decoder for accurate query segmentation
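The vocabulary-extension idea behind the semantic prompt can be illustrated with a toy sketch. This is not the paper's implementation: the token name `<SEG>` and the mean-initialization heuristic for the new embedding row are assumptions for illustration only.

```python
import numpy as np

def add_seg_token(vocab, embeddings, token="<SEG>"):
    """Extend an LLM-style vocabulary and embedding table with one
    dedicated segmentation token (hypothetical name "<SEG>").

    vocab:      dict mapping token string -> integer id
    embeddings: (V, D) embedding matrix, one row per vocabulary entry

    Returns the updated (vocab, embeddings); the hidden state produced
    at this token's position would serve as the "semantic prompt".
    """
    if token in vocab:
        return vocab, embeddings
    vocab = dict(vocab)
    vocab[token] = len(vocab)
    # Initialize the new (trainable) row from the mean of existing rows,
    # a common heuristic when growing a vocabulary (assumed, not from the paper)
    new_row = embeddings.mean(axis=0, keepdims=True)
    return vocab, np.vstack([embeddings, new_row])
```

In a real system the new row would then be fine-tuned along with the model so its output state encodes the class description.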