UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two obstacles in RGB-T semantic segmentation: strictly aligned visible-infrared-label triplets are scarce, and existing generation methods cannot ensure cross-modal spatial alignment and semantic consistency. The authors propose UniTriGen, a unified triplet generation framework that jointly models all three modalities with a diffusion model in a shared latent space, augmented with lightweight modality-specific residual adapters that accommodate each modality's imaging characteristics and output format. Generation is guided by text prompts, and a scene-balanced, class-aware few-shot sampling strategy counteracts imbalanced scene and class distributions, increasing the scene and class diversity of the generated triplets. Using only a limited amount of real paired data, UniTriGen produces high-quality, spatially aligned, semantically consistent, and modality-complementary triplets, yielding consistent performance improvements across various RGB-T segmentation models.
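As a rough illustration of what joint modeling in a shared latent space could look like, the block below writes out the standard latent-diffusion noise-prediction objective applied to a single joint latent. It is a sketch under assumptions, not the paper's stated formulation: the joint encoder $\mathcal{E}$, the noise schedule $\bar{\alpha}_t$, the denoiser $\epsilon_\theta$, and the text embedding $c$ are notation introduced here for illustration.

$$
\begin{aligned}
z_0 &= \mathcal{E}\big(x^{\mathrm{vis}},\, x^{\mathrm{ir}},\, y\big), \\
q(z_t \mid z_0) &= \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t)\, I\big), \\
\mathcal{L} &= \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ t}\ \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 .
\end{aligned}
$$

Because one latent $z_0$ carries VIS, IR, and Label together, a single denoising trajectory produces all three outputs, which is one plausible reading of how a unified model avoids the inconsistencies of cascaded generation.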

📝 Abstract

RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality-complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, in which VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in the limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of the generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.
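The abstract does not specify how the scene-balanced, class-aware sampling distribution is computed. The sketch below shows one simple way such a distribution could be built, assuming inverse-frequency weighting: each triplet is weighted by the inverse frequency of its scene multiplied by the inverse frequency of its rarest class. The weighting rule, the `scene` and `classes` fields, and the function names are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def balanced_weights(samples):
    """Return one sampling probability per triplet (hypothetical scheme).

    Each sample is a dict with a 'scene' string and a 'classes' set.
    Assumption: weight = (1 / scene frequency) * (1 / frequency of the
    rarest class present). The paper's actual strategy may differ.
    Expects a non-empty list of samples.
    """
    scene_freq = Counter(s["scene"] for s in samples)
    class_freq = Counter(c for s in samples for c in s["classes"])

    weights = []
    for s in samples:
        scene_w = 1.0 / scene_freq[s["scene"]]
        # The rarest class in the sample gives the largest inverse frequency.
        # Samples without any labelled class fall back to a small weight.
        class_w = max(
            (1.0 / class_freq[c] for c in s["classes"]),
            default=1.0 / len(samples),
        )
        weights.append(scene_w * class_w)

    total = sum(weights)
    return [w / total for w in weights]


def draw_few_shot(samples, k, seed=0):
    """Draw k triplets (with replacement) from the balanced distribution."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=balanced_weights(samples), k=k)


if __name__ == "__main__":
    # Toy pool: daytime car scenes dominate, one night scene with a person.
    pool = [
        {"scene": "day", "classes": {"car"}},
        {"scene": "day", "classes": {"car"}},
        {"scene": "day", "classes": {"car"}},
        {"scene": "night", "classes": {"person"}},
    ]
    for sample, p in zip(pool, balanced_weights(pool)):
        print(sample["scene"], sorted(sample["classes"]), round(p, 3))
```

Under this weighting, under-represented scenes and rare classes are drawn more often than uniform sampling would allow, which matches the stated goal of inducing a more balanced sampling distribution from a small, skewed pool of real triplets.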

Problem

Research questions and friction points this paper is trying to address.

RGB-T semantic segmentation

aligned triplet generation

cross-modal consistency

few-shot learning

data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Triplet Generation

Diffusion Model

Cross-Modal Consistency

Modality-Specific Adapters

Few-Shot Sampling
