STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification

📅 2025-03-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In multimodal classification, existing semi-supervised methods typically exploit only unimodal or modality-shared features, neglecting task-relevant modality-specific information and leaving a Modality Information Gap. To address this, the authors propose STiL, a semi-supervised learning framework for paired image and structured tabular data. The method introduces: (1) a disentangled contrastive consistency module that learns cross-modal invariant representations of shared information while retaining modality-specific information through disentanglement; (2) a consensus-guided pseudo-labeling strategy that generates reliable pseudo-labels from classifier consensus; and (3) a prototype-guided label smoothing technique that refines pseudo-label quality using prototype embeddings. Evaluated on natural and medical imaging multimodal benchmarks, STiL significantly outperforms state-of-the-art supervised, self-supervised, and semi-supervised unimodal and multimodal methods. The source code is publicly available.

📝 Abstract
Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is publicly available.
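The abstract describes learning cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement, but gives no formulas. A minimal numpy sketch of one plausible reading: each modality's embedding is split into a shared half and a specific half, a symmetric InfoNCE loss aligns the shared halves across modalities, and a cosine-orthogonality penalty keeps shared and specific parts decorrelated. The equal half-split, the function names, and the specific orthogonality penalty are all illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def info_nce(a, b, temp=0.1):
    """Symmetric InfoNCE over row-aligned embeddings (a[i] pairs with b[i])."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temp                       # (N, N) similarity matrix
    labels = np.arange(len(a))
    def ce(l):                                    # cross-entropy on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()
    return 0.5 * (ce(logits) + ce(logits.T))      # both retrieval directions

def disentangled_losses(img_emb, tab_emb, d_shared):
    """Split each modality's embedding into shared / specific parts.
    Assumes d_shared is half the embedding dim so the cosine penalty
    below can compare the two parts elementwise (illustrative choice)."""
    img_sh, img_sp = img_emb[:, :d_shared], img_emb[:, d_shared:]
    tab_sh, tab_sp = tab_emb[:, :d_shared], tab_emb[:, d_shared:]
    align = info_nce(img_sh, tab_sh)              # cross-modal consistency on shared parts
    def ortho(u, v):                              # mean squared cosine similarity
        u = u / np.linalg.norm(u, axis=1, keepdims=True)
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        return ((u * v).sum(axis=1) ** 2).mean()
    # keep shared and specific subspaces decorrelated within each modality
    return align, ortho(img_sh, img_sp) + ortho(tab_sh, tab_sp)
```

In practice both losses would be weighted and added to the supervised objective; the weighting is not specified in this summary.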
Problem

Research questions and friction points this paper is trying to address.

Addresses the scarcity of labeled data in multimodal image-tabular learning.
Overcomes the task-agnostic nature of SSL pretraining by exploiting task-relevant modality-specific information.
Proposes STiL to bridge the Modality Information Gap with SemiSL techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled contrastive consistency for cross-modal learning
Consensus-guided pseudo-labeling for reliable label generation
Prototype-guided label smoothing to refine pseudo-label quality
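The second and third contributions above can be sketched concretely. This summary does not give the exact rules, so the following numpy sketch is one plausible reading: a pseudo-label is kept only when two classifiers agree on the argmax and their averaged confidence clears a threshold, and the resulting one-hot target is then smoothed toward a distribution over cosine similarities to class prototype embeddings. All function names, the averaging rule, and the mixing weight `alpha` are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consensus_pseudo_labels(logits_a, logits_b, threshold=0.9):
    """Keep a pseudo-label only when both classifiers agree on the argmax
    and the averaged max probability clears the confidence threshold."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    ya, yb = pa.argmax(1), pb.argmax(1)
    conf = 0.5 * (pa.max(1) + pb.max(1))
    mask = (ya == yb) & (conf >= threshold)       # consensus + confidence
    return ya, mask

def prototype_smoothed_targets(pseudo, embeddings, prototypes, alpha=0.1, temp=0.5):
    """Mix the one-hot pseudo-label with a softmax over cosine similarities
    to per-class prototype embeddings (illustrative smoothing scheme)."""
    onehot = np.eye(len(prototypes))[pseudo]
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    proto_dist = softmax(e @ p.T / temp)          # soft assignment to prototypes
    return (1 - alpha) * onehot + alpha * proto_dist
```

Only samples selected by the consensus mask would contribute to the unlabeled-data loss; prototypes would typically be running means of labeled embeddings per class.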