CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

📅 2026-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal dataset selection methods are often dominated by a single modality or rely on coarse-grained sample scoring, which compromises cross-modal semantic integrity and distributional consistency. This work proposes CAST, a collapse-aware multi-scale topology fusion framework that constructs image and text modality topologies and merges them through a local collapse-aware fusion strategy. By aligning distributions at multiple scales in the diffusion wavelet domain and selecting the coreset with a relation-aware soft coverage mechanism, the approach jointly models global semantic structure, local fine-grained details, and redundancy in dense regions, which unimodal or coarse-grained sampling paradigms do not handle together. On Flickr30K and MS-COCO, the method outperforms existing dataset selection baselines and shows better cross-architecture generalization and energy efficiency than state-of-the-art multimodal synthesis approaches.
📝 Abstract
Training large multimodal models relies on massive image-text datasets, which incur prohibitive computational overhead. Dataset selection offers a promising alternative by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lose semantics in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods yield coresets biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines and achieves better cross-architecture generalization and energy efficiency than state-of-the-art multimodal synthesis methods.
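To make the pipeline in the abstract more concrete, here is a minimal sketch of two of its stages: building and fusing per-modality topologies, then selecting a coreset with soft coverage. It is not the authors' implementation. The function names, the cosine k-NN graphs, the plain convex-combination fusion (standing in for the collapse-aware fusion), and the two-hop "indirect relation" term are all assumptions made for illustration, and the diffusion wavelet matching step is omitted entirely.

```python
import numpy as np


def knn_affinity(emb, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)             # a point is not its own neighbour
    nbrs = np.argsort(-sim, axis=1)[:, :k]     # k most similar points per row
    rows = np.arange(len(emb))[:, None]
    w = np.zeros_like(sim)
    w[rows, nbrs] = np.clip(sim[rows, nbrs], 0.0, None)
    return np.maximum(w, w.T)                  # keep an edge if either end chose it


def fuse_topologies(w_img, w_txt, alpha=0.5):
    """Placeholder fusion (assumption): convex combination of both graphs."""
    return alpha * w_img + (1.0 - alpha) * w_txt


def soft_coverage_select(w, budget, beta=0.5):
    """Greedy selection under a relation-aware soft coverage score."""
    p = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)  # row-normalised graph
    r = p + beta * (p @ p)          # direct relation plus damped two-hop relation
    r = r / max(r.max(), 1e-12)     # rescale so every entry lies in [0, 1]
    np.fill_diagonal(r, 1.0)        # a selected sample fully covers itself
    uncovered = np.ones(len(w))     # remaining uncovered mass per sample
    taken = np.zeros(len(w), dtype=bool)
    selected = []
    for _ in range(budget):
        gain = r @ uncovered        # coverage each candidate would still add
        gain[taken] = -np.inf       # never pick the same sample twice
        best = int(np.argmax(gain))
        selected.append(best)
        taken[best] = True
        uncovered *= 1.0 - r[best]  # samples related to `best` now count for less
    return selected


# Toy data (hypothetical): samples 0-2 form one cluster and samples 3-5 another,
# in both the image and the text embedding space.
img = np.array([[1, 0.0], [1, 0.1], [1, -0.1], [0.0, 1], [0.1, 1], [-0.1, 1]])
txt = np.array([[1, 0, 0], [1, 0.1, 0], [1, 0, 0.1],
                [0, 1, 0], [0, 1, 0.1], [0.1, 1, 0]])
fused = fuse_topologies(knn_affinity(img, k=2), knn_affinity(txt, k=2))
core = soft_coverage_select(fused, budget=2)
```

On this toy data the two selected samples come from different clusters: once one sample is picked, its cluster mates are almost fully covered, so a second pick from the same cluster adds little. This is the redundancy penalty in dense regions that the abstract describes, reproduced here only in a simplified form.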
Problem

Research questions and friction points this paper is trying to address.

multimodal coreset selection
cross-modal imbalance
distributional equivalence
redundancy-aware coverage
fine-grained details
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal coreset selection
topology fusion
multi-scale distribution matching
collapse-aware refinement
relational coverage
Boran Zhao
School of Software Engineering, the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Hetian Liu
School of Software Engineering, Xi’an Jiaotong University
Zhenxian Hu
XJTU-POLIMI Joint School, Xi’an Jiaotong University
Yuqing Yuan
Faculty of Electronic and Information Engineering, Xi’an Jiaotong University
Yu Yan
School of Human Settlements and Civil Engineering, Xi’an Jiaotong University
Pengju Ren
Professor, Xi'an Jiaotong University