🤖 AI Summary
Existing multimodal dataset selection methods are often dominated by a single modality or biased by coarse-grained sample scoring, which compromises cross-modal semantic integrity and distributional consistency. This work proposes CAST, a collapse-aware multi-scale topology fusion framework that constructs image and text modality topologies and merges them through local-collapse-aware refinement and cross-modal fusion. By matching distributions at multiple scales in the diffusion wavelet domain and selecting the coreset with a relation-aware soft coverage mechanism, the approach jointly models global semantic structure, local fine-grained details, and redundancy in dense regions, which conventional single-modality or coarse-grained sampling paradigms do not. On Flickr30K and MS-COCO, the method outperforms existing dataset selection baselines and shows better cross-architecture generalization and energy efficiency than state-of-the-art multimodal synthesis approaches.
📝 Abstract
Training large multimodal models relies on massive image-text datasets, which incur prohibitive computational overhead. Dataset selection offers a promising alternative by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lose semantics in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods yield coresets biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines and surpasses state-of-the-art multimodal synthesis methods in cross-architecture generalization and energy efficiency.
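The abstract does not specify CAST's exact formulation, so the following is only a minimal sketch of the general idea of topology fusion followed by redundancy-aware soft coverage, not the authors' method. It assumes precomputed image and text embeddings, builds a cosine kNN graph per modality, fuses them with a plain average (a stand-in for the paper's collapse-aware refinement and fusion), and greedily picks samples by marginal soft coverage. The diffusion wavelet matching step is omitted. All function names (`knn_affinity`, `fuse_topologies`, `soft_coverage_select`) and the parameter `k` are hypothetical.

```python
import numpy as np


def knn_affinity(emb, k):
    """Symmetric cosine-similarity kNN graph (dense; assumes k < n, nonzero rows)."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)             # exclude self from the neighbour search
    idx = np.argsort(-sim, axis=1)[:, :k]      # k most similar samples per row
    rows = np.repeat(np.arange(len(emb)), k)
    cols = idx.ravel()
    aff = np.zeros_like(sim)
    aff[rows, cols] = np.clip(sim[rows, cols], 0.0, 1.0)
    return np.maximum(aff, aff.T)              # symmetrise the directed kNN edges


def fuse_topologies(aff_img, aff_txt):
    """Assumption: a plain average stands in for the paper's collapse-aware fusion."""
    return 0.5 * (aff_img + aff_txt)


def soft_coverage_select(aff, budget):
    """Greedy selection: a pick fully covers itself and partially covers neighbours."""
    n = aff.shape[0]
    cover = aff.copy()
    np.fill_diagonal(cover, 1.0)               # a sample covers itself completely
    covered = np.zeros(n)                      # soft coverage so far, in [0, 1]
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(budget):
        # Marginal gain of candidate j: total coverage it adds beyond what exists.
        gain = np.maximum(cover - covered[:, None], 0.0).sum(axis=0)
        gain[chosen] = -np.inf                 # never pick the same sample twice
        j = int(np.argmax(gain))
        selected.append(j)
        chosen[j] = True
        covered = np.maximum(covered, cover[:, j])
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(200, 32))           # placeholder image embeddings
    txt = rng.normal(size=(200, 32))           # placeholder text embeddings
    fused = fuse_topologies(knn_affinity(img, 10), knn_affinity(txt, 10))
    print(soft_coverage_select(fused, 20))
```

Because a candidate's gain counts only coverage that is not already provided, samples in a dense cluster that already contains a selected neighbour score low. This is one simple way to penalize redundant picks in dense regions.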