Decoupled Audio-Visual Dataset Distillation

📅 2025-11-21

🤖 AI Summary

Audio-visual dataset distillation faces two key challenges: (1) independently and randomly initialized encoders produce inconsistent modality mapping spaces, which hinders alignment, and (2) direct inter-modal interaction degrades modality-specific (private) information. To address these, the paper proposes DAVDD, a pretraining-based decoupled distillation framework that explicitly separates common (cross-modal) and private (modality-specific) representations. Methodologically: (1) a diverse pretrained bank supplies stable modality features, keeping the mapping spaces consistent; (2) a lightweight decoupler bank disentangles those features into common and private representations; (3) Common Intermodal Matching and Sample-Distribution Joint Alignment align the common representations at both the sample level and the global distribution level; and (4) private representations are kept out of cross-modal interaction to preserve modality-specific cues. Experiments across multiple benchmarks and images-per-class (IPC) settings show that DAVDD outperforms state-of-the-art methods in every reported setting.

📝 Abstract

Audio-Visual Dataset Distillation aims to compress large-scale datasets into compact subsets while preserving the performance obtained by training on the original data. However, conventional Distribution Matching (DM) methods struggle to capture intrinsic cross-modal alignment. Subsequent studies have attempted to introduce cross-modal matching, but two major challenges remain: (i) independently and randomly initialized encoders lead to inconsistent modality mapping spaces, increasing training difficulty; and (ii) direct interactions between modalities tend to damage modality-specific (private) information, thereby degrading the quality of the distilled data. To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework. DAVDD leverages a diverse pretrained bank to obtain stable modality features and uses a lightweight decoupler bank to disentangle them into common and private representations. To effectively preserve cross-modal structure, we further introduce Common Intermodal Matching together with a Sample-Distribution Joint Alignment strategy, ensuring that shared representations are aligned at both the sample level and the global distribution level. Meanwhile, private representations are entirely isolated from cross-modal interaction, safeguarding modality-specific cues throughout distillation. Extensive experiments across multiple benchmarks show that DAVDD achieves state-of-the-art results under all images-per-class (IPC) settings, demonstrating the effectiveness of decoupled representation learning for high-quality audio-visual dataset distillation. Code will be released.
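
To make the limitation of conventional DM concrete, here is a minimal sketch, not the paper's code. It assumes DM is approximated by matching mean features separately for each modality with a squared L2 distance. The function names and toy feature values are hypothetical. The point it illustrates is that a per-modality mean-matching loss cannot tell correctly paired synthetic audio-visual samples from mispaired ones.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_vector(features: Sequence[Vector]) -> List[float]:
    """Column-wise mean of a batch of feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def dm_loss(real: Sequence[Vector], synthetic: Sequence[Vector]) -> float:
    """Squared L2 distance between mean real and mean synthetic features.

    Assumption: a simplified stand-in for Distribution Matching.
    """
    mu_real = mean_vector(real)
    mu_syn = mean_vector(synthetic)
    return sum((r - s) ** 2 for r, s in zip(mu_real, mu_syn))


def per_modality_dm(real_audio, real_visual, syn_audio, syn_visual) -> float:
    """Sum of independent DM losses; the modalities never interact."""
    return dm_loss(real_audio, syn_audio) + dm_loss(real_visual, syn_visual)


# Toy features (hypothetical): row i of audio is paired with row i of visual.
real_audio = [[1.0, 0.0], [0.0, 1.0]]
real_visual = [[2.0, 0.0], [0.0, 2.0]]
syn_audio = [[1.0, 0.0], [0.0, 1.0]]
syn_visual_paired = [[2.0, 0.0], [0.0, 2.0]]
syn_visual_swapped = [[0.0, 2.0], [2.0, 0.0]]  # same samples, wrong pairing

loss_paired = per_modality_dm(real_audio, real_visual, syn_audio, syn_visual_paired)
loss_swapped = per_modality_dm(real_audio, real_visual, syn_audio, syn_visual_swapped)
print(loss_paired, loss_swapped)  # 0.0 0.0
```

Both losses are zero because swapping rows leaves the per-modality means unchanged. The pairing information is invisible to this objective, which is the gap that cross-modal matching is meant to close.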

Problem

Research questions and friction points this paper is trying to address.

Capturing intrinsic cross-modal alignment in dataset distillation

Inconsistent modality mapping spaces from independent encoder initialization

Preserving modality-specific information during cross-modal interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled framework separates common and private representations

Pretrained bank provides stable modality feature initialization

Joint alignment preserves cross-modal structure at multiple levels
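
The ideas listed above can be sketched in a few lines. This is an illustration under stated assumptions, not DAVDD's implementation. The decoupler is replaced by a hypothetical fixed slice of the feature vector (the paper describes a lightweight decoupler bank), both alignment terms use a squared L2 distance, and `weight` is an invented balancing parameter. Only the common parts enter the cross-modal terms; the private parts are never compared across modalities.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def decouple(feature: Sequence[float], common_dim: int) -> Tuple[Vector, Vector]:
    """Hypothetical decoupler: the first `common_dim` entries are treated as
    the common part, the rest as the private part."""
    return list(feature[:common_dim]), list(feature[common_dim:])


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sample_alignment(audio_common, visual_common) -> float:
    """Sample level: average distance between paired common representations."""
    pairs = list(zip(audio_common, visual_common))
    return sum(squared_distance(a, v) for a, v in pairs) / len(pairs)


def distribution_alignment(audio_common, visual_common) -> float:
    """Distribution level: distance between the two modality means."""
    return squared_distance(mean_vector(audio_common), mean_vector(visual_common))


def joint_alignment(audio_feats, visual_feats, common_dim: int, weight: float = 1.0) -> float:
    """Combine both levels; private parts are excluded from cross-modal terms."""
    audio_common = [decouple(f, common_dim)[0] for f in audio_feats]
    visual_common = [decouple(f, common_dim)[0] for f in visual_feats]
    return (sample_alignment(audio_common, visual_common)
            + weight * distribution_alignment(audio_common, visual_common))


# Toy features (hypothetical): two common dims followed by one private dim.
audio = [[1.0, 0.0, 5.0], [0.0, 1.0, 7.0]]
visual_matched = [[1.0, 0.0, -2.0], [0.0, 1.0, -3.0]]
visual_swapped = [[0.0, 1.0, -2.0], [1.0, 0.0, -3.0]]

print(joint_alignment(audio, visual_matched, common_dim=2))  # 0.0
print(joint_alignment(audio, visual_swapped, common_dim=2))  # 2.0
```

In the swapped case the distribution-level term is still zero, since both modality means coincide, while the sample-level term rises to 2.0. This is one way to see why alignment at both levels carries more information than either alone. The private dimensions differ widely between modalities in both cases and contribute nothing to the loss.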
