🤖 AI Summary
Existing layer-wise sequential adaptation methods for audio-visual learning suffer from excessive parameter counts and high memory overhead. To address this, we propose MoLT, a lightweight, parallel deep adaptation framework. MoLT abandons full-layer sequential adaptation and instead performs parallel extraction and fusion of modality-specific tokens exclusively in the deeper Transformer layers. It introduces inter-layer token distillation and a dynamic fusion mechanism, coupled with orthogonality regularization to suppress token redundancy; restricting adaptation to the late layers additionally prevents error propagation from volatile early-layer features. Furthermore, MoLT employs dual adapters: one dedicated to modeling modality-specific characteristics and the other to capturing cross-modal interactions. Evaluated on multiple audio-visual benchmarks, including visual question answering, segmentation, and event localization, MoLT achieves superior performance over state-of-the-art methods while using significantly fewer parameters and substantially lower memory consumption, demonstrating its efficiency, robustness, and strong generalization capability.
📝 Abstract
In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at every transformer layer with a parallel, lightweight scheme that extracts and fuses layer-wise tokens only from the late layers. We adopt two types of adapters to distill modality-specific information and cross-modal interactions into compact latent tokens in a layer-wise manner. A token fusion module then dynamically fuses these layer-wise tokens according to their relative significance. To prevent redundancy among the latent tokens, we apply orthogonality regularization between them during training. Through a systematic analysis of where adaptation is most effective in pre-trained transformers, we extract latent tokens only from the late layers. This strategic adaptation avoids error propagation from volatile early-layer features, thereby maximizing adaptation performance while maintaining parameter and memory efficiency. Through extensive experiments, we demonstrate that MoLT outperforms existing methods on diverse audio-visual benchmarks, including Audio-Visual Question Answering, Audio-Visual Segmentation, and Audio-Visual Event Localization.
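The two mechanisms named in the abstract, significance-weighted fusion of layer-wise tokens and an orthogonality penalty between them, can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the function names, the mean-pooled token shapes, the softmax-based fusion, and the squared off-diagonal Gram penalty are all choices made here for clarity.

```python
import numpy as np

def orthogonality_loss(tokens):
    """Penalize redundancy among latent tokens (one per late layer).

    tokens: (num_layers, dim). We normalize each token, form the Gram
    matrix, and sum the squared off-diagonal entries, so the loss is 0
    when tokens are mutually orthogonal. (The paper's exact penalty may
    differ; this is one common formulation.)
    """
    z = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    gram = z @ z.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))

def fuse_tokens(tokens, scores):
    """Dynamically fuse layer-wise tokens by relative significance.

    tokens: (num_layers, dim); scores: (num_layers,) significance
    logits (in the paper these would be produced by the fusion module).
    A softmax turns scores into weights for a convex combination.
    """
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return (w[:, None] * tokens).sum(axis=0)
```

For instance, mutually orthogonal tokens (`np.eye(3)`) incur zero orthogonality loss, while duplicated tokens are maximally penalized; with uniform significance scores, fusion reduces to a plain average of the layer-wise tokens.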