🤖 AI Summary
In multimodal joint training, the dominant modality tends to overpower backpropagation, causing imbalanced optimization. Existing remedies leave two problems open: (i) prolonged dominance weakens the coupling between representations and outputs late in training, so redundant information accumulates; and (ii) current gradient regulation methods scale the dominant modality's gradients uniformly, neglecting inter-modal semantics and directionality. To address this, the authors propose Adaptive Redundancy Regulation (RedReg), a semantic-aware gradient regulation framework inspired by the information bottleneck principle. RedReg uses a redundancy phase monitor to trigger intervention only when redundancy is high, and a co-information gating mechanism to estimate the dominant modality's contribution from cross-modal semantics. When the task relies mainly on a single modality, the suppression term is disabled to preserve modality-specific information. When intervention is triggered, the dominant modality's gradient is projected onto the orthogonal complement of the joint multimodal gradient subspace and suppressed according to redundancy, rather than scaled uniformly. The core idea is directionally constrained, semantics-preserving gradient modulation. Experiments show that RedReg outperforms current major methods in most scenarios, ablation studies support the contribution of each component, and the code is publicly available.
📝 Abstract
Multimodal learning aims to improve performance by leveraging data from multiple sources. During joint multimodal training, modality bias often lets the advantaged modality dominate backpropagation, leading to imbalanced optimization. Existing methods still face two problems. First, the long-term dominance of one modality weakens representation-output coupling in the late stages of training, resulting in the accumulation of redundant information. Second, previous methods often adjust the gradients of the advantaged modality directly and uniformly, ignoring the semantics and directionality between modalities. To address these limitations, we propose Adaptive Redundancy Regulation for Balanced Multimodal Information Refinement (RedReg), which is inspired by the information bottleneck principle. Specifically, we construct a redundancy phase monitor that uses a joint criterion of effective gain growth rate and redundancy to trigger intervention only when redundancy is high. Furthermore, we design a co-information gating mechanism to estimate the contribution of the current dominant modality based on cross-modal semantics. When the task primarily relies on a single modality, the suppression term is automatically disabled to preserve modality-specific information. Finally, we project the gradient of the dominant modality onto the orthogonal complement of the joint multimodal gradient subspace and suppress the gradient according to redundancy. Experiments show that our method outperforms current major methods in most scenarios, and ablation experiments verify the effectiveness of each component. The code is available at https://github.com/xia-zhe/RedReg.git
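
The projection-and-suppression step described above can be illustrated with a minimal sketch. It is not taken from the RedReg repository. The function names, the `threshold` parameter, the scalar `redundancy` and `gate` inputs, and the exact update rule (subtracting a redundancy-scaled share of the orthogonal-complement component) are all assumptions made for illustration; the abstract does not specify them.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors given as lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: return an orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are (numerically) linearly dependent
            basis.append([wi / norm for wi in w])
    return basis


def orthogonal_complement_component(g, joint_grads):
    """Part of `g` lying outside the subspace spanned by `joint_grads`."""
    residual = list(g)
    for q in orthonormal_basis(joint_grads):
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
    return residual


def regulate_dominant_gradient(g_dom, joint_grads, redundancy, gate, threshold=0.5):
    """Hypothetical regulation rule (an assumption, not the paper's formula).

    redundancy: scalar in [0, 1], assumed to come from a redundancy monitor.
    gate:       scalar in [0, 1], assumed to come from a co-information gate;
                0 disables suppression (task relies on a single modality).
    """
    if redundancy <= threshold or gate <= 0.0:
        return list(g_dom)  # no intervention: gradient passes through unchanged
    strength = min(1.0, gate * redundancy)
    g_perp = orthogonal_complement_component(g_dom, joint_grads)
    # Shrink only the component outside the joint gradient subspace.
    return [g - strength * p for g, p in zip(g_dom, g_perp)]


if __name__ == "__main__":
    g_dom = [1.0, 2.0, 0.0]        # toy dominant-modality gradient
    joint = [[1.0, 0.0, 0.0]]      # toy joint-gradient subspace (one direction)
    print(regulate_dominant_gradient(g_dom, joint, redundancy=0.8, gate=1.0))
    print(regulate_dominant_gradient(g_dom, joint, redundancy=0.2, gate=1.0))
```

In this toy setup the component of the dominant gradient along the joint direction is left untouched, and only the remainder is shrunk when the assumed redundancy score exceeds the threshold. A real implementation would operate on framework tensors and derive `redundancy` and `gate` from the monitor and gating modules the abstract describes.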