🤖 AI Summary
This work addresses the optimization bias arising from inter-modal gradient conflicts in multimodal table-image fusion by proposing a gradient-aligned alternating learning paradigm. The approach alternates between unimodal training phases and employs a shared classifier with decoupled gradients, further enhanced by a cross-modal gradient surgery mechanism grounded in uncertainty estimation that harmonizes modality collaboration. Evaluated on multiple standard benchmarks, the proposed framework substantially outperforms existing table-image fusion methods and demonstrates superior robustness when tabular data is missing at test time, improving both performance and reliability.
📝 Abstract
Multimodal tabular-image fusion is an emerging task that has received increasing attention across various domains. However, existing methods may be hindered by gradient conflicts between modalities, which mislead the optimization of each unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning scheme with a shared classifier to decouple the multimodal gradients while facilitating cross-modal interaction. Furthermore, we design an uncertainty-based cross-modal gradient surgery that selectively aligns cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL provides effective unimodal assistance and boosts overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method in comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and baselines designed for missing tabular data at test time. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
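To illustrate the gradient-conflict problem the abstract refers to, here is a minimal sketch of generic projection-based gradient surgery (in the style of PCGrad): when the gradients of the image and tabular branches on the shared classifier point in conflicting directions (negative inner product), each is projected onto the normal plane of the other before they are combined. This is an illustrative simplification, not GAAL's exact uncertainty-based rule; the function name and signature are hypothetical.

```python
import numpy as np

def gradient_surgery(g_img: np.ndarray, g_tab: np.ndarray):
    """Project away the conflicting component between two modality
    gradients on shared parameters (PCGrad-style sketch; GAAL's
    actual criterion additionally weighs modality uncertainty)."""
    dot = float(np.dot(g_img, g_tab))
    if dot < 0:  # gradients conflict: mutual projection onto normal planes
        g_img_aligned = g_img - dot / float(np.dot(g_tab, g_tab)) * g_tab
        g_tab_aligned = g_tab - dot / float(np.dot(g_img, g_img)) * g_img
        return g_img_aligned, g_tab_aligned
    return g_img, g_tab  # no conflict: leave gradients unchanged

# Example: two conflicting gradient directions on the shared classifier
g_img = np.array([1.0, 0.0])
g_tab = np.array([-1.0, 1.0])
g_img_new, g_tab_new = gradient_surgery(g_img, g_tab)
# After surgery, each adjusted gradient is orthogonal to the other
# modality's original gradient, so neither update harms the other.
```

Summing the aligned gradients then updates the shared classifier in a direction that does not degrade either modality, which is the intuition behind steering shared parameters "to benefit all modalities."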