Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to adapt to dynamically varying reliability across modalities and lack effective modeling of fine-grained action cues, thereby limiting fusion performance. To address these challenges, this work proposes a Mixture-of-Modality-Experts framework coupled with a Holistic Token Learning strategy. The former enhances modality-specific expertise and interpretability through adaptively collaborating modality experts, while the latter jointly optimizes class and spatiotemporal tokens to enable knowledge-driven holistic representation learning. Evaluated on driver action recognition benchmarks, the proposed method significantly outperforms both unimodal and state-of-the-art multimodal baselines, demonstrating its superiority in fine-grained understanding and interpretable multimodal fusion.
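To make the "adaptively collaborating experts" idea concrete, here is a minimal PyTorch sketch of a reliability-aware mixture over modality experts. The class name `MoMEFusion`, the per-expert MLPs, and the single linear gate are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a Mixture-of-Modality-Experts layer (not the
# authors' code): each modality gets its own expert, and an input-dependent
# gate re-weights experts so fusion adapts to per-sample modality reliability.
import torch
import torch.nn as nn


class MoMEFusion(nn.Module):
    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        # One expert per modality; simple MLPs stand in for the real experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_modalities)
        ])
        # The gate sees all modality features and emits per-expert weights.
        self.gate = nn.Linear(num_modalities * dim, num_modalities)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (batch, dim) features, one per modality.
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        outs = torch.stack(
            [expert(f) for expert, f in zip(self.experts, feats)], dim=1
        )  # (batch, num_modalities, dim)
        # Reliability-aware fusion: weighted sum of expert outputs.
        return (weights.unsqueeze(-1) * outs).sum(dim=1)
```

A gate of this kind lets the model down-weight an unreliable stream on a per-sample basis (for instance, a dark RGB view at night), which is the behavior fixed fusion modules cannot provide.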
📝 Abstract
Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address these issues, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. Experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.
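As a rough illustration of the token design described above, the sketch below jointly processes a learnable class token and a small set of spatio-temporal tokens alongside a modality expert's patch tokens. All names (`HolisticTokens`, `num_st_tokens`) and the plain transformer encoder are hypothetical stand-ins; the paper's HTL module may differ in structure and supervision.

```python
# Minimal sketch (our reading, hypothetical names) of Holistic Token Learning:
# a class token and several spatio-temporal tokens are prepended to the patch
# tokens and optimized jointly, so the class token carries category-level
# knowledge while the spatio-temporal tokens summarize fine-grained cues.
import torch
import torch.nn as nn


class HolisticTokens(nn.Module):
    def __init__(self, dim: int, num_st_tokens: int = 4, depth: int = 2):
        super().__init__()
        # dim must be divisible by nhead for multi-head attention.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.st_tokens = nn.Parameter(torch.zeros(1, num_st_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens: torch.Tensor):
        # patch_tokens: (batch, seq_len, dim) from one modality expert.
        b = patch_tokens.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        st = self.st_tokens.expand(b, -1, -1)
        x = self.encoder(torch.cat([cls, st, patch_tokens], dim=1))
        # The class token feeds the classifier; the spatio-temporal tokens can
        # be aligned across experts for inter-expert knowledge transfer.
        return x[:, 0], x[:, 1 : 1 + self.st_tokens.size(1)]
```

Under this reading, intra-expert refinement comes from attention between the learnable tokens and the patch tokens, while inter-expert transfer would be imposed by an additional loss tying the spatio-temporal tokens of different experts together.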
Problem

Research questions and friction points this paper is trying to address.

multimodal visual analytics
driver action recognition
fine-grained action cues
modality reliability
heterogeneous modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Modality-Experts
Holistic Token Learning
multimodal fusion
fine-grained action recognition
adaptive collaboration