🤖 AI Summary
RGB and event cameras are complementary modalities, but they exhibit a significant spatiotemporal asymmetry: RGB frames provide high spatial resolution, whereas event streams provide high temporal resolution and high dynamic range (HDR). This asymmetry hinders multimodal object tracking. To address it, we propose a hierarchical asymmetric distillation framework that explicitly mitigates modality discrepancies through layered feature alignment and spatiotemporal consistency modeling, enabling efficient cross-modal knowledge transfer into a lightweight student network. By combining multimodal knowledge distillation with joint optimization, the framework outperforms state-of-the-art methods across multiple benchmarks. Ablation studies confirm that both the hierarchical alignment and the asymmetric distillation design are effective and necessary. To our knowledge, this is the first work to systematically model and alleviate the spatiotemporal asymmetry between RGB and event modalities, yielding a compact multimodal tracker that balances accuracy and efficiency.
📝 Abstract
RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between the two modalities because of their fundamentally different imaging mechanisms, and this asymmetry hinders effective multi-modal integration. To address this issue, we propose Hierarchical Asymmetric Distillation (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD employs a hierarchical alignment strategy that minimizes information loss while preserving the student network's computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each component. The code will be released soon.
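The abstract does not specify HAD's loss functions, so the following is only a minimal sketch of the general idea behind multi-level feature distillation: a student is penalized for deviating from a teacher at several feature levels, and the per-level penalties are combined with weights. The function names, the use of mean squared error, and the level weights are all illustrative assumptions, not the authors' formulation.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized, non-empty feature vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hierarchical_distill_loss(
    student_feats: Sequence[Vector],
    teacher_feats: Sequence[Vector],
    level_weights: Sequence[float],
) -> float:
    """Weighted sum of per-level MSE terms (hypothetical helper, not HAD's loss).

    Each index is one feature level, e.g. shallow to deep layers.
    """
    if not (len(student_feats) == len(teacher_feats) == len(level_weights)):
        raise ValueError("one weight is required per feature level")
    return sum(
        w * mse(s, t)
        for w, s, t in zip(level_weights, student_feats, teacher_feats)
    )


# Toy features: the student matches the teacher at level 1 but not at level 0.
teacher = [[1.0, 2.0], [0.5, 0.5, 0.5]]
student = [[1.0, 0.0], [0.5, 0.5, 0.5]]
weights = [0.5, 1.0]  # assumed weights, chosen only for illustration

loss = hierarchical_distill_loss(student, teacher, weights)
print(loss)
```

In a real tracker the features would be tensors from the teacher and student backbones, and this term would be added to the tracking loss during joint optimization; how HAD handles the asymmetry between modalities is not recoverable from the abstract and is not modeled here.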