🤖 AI Summary
To address the limited interpretability of monocular visual-inertial odometry (VIO) in safety-critical applications, this paper proposes a Transformer-based generative adversarial framework built around error-driven iterative optimization. The method introduces a critic-guided, multi-round pose-trajectory refinement mechanism and, for the first time, learns self-emergent sensor-specific weighting coefficients, enabling physically interpretable visualization of the visual and inertial modalities' respective contributions. By jointly modeling image sequences and 6-DoF inertial measurements, the framework combines feature fusion, dynamic modality weighting, and adversarial training to improve both prediction accuracy and decision transparency. Evaluated on the KITTI dataset, the approach matches well-known state-of-the-art learning-based VIO methods in translational and rotational accuracy while additionally providing verifiable perceptual-attention analysis.
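The summary's "dynamic modality weighting" can be pictured as a softmax gate over per-modality logits derived from context: the resulting weights are interpretable (they sum to one and can be plotted per frame) and scale each modality's features before fusion. The paper does not spell out the mechanism at this level, so the following is a minimal toy sketch; the function names, feature shapes, and the softmax-gate formulation are all illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(visual_feat, inertial_feat, context_logits):
    """Hypothetical adaptive fusion: a softmax over two context-derived
    logits yields interpretable per-modality weights, and the fused
    feature is the weighted sum of the two modality features."""
    w_vis, w_imu = softmax(context_logits)
    fused = [w_vis * v + w_imu * i
             for v, i in zip(visual_feat, inertial_feat)]
    return fused, (w_vis, w_imu)

# With equal logits both modalities contribute equally (weights 0.5/0.5);
# e.g. during motion blur a learned context encoder could down-weight vision.
fused, weights = fuse([1.0, 2.0], [3.0, 4.0], [0.0, 0.0])
```

Because the weights are an explicit, normalized intermediate quantity rather than an opaque activation, they can be logged and visualized per time step, which is the sense in which the weighting supports explainability.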
📝 Abstract
We introduce XIRVIO, a transformer-based Generative Adversarial Network (GAN) framework for monocular visual-inertial odometry (VIO). Taking sequences of images and 6-DoF inertial measurements as inputs, XIRVIO's generator predicts pose trajectories through an iterative refinement process, and the critic then evaluates each iteration to select the optimised prediction. Additionally, the self-emergent adaptive sensor weighting reveals how XIRVIO attends to each sensory input based on contextual cues in the data, making it a promising approach for achieving explainability in safety-critical VIO applications. Evaluations on the KITTI dataset demonstrate that XIRVIO matches well-known state-of-the-art learning-based methods in terms of both translation and rotation errors.
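The generate-refine-select loop described in the abstract can be sketched as: run the generator for several refinement rounds, score every intermediate trajectory with the critic, and return the highest-scoring one rather than blindly taking the last iteration. The abstract does not specify the networks or the critic's scoring function, so everything below (the toy refinement rule, the toy critic, the iteration count) is an illustrative stand-in for the learned components.

```python
def generator_step(pose):
    """Toy stand-in for one refinement round of the learned generator:
    here it simply shrinks each pose component toward zero."""
    return [p * 0.5 for p in pose]

def critic_score(pose):
    """Toy stand-in for the learned critic: scores a candidate pose,
    here favouring poses closer to the origin."""
    return -sum(p * p for p in pose)

def refine_and_select(initial_pose, n_iters=4):
    """Run n_iters refinement rounds, keep every intermediate candidate,
    and return the one the critic scores highest -- the key point being
    that the critic, not the iteration index, picks the final output."""
    candidates = [initial_pose]
    pose = initial_pose
    for _ in range(n_iters):
        pose = generator_step(pose)
        candidates.append(pose)
    return max(candidates, key=critic_score)

best = refine_and_select([1.0, -2.0, 0.5])
```

With this toy refinement rule the last iteration always wins, but with a real learned generator later iterations can over-correct, which is why critic-based selection across all rounds can outperform simply returning the final refinement.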