🤖 AI Summary
Joint estimation of 6-DoF ego-motion and dense optical flow from event data is highly ill-posed. Existing unsupervised approaches rely on explicit regularization or structural priors, which often introduce bias, incur high computational cost, or converge to poor local minima. This paper proposes an unsupervised joint learning framework that models camera motion implicitly as a continuous B-spline and encodes optical flow as a neural implicit function. It introduces spatiotemporal consistency priors and differential geometric constraints, including photometric constancy and motion-geometry coupling, to enable end-to-end implicit regularization, thereby avoiding explicit smoothness assumptions and the deficiencies of depth-based parametrization. Evaluated across multiple 6-DoF motion scenarios, the method achieves state-of-the-art performance among unsupervised approaches and accuracy comparable to supervised methods.
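To make the parametrization concrete, below is a minimal sketch in PyTorch of the two implicit representations named above: a uniform cubic B-spline for the continuous 6-DoF trajectory and a coordinate MLP for the flow field. All names (`SplineTrajectory`, `FlowINR`), the spline order, and the architecture are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: simplified versions of the two implicit
# representations (continuous B-spline trajectory + neural flow field).
# The paper's actual spline order, SE(3) handling, and architecture may differ.
import torch
import torch.nn as nn

# Uniform cubic B-spline basis matrix: each query time blends four neighboring
# control knots, so the trajectory is C^2-continuous by construction; the
# smoothness prior lives in the parametrization, not in an explicit regularizer.
B_SPLINE = torch.tensor([[ 1.,  4.,  1., 0.],
                         [-3.,  0.,  3., 0.],
                         [ 3., -6.,  3., 0.],
                         [-1.,  3., -3., 1.]]) / 6.0

class SplineTrajectory(nn.Module):
    """Continuous 6-DoF pose: knots are (rotation-vector | translation) pairs.
    A simplification of proper on-manifold SE(3) splines."""
    def __init__(self, num_knots: int):
        super().__init__()
        assert num_knots >= 4
        self.knots = nn.Parameter(torch.zeros(num_knots, 6))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (N,) normalized times in [0, 1] -> (N, 6) poses.
        n_seg = self.knots.shape[0] - 3
        s = t.clamp(0.0, 1.0 - 1e-6) * n_seg
        i = s.floor().long()                   # segment index
        u = s - i.float()                      # local coordinate in [0, 1)
        U = torch.stack([torch.ones_like(u), u, u**2, u**3], dim=-1)
        w = U @ B_SPLINE                       # (N, 4) blending weights
        ctrl = self.knots[i.unsqueeze(-1) + torch.arange(4)]  # (N, 4, 6)
        return (w.unsqueeze(-1) * ctrl).sum(dim=1)

class FlowINR(nn.Module):
    """Implicit neural representation (x, y, t) -> (u, v): a coordinate MLP is
    smooth in space-time by construction, acting as an implicit regularizer."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, xyt: torch.Tensor) -> torch.Tensor:
        return self.net(xyt)

# Usage: query pose and flow at arbitrary continuous timestamps.
traj, flow = SplineTrajectory(num_knots=8), FlowINR()
pose = traj(torch.rand(1024))                  # (1024, 6)
uv = flow(torch.rand(1024, 3))                 # (1024, 2)
```

Because both representations are differentiable in time and space, spatiotemporal coherence comes for free from the inductive bias, rather than from an explicit variational smoothness term.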
📝 Abstract
The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately ill-posed, especially in the absence of ground-truth supervision. Existing works mitigate this ill-posedness either by enforcing smoothness of the flow field through an explicit variational regularizer or by leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former introduces bias into the results and adds computational overhead, while the latter, which parametrizes the optical flow in terms of scene depth and camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes ego-motion and optical flow via implicit spatiotemporal and geometric regularization. First, by modeling the camera's ego-motion as a continuous spline and the optical flow as an implicit neural representation, our method inherently embeds spatiotemporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies ego-motion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility in general 6-DoF motion scenarios, where it achieves state-of-the-art performance among unsupervised methods and remains competitive even with supervised approaches.
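One classical way to couple flow and ego-motion without ever estimating depth is the differential epipolar constraint (Ma et al., *An Invitation to 3-D Vision*): for a calibrated homogeneous pixel `X = [x, y, 1]` with image velocity `u` and instantaneous camera velocities `(v, w)`, any rigid-scene-consistent flow satisfies `u^T [v]x X + X^T s X = 0`, where `s = 0.5 * ([v]x [w]x + [w]x [v]x)`. The sketch below turns this residual into a loss; it is a plausible instance of the kind of differential geometric constraint the abstract describes, not necessarily the paper's exact formulation, and sign conventions vary with the choice of camera-versus-scene motion.

```python
# Sketch of a depth-free motion-geometry residual via the classical
# differential epipolar constraint: u^T [v]x X + X^T s X = 0 with
# s = 0.5 * ([v]x [w]x + [w]x [v]x). Sign conventions vary; this is not
# necessarily the exact constraint used in the paper.
import torch

def hat(a: torch.Tensor) -> torch.Tensor:
    """Skew-symmetric matrix [a]x such that hat(a) @ b == cross(a, b)."""
    z = torch.zeros_like(a[..., 0])
    return torch.stack([
        torch.stack([z, -a[..., 2], a[..., 1]], dim=-1),
        torch.stack([a[..., 2], z, -a[..., 0]], dim=-1),
        torch.stack([-a[..., 1], a[..., 0], z], dim=-1),
    ], dim=-2)

def diff_epipolar_residual(x, u, v, w):
    """x: (N, 3) calibrated homogeneous pixels [x, y, 1];
    u: (N, 3) flow lifted to 3D as [u_x, u_y, 0];
    v, w: (3,) linear / angular camera velocity.
    Depth is eliminated: the residual couples flow and ego-motion directly
    and vanishes for flow consistent with rigid motion (v, w)."""
    V = hat(v)
    # Only the symmetric part of [v]x [w]x contributes to the quadratic form.
    s = 0.5 * (V @ hat(w) + hat(w) @ V)
    return (torch.einsum('ni,ij,nj->n', u, V, x)
            + torch.einsum('ni,ij,nj->n', x, s, x))

# A possible unsupervised loss: squared residuals of the predicted flow
# against the trajectory's instantaneous velocities (hypothetical pipeline):
# loss = diff_epipolar_residual(x, u, v, w).pow(2).mean()
```

Since the scalar residual is bilinear in the flow and the velocities, it supplies a structure-and-motion prior that regularizes both quantities jointly while sidestepping the local minima associated with explicit depth parametrization.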