🤖 AI Summary
This work addresses the sim-to-real domain gap in existing event-based motion estimation methods, which rely heavily on large-scale synthetic data. The authors propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that uses only about 25 minutes of unannotated real-world event recordings to transfer motion knowledge from a pretrained RGB tracker via knowledge distillation. Motion-aware data curation and query sampling help disentangle object motion from dominant camera ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which are then used as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. TETO achieves state-of-the-art point tracking on EVIMO2 and optical flow estimation on DSEC, and improves frame interpolation quality on the BS-ERGB and HQ-EVFI benchmarks.
📝 Abstract
Event cameras capture per-pixel brightness changes with microsecond resolution, offering continuous motion information lost between RGB frames. However, existing event-based motion estimators depend on large-scale synthetic data that often suffers from a significant sim-to-real gap. We propose TETO (Tracking Events with Teacher Observation), a teacher-student framework that learns event motion estimation from only $\sim$25 minutes of unannotated real-world recordings through knowledge distillation from a pretrained RGB tracker. Our motion-aware data curation and query sampling strategy maximizes learning from limited data by disentangling object motion from dominant ego-motion. The resulting estimator jointly predicts point trajectories and dense optical flow, which we leverage as explicit motion priors to condition a pretrained video diffusion transformer for frame interpolation. We achieve state-of-the-art point tracking on EVIMO2 and optical flow on DSEC using orders of magnitude less training data, and demonstrate that accurate motion estimation translates directly to superior frame interpolation quality on BS-ERGB and HQ-EVFI.
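To make the two training ingredients named in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher RGB tracker has already produced pseudo-label tracks as an array of shape `(N, T, 2)`. The function names (`residual_motion`, `sample_queries`, `distill_loss`), the median-translation stand-in for ego-motion, and the Huber loss are all illustrative assumptions; the abstract does not specify how TETO estimates ego-motion or which loss it uses.

```python
import numpy as np

def residual_motion(tracks):
    """Per-point motion left after removing a crude ego-motion estimate.

    tracks: (N, T, 2) teacher point tracks in pixels (assumed input format).
    The median net displacement stands in for dominant camera motion;
    this is an assumption, not the paper's model.
    """
    disp = tracks[:, -1] - tracks[:, 0]      # (N, 2) net displacement per point
    ego = np.median(disp, axis=0)            # (2,) dominant translation
    return np.linalg.norm(disp - ego, axis=1)

def sample_queries(tracks, k, rng, eps=1e-6):
    """Sample k query indices, favoring points that move independently."""
    w = residual_motion(tracks) + eps        # eps keeps every point sampleable
    return rng.choice(len(tracks), size=k, replace=False, p=w / w.sum())

def distill_loss(student, teacher, visible, delta=1.0):
    """Huber loss between student and teacher tracks over visible points.

    student, teacher: (N, T, 2); visible: (N, T) mask from the teacher.
    """
    err = np.linalg.norm(student - teacher, axis=-1)               # (N, T)
    huber = np.where(err <= delta, 0.5 * err**2, delta * (err - 0.5 * delta))
    return float((huber * visible).sum() / max(visible.sum(), 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical scene: 90 background points follow the camera motion,
    # 10 object points carry an extra vertical motion.
    start = rng.uniform(0, 100, size=(100, 2))
    shift = np.tile([5.0, 0.0], (100, 1))
    shift[90:] += [0.0, 8.0]
    tracks = np.stack([start, start + shift], axis=1)              # (100, 2, 2)
    print(sample_queries(tracks, k=10, rng=rng))
```

In this toy scene the sampling weights concentrate on the ten independently moving points, which is the intended effect of motion-aware query sampling: supervision is spent on object motion rather than on the ego-motion that dominates most of the scene.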