🤖 AI Summary
Existing motion estimation benchmarks cannot jointly evaluate multi-task models that fuse RGB frames and event data, which hinders unified assessment of optical flow, scene flow, and point tracking. To address this, we propose the first joint RGB–event benchmark tailored to multi-task visual motion estimation, systematically introducing event camera data into a coordinated evaluation across all three tasks. Leveraging DVS simulation and NeRF-driven dynamic scene synthesis, we construct 12 realistic and synthetic sequences spanning five distinct motion patterns. We design a cross-modal alignment annotation strategy, render high-fidelity ground-truth motion fields, and complement them with a multi-scale error metric framework. The benchmark enables fair, task-agnostic comparison across optical flow, scene flow, and point tracking, and strengthens the rigor of generalization evaluation under cross-modal and cross-scene settings.
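The summary does not specify which simulator is used, but DVS simulation conventionally follows the event-camera contrast model: a pixel fires an event whenever its log intensity moves a contrast threshold `C` away from the reference level stored at its last event. A minimal sketch of that standard model follows; the function name `simulate_dvs_events` and the default `C=0.2` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def simulate_dvs_events(frames, timestamps, C=0.2, eps=1e-6):
    """Sketch of the standard DVS model: a pixel fires an event each time
    its log intensity moves by the contrast threshold C away from the
    reference level stored at its last event.

    frames:     (N, H, W) intensity frames
    timestamps: (N,) frame times in seconds
    Returns an array of events, one row per event: (t, x, y, polarity).
    """
    ref = np.log(frames[0].astype(np.float64) + eps)
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        cur = np.log(frame.astype(np.float64) + eps)
        diff = cur - ref
        # Number of threshold crossings per pixel since its last event,
        # and the direction of the change.
        n = np.floor(np.abs(diff) / C).astype(int)
        pol = np.sign(diff).astype(int)
        ys, xs = np.nonzero(n)
        for y, x in zip(ys, xs):
            # Real simulators (e.g. ESIM, v2e) interpolate event times
            # between frames; here all crossings share the frame time t.
            events.extend((t, x, y, pol[y, x]) for _ in range(n[y, x]))
        # Advance the per-pixel reference by whole multiples of C.
        ref += pol * n * C
    return np.array(events, dtype=np.float64)
```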
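The cross-modal alignment strategy itself is not detailed in the summary; at minimum, the asynchronous event stream must be bucketed onto the RGB frame timeline before annotations can be shared across modalities. A hypothetical sketch of that bucketing step (`align_events_to_frames` is an invented name, not the paper's method):

```python
import numpy as np

def align_events_to_frames(event_times, frame_times):
    """Assign each event to the frame interval
    [frame_times[i], frame_times[i+1]) containing it, so event windows
    and RGB frames share a common timeline for annotation.
    Returns one index array of events per frame interval."""
    idx = np.searchsorted(frame_times, event_times, side="right") - 1
    return [np.nonzero(idx == i)[0] for i in range(len(frame_times) - 1)]
```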
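The multi-scale error metric framework is likewise unspecified; one common construction evaluates endpoint error after block-averaging both flow fields at several resolutions, so coarse motion structure and fine detail are scored separately. A hedged sketch under that assumption (`multiscale_epe` and the scale set are illustrative, not the paper's definition):

```python
import numpy as np

def multiscale_epe(flow_pred, flow_gt, scales=(1, 2, 4, 8)):
    """Hypothetical multi-scale endpoint error: block-average both
    (H, W, 2) flow fields at each scale s, then take the mean per-pixel
    EPE at that scale. Returns {scale: mean EPE}."""
    results = {}
    for s in scales:
        # Crop to a multiple of s so sxs blocks tile exactly.
        H = (flow_gt.shape[0] // s) * s
        W = (flow_gt.shape[1] // s) * s

        def pool(f):
            # Area downsampling: average each sxs block of flow vectors.
            return f[:H, :W].reshape(H // s, s, W // s, s, 2).mean(axis=(1, 3))

        epe = np.linalg.norm(pool(flow_pred) - pool(flow_gt), axis=-1)
        results[s] = float(epe.mean())
    return results
```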