AI Summary
Event cameras produce sparse, asynchronous, high-temporal-resolution data, which poses significant challenges for modeling structural and motion information. To address this, we propose Fast Feature Field (F³), a continuous spatiotemporal feature representation tailored for downstream vision tasks. F³ implicitly encodes scene geometry and motion dynamics via a forward event-prediction mechanism, and efficiently maps sparse event streams into dense, continuous, multi-channel spatiotemporal feature fields using multi-resolution hash encoding and a deep-set network. The representation preserves event sparsity while ensuring temporal continuity, yielding strong robustness across varying illumination conditions, platforms, and sensor configurations. F³ achieves state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation. The representation can be computed at up to 440 Hz at VGA resolution and 120 Hz at HD resolution, with downstream task predictions at 25–75 Hz at HD resolution, and has been validated on automotive, quadrupedal, and aerial robotic platforms.
Abstract
This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, which we call Fast Feature Field ($\text{F}^3$). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. $\text{F}^3$ exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets, achieving 120 Hz at HD and 440 Hz at VGA resolutions. $\text{F}^3$ represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation on data from three robotic platforms (a car, a quadruped robot, and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoor, outdoor, urban, and off-road), and dynamic vision sensors (varying resolutions and event rates). Our implementations can predict these tasks at 25–75 Hz at HD resolution.
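The two ingredients named above, multi-resolution hash encoding and a deep-set (permutation-invariant) aggregation, can be sketched roughly as follows. All names, table sizes, resolutions, and the random (untrained) feature tables are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

# Large primes for spatial hashing of integer grid cells (as in common
# multi-resolution hash-encoding schemes); values here are illustrative.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(coords, n_levels=4, table_size=2**14, feat_dim=2,
                base_res=16, growth=2.0, rng=None):
    """Look up per-level features for (x, y, t) coordinates in [0, 1)^3.

    Each level grids the spatiotemporal volume at a finer resolution,
    hashes the integer cell index, and reads a small feature table
    (random here; learnable in a real model). Only cells that contain
    events are ever touched, so sparsity is preserved.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    tables = [rng.standard_normal((table_size, feat_dim)) * 0.01
              for _ in range(n_levels)]
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        cells = np.floor(coords * res).astype(np.uint64)        # (N, 3)
        idx = np.bitwise_xor.reduce(cells * PRIMES, axis=1) % table_size
        feats.append(table[idx])                                # (N, feat_dim)
    return np.concatenate(feats, axis=1)                        # (N, n_levels*feat_dim)

def deep_set_pool(per_event_feats, pixel_ids, n_pixels):
    """Sum-pool per-event features into pixel bins.

    Summation is invariant to event ordering, the key deep-set property,
    and yields a dense multi-channel image from a sparse event stream.
    """
    out = np.zeros((n_pixels, per_event_feats.shape[1]))
    np.add.at(out, pixel_ids, per_event_feats)  # unbuffered scatter-add
    return out
```

A usage sketch: encode N events' normalized (x, y, t) coordinates, then pool them into per-pixel feature vectors; permuting the events leaves the pooled field unchanged.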