AI Summary
Event cameras produce sparse, asynchronous, high-temporal-resolution data, which poses significant challenges for modeling structural and motion information. To address this, we propose Fast Feature Field (F³), a continuous spatiotemporal feature representation tailored for downstream vision tasks. F³ implicitly encodes scene geometry and motion dynamics via a forward event-prediction mechanism, and efficiently maps sparse event streams into dense, continuous, multi-channel spatiotemporal feature fields using multi-resolution hash encoding and a deep-set network. The representation preserves event sparsity while ensuring temporal continuity, yielding strong robustness across varying illumination conditions, platforms, and sensor configurations. F³ achieves state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation. The representation can be computed at up to 440 Hz at VGA resolution and 120 Hz at HD resolution, with downstream task predictions at 25–75 Hz at HD resolution, and has been validated on automotive, quadrupedal, and aerial robotic platforms.
Abstract
This paper develops a mathematical argument and algorithms for building representations of data from event-based cameras, which we call Fast Feature Field ($\text{F}^3$). We learn this representation by predicting future events from past events and show that it preserves scene structure and motion information. $\text{F}^3$ exploits the sparsity of event data and is robust to noise and variations in event rates. It can be computed efficiently using ideas from multi-resolution hash encoding and deep sets, achieving 120 Hz at HD and 440 Hz at VGA resolutions. $\text{F}^3$ represents events within a contiguous spatiotemporal volume as a multi-channel image, enabling a range of downstream tasks. We obtain state-of-the-art performance on optical flow estimation, semantic segmentation, and monocular metric depth estimation on data from three robotic platforms (a car, a quadruped robot, and a flying platform), across different lighting conditions (daytime, nighttime), environments (indoor, outdoor, urban, and off-road), and dynamic vision sensors (varying resolutions and event rates). Our implementations can predict these tasks at 25–75 Hz at HD resolution.
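The two ingredients named above, multi-resolution hash encoding and a deep-set (permutation-invariant) aggregation, can be sketched roughly as follows. All names, table sizes, resolutions, and the random (untrained) feature tables are illustrative assumptions for exposition, not the paper's implementation.

```python
import numpy as np

# Large primes for spatial hashing of integer grid cells (as in common
# multi-resolution hash-encoding schemes); values here are illustrative.
PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def hash_encode(coords, n_levels=4, table_size=2**14, feat_dim=2,
                base_res=16, growth=2.0, rng=None):
    """Look up per-level features for (x, y, t) coordinates in [0, 1)^3.

    Each level grids the spatiotemporal volume at a finer resolution,
    hashes the integer cell index, and reads a small feature table
    (random here; learnable in a real model). Only cells that contain
    events are ever touched, so sparsity is preserved.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    tables = [rng.standard_normal((table_size, feat_dim)) * 0.01
              for _ in range(n_levels)]
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        cells = np.floor(coords * res).astype(np.uint64)        # (N, 3)
        idx = np.bitwise_xor.reduce(cells * PRIMES, axis=1) % table_size
        feats.append(table[idx])                                # (N, feat_dim)
    return np.concatenate(feats, axis=1)                        # (N, n_levels*feat_dim)

def deep_set_pool(per_event_feats, pixel_ids, n_pixels):
    """Sum-pool per-event features into pixel bins.

    Summation is invariant to event ordering, the key deep-set property,
    and yields a dense multi-channel image from a sparse event stream.
    """
    out = np.zeros((n_pixels, per_event_feats.shape[1]))
    np.add.at(out, pixel_ids, per_event_feats)  # unbuffered scatter-add
    return out
```

A usage sketch: encode N events' normalized (x, y, t) coordinates, then pool them into per-pixel feature vectors; permuting the events leaves the pooled field unchanged.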