Stabilizing Streaming Video Geometry via Dynamic Feature Normalization

📅 2026-05-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses temporal inconsistency in monocular depth estimation from video streams, which appears mainly as drift in scale and shift and is caused by fluctuations in internal feature statistics. To mitigate it, the authors propose Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that modulates feature statistics over time to stabilize the geometric output. DyFN adds only about 2% to the parameter count and is the only component fine-tuned; the backbone stays frozen, so single-frame accuracy is preserved. The authors trace the instability to the mean and variance of latent features, which determine the scale and shift of the predicted depth. Across four benchmarks, DyFN removes temporal artifacts such as disjointed layering and positional jitter, improves temporal stability by up to 14% over prior streaming methods, and outperforms heavier non-causal video baselines.

📝 Abstract

Consistent 3D geometry estimation from streaming RGB input is crucial for real-world applications such as autonomous driving, embodied AI, and large-scale reconstruction. While modern monocular geometry foundation models achieve strong single-image accuracy, they exhibit severe temporal inconsistency on continuous input, notably dominated by scale-shift drifting. Through targeted empirical analysis, we trace this instability to its root cause: fluctuations in latent feature statistics, whose mean and variance directly determine the predicted depth's scale and shift. Building on this insight, we introduce Dynamic Feature Normalization (DyFN), a lightweight, causal recurrent module that dynamically and robustly modulates feature statistics to maintain stable geometry over time. We adapt powerful pretrained monocular geometry models for streaming by finetuning only DyFN, a mere 2% additional parameters, while keeping the backbone frozen, thereby achieving temporal consistency without compromising single-image accuracy. Extensive experiments across four benchmarks show that DyFN effectively eliminates temporal artifacts such as disjointed layering and positional jitter, and achieves state-of-the-art temporal stability, improving over prior streaming methods by up to 14% and even outperforming heavier non-causal video baselines. Project Page: https://shawlyu.github.io/DyFN
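
The idea of causally stabilizing feature statistics can be illustrated with a toy sketch. This is not the DyFN architecture, which the abstract does not specify. It assumes a plain exponential moving average over per-frame mean and standard deviation; the class name `CausalStatSmoother`, the `momentum` parameter, and the synthetic drift values are all hypothetical.

```python
from statistics import fmean, pstdev


class CausalStatSmoother:
    """Toy causal statistics smoother (an assumption, not the paper's DyFN module)."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum = momentum
        self.eps = eps
        self.mean = None  # running mean carried across frames
        self.std = None   # running standard deviation carried across frames

    def step(self, features):
        """Re-normalize one frame using only the current and past frames."""
        mu = fmean(features)
        sigma = pstdev(features)
        if self.mean is None:
            # First frame: initialize the recurrent state from its statistics.
            self.mean, self.std = mu, sigma
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * mu
            self.std = m * self.std + (1 - m) * sigma
        # Standardize with this frame's statistics, then apply the smoothed ones.
        return [(f - mu) / (sigma + self.eps) * self.std + self.mean for f in features]


def demo():
    """Feed the same features under synthetic per-frame scale and shift changes."""
    base = [0.0, 1.0, 2.0, 3.0, 4.0]
    drifts = [(1.0, 0.0), (1.3, 0.5), (0.8, -0.4), (1.1, 0.2)]  # made-up values
    smoother = CausalStatSmoother(momentum=0.9)
    raw_means, stable_means = [], []
    for scale, shift in drifts:
        frame = [scale * x + shift for x in base]
        out = smoother.step(frame)
        raw_means.append(fmean(frame))
        stable_means.append(fmean(out))
    return raw_means, stable_means
```

In this sketch the per-frame mean of the smoothed features varies far less than that of the raw features, which is the intuition behind keeping the predicted scale and shift stable. A learned recurrent module, as described in the abstract, would replace the fixed moving average.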

Problem

Research questions and friction points this paper is trying to address.

temporal inconsistency

scale-shift drifting

streaming video geometry

monocular depth estimation

feature statistics fluctuation
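
One common way to make scale-shift drift measurable is to fit, for each frame, the scale and shift that best align the predicted depth to a reference, then track how those two numbers change over time. The sketch below uses a closed-form least-squares fit. It is an illustration only, not necessarily the metric used in the paper, and the function name `fit_scale_shift` is hypothetical.

```python
def fit_scale_shift(pred, ref):
    """Return (scale, shift) minimizing sum((scale * pred + shift - ref) ** 2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    # Closed-form simple linear regression of ref on pred.
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p  # assumes pred is not constant
    shift = mean_r - scale * mean_p
    return scale, shift
```

Applied frame by frame, a temporally stable predictor would yield nearly constant `scale` and `shift`, while a drifting one would show them wandering across the sequence.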

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Feature Normalization

Temporal Consistency

Monocular Geometry