PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

📅 2025-10-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Addressing the challenge of balancing temporal consistency and computational efficiency in stereo video depth estimation, this paper proposes the Pick-and-Play Memory Module—a lightweight long-term spatiotemporal memory mechanism. It employs a learnable buffer to dynamically select salient historical frames and adaptively aggregates their features via learned weights, enabling efficient long-range temporal modeling with minimal overhead. Integrated end-to-end with a stereo matching network, the module significantly enhances the stability and coherence of depth sequences. On the Sintel dataset, it achieves TEPE scores of 0.62 (clean) and 1.11 (final), outperforming BiDAStereo by 17.3% and 9.02%, respectively, while reducing FLOPs by 12.4%. The method is particularly suited for latency-sensitive, temporally demanding applications such as augmented reality.
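
The summary reports TEPE without defining it. As a rough illustration, the sketch below assumes the definition common in dynamic stereo work: the mean absolute difference between the predicted and ground-truth frame-to-frame change in disparity. That definition, the function name `tepe`, and the toy disparity values are assumptions for illustration, not taken from the paper.

```python
def tepe(pred_seq, gt_seq):
    """Temporal end-point error under the assumed definition.

    Each sequence is a list of frames; each frame is a flat list of
    per-pixel disparities. The result is the mean of
    |(pred_t - pred_{t-1}) - (gt_t - gt_{t-1})| over pixels and frame pairs.
    """
    total, count = 0.0, 0
    for t in range(1, len(pred_seq)):
        frames = zip(pred_seq[t], pred_seq[t - 1], gt_seq[t], gt_seq[t - 1])
        for p_now, p_prev, g_now, g_prev in frames:
            total += abs((p_now - p_prev) - (g_now - g_prev))
            count += 1
    return total / count if count else 0.0


# Toy data (hypothetical): two pixels tracked over three frames.
gt = [[10.0, 20.0], [11.0, 21.0], [12.0, 22.0]]
# Constant +2 bias: wrong in every frame, but temporally stable.
steady = [[12.0, 22.0], [13.0, 23.0], [14.0, 24.0]]
# Bias alternating +1, -1, +1: smaller per-frame error, but it flickers.
flicker = [[11.0, 21.0], [10.0, 20.0], [13.0, 23.0]]

print(tepe(steady, gt), tepe(flicker, gt))
```

Under this assumed definition the steadily biased prediction scores zero while the flickering one is penalized, which is why a temporal metric is reported alongside per-frame accuracy.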

📝 Abstract

Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty in modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on Sintel clean/final (17.3% and 9.02% improvements over BiDAStereo) at a lower computational cost. Code is available at https://github.com/cocowy1/PPMStereo.

Problem

Research questions and friction points this paper is trying to address.

Achieving temporally consistent depth estimation from stereo video

Modeling long-term temporal consistency with computational efficiency

Resolving the trade-off between limited temporal modeling and high computational cost

Innovation

Methods, ideas, or system contributions that make the work stand out.

Pick-and-Play Memory module for stereo matching

Two-stage process selects and weights relevant frames

Compact memory achieves efficient long-range consistency
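
The two-stage idea listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: PPMStereo learns its selection and weighting, whereas this sketch swaps in cosine similarity for the 'pick' score and a softmax for the 'play' weights. The function names, the `temperature` parameter, and the toy feature vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def pick(query, memory, k):
    """'Pick': indices of the k stored frame features most similar to the query."""
    scores = [cosine(query, m) for m in memory]
    order = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


def play(memory, picked, scores, temperature=1.0):
    """'Play': softmax-weight the picked features and sum them into one vector."""
    logits = [scores[i] / temperature for i in picked]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(memory[picked[0]])
    agg = [sum(w * memory[i][d] for w, i in zip(weights, picked))
           for d in range(dim)]
    return agg, weights


# Toy memory of four past-frame features (hypothetical values).
memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]

picked, scores = pick(query, memory, k=2)
agg, weights = play(memory, picked, scores)
print(picked, weights, agg)
```

Keeping only the top `k` entries bounds the cost of aggregation regardless of how many frames have been seen, which is the efficiency argument the summary makes for a compact memory.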
