PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching

📅 2025-10-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Addressing the challenge of balancing temporal consistency and computational efficiency in stereo video depth estimation, this paper proposes the Pick-and-Play Memory module, a lightweight long-term spatiotemporal memory mechanism. It employs a learnable buffer to dynamically select salient historical frames and adaptively aggregates their features via learned weights, enabling efficient long-range temporal modeling with minimal overhead. Integrated end-to-end with a stereo matching network, the module significantly enhances the stability and coherence of depth sequences. On the Sintel dataset, it achieves TEPE scores of 0.62 (clean) and 1.11 (final), outperforming BiDAStereo by 17.3% and 9.02%, respectively, while reducing FLOPs by 12.4%. The method is particularly suited for latency-sensitive, temporally demanding applications such as augmented reality.
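
The TEPE scores quoted above are a temporal-consistency metric. The summary does not define it, so the following is only a minimal sketch under an assumed, commonly used definition: the mean absolute difference between the frame-to-frame disparity change of the prediction and that of the ground truth. The function name `tepe` and the toy sequences are hypothetical, and the paper's actual evaluation protocol (masking, averaging order) may differ.

```python
def tepe(pred, gt):
    """Temporal end-point error over a sequence of flattened disparity maps.

    pred, gt: lists of frames; each frame is a list of per-pixel disparities.
    Assumed definition (not taken from the paper):
        mean | (pred[t+1] - pred[t]) - (gt[t+1] - gt[t]) |
    """
    if len(pred) != len(gt) or len(pred) < 2:
        raise ValueError("need at least two aligned frames")
    total, count = 0.0, 0
    for t in range(len(pred) - 1):
        for p0, p1, g0, g1 in zip(pred[t], pred[t + 1], gt[t], gt[t + 1]):
            # Compare how much the prediction changed with how much
            # the ground truth changed between consecutive frames.
            total += abs((p1 - p0) - (g1 - g0))
            count += 1
    return total / count


# Toy static scene: the ground-truth disparity never changes.
gt = [[10.0, 20.0], [10.0, 20.0], [10.0, 20.0]]
steady = [[11.0, 21.0], [11.0, 21.0], [11.0, 21.0]]   # constant bias
flicker = [[10.0, 20.0], [11.0, 21.0], [10.0, 20.0]]  # unstable over time

print(tepe(steady, gt))   # a constant per-frame bias is not penalized
print(tepe(flicker, gt))  # frame-to-frame flicker is penalized
```

Under this definition a prediction that is wrong but stable scores zero, while one that is right on average but flickers does not, which is why the metric is reported alongside accuracy.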

📝 Abstract
Temporally consistent depth estimation from stereo video is critical for real-world applications such as augmented reality, where inconsistent depth estimation disrupts the immersion of users. Despite its importance, this task remains challenging due to the difficulty of modeling long-term temporal consistency in a computationally efficient manner. Previous methods attempt to address this by aggregating spatio-temporal information but face a fundamental trade-off: limited temporal modeling provides only modest gains, whereas capturing long-range dependencies significantly increases computational cost. To address this limitation, we introduce a memory buffer for modeling long-range spatio-temporal consistency while achieving efficient dynamic stereo matching. Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo. PPM consists of a 'pick' process that identifies the most relevant frames and a 'play' process that weights the selected frames adaptively for spatio-temporal aggregation. This two-stage collaborative process maintains a compact yet highly informative memory buffer while achieving temporally consistent information aggregation. Extensive experiments validate the effectiveness of PPMStereo, demonstrating state-of-the-art performance in both accuracy and temporal consistency. Notably, PPMStereo achieves 0.62/1.11 TEPE on Sintel clean/final (17.3% and 9.02% improvements over BiDAStereo) at lower computational cost. Code is available at https://github.com/cocowy1/PPMStereo.
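
The two-stage 'pick' and 'play' process can be illustrated with a small sketch. This is not the authors' implementation: dot-product similarity for picking and softmax weighting for playing are stand-ins chosen for illustration, since the abstract does not say how relevance or weights are computed. The names `pick`, `play`, and the toy feature vectors are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pick(query, memory, k):
    """Pick: indices of the k stored features most similar to the query.

    Similarity is a plain dot product here (an assumption, not the paper's score).
    """
    scores = [dot(query, m) for m in memory]
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


def play(memory, picked, scores):
    """Play: softmax-weight the picked features and sum them into one feature."""
    top = max(scores[i] for i in picked)
    exps = [math.exp(scores[i] - top) for i in picked]  # shifted for stability
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(memory[picked[0]])
    fused = [
        sum(w * memory[i][d] for w, i in zip(weights, picked))
        for d in range(dim)
    ]
    return fused, weights


# Toy example: one current-frame feature and four stored frame features.
query = [1.0, 0.0]
memory = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

picked, scores = pick(query, memory, k=2)
fused, weights = play(memory, picked, scores)
print(picked)   # frames 0 and 2 are the closest to the query
print(weights)  # the closer frame receives the larger weight
print(fused)
```

Only the `k` picked frames enter the aggregation, so its cost depends on `k` rather than on how many frames have been seen.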
Problem

Research questions and friction points this paper is trying to address.

Achieving temporally consistent depth estimation from stereo video
Modeling long-term temporal consistency with computational efficiency
Resolving the trade-off between limited temporal modeling and high computational cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pick-and-Play Memory module for stereo matching
Two-stage process selects and weights relevant frames
Compact memory achieves efficient long-range consistency
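
To make the idea of a compact memory concrete, here is a minimal sketch of a fixed-capacity buffer that processes a frame stream and evicts the lowest-scoring entry when full. It is an assumption-laden illustration, not the paper's module: the class `MemoryBuffer`, the eviction rule, and the externally supplied relevance scores are all hypothetical.

```python
class MemoryBuffer:
    """Fixed-capacity frame memory that evicts the lowest-scoring entry when full.

    The relevance score is passed in by the caller; how PPMStereo actually
    scores frames is not described in this summary.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = []  # list of (frame_index, score, feature)

    def update(self, frame_index, score, feature):
        self.entries.append((frame_index, score, feature))
        if len(self.entries) > self.capacity:
            # Drop the least relevant entry so the buffer size stays bounded.
            worst = min(range(len(self.entries)),
                        key=lambda i: self.entries[i][1])
            self.entries.pop(worst)

    def frame_indices(self):
        return [entry[0] for entry in self.entries]


# Toy stream of six frames with made-up relevance scores.
buffer = MemoryBuffer(capacity=3)
stream_scores = [0.2, 0.9, 0.1, 0.8, 0.5, 0.95]
for t, score in enumerate(stream_scores):
    buffer.update(t, score, feature=[float(t)])

print(buffer.frame_indices())  # [1, 3, 5]
```

Because the buffer never holds more than `capacity` frames, per-frame work stays constant as the video grows, while high-scoring frames from far in the past can still be retained.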
Authors

Yun Wang (City University of Hong Kong)
Junjie Hu (The Chinese University of Hong Kong, Shenzhen)
Qiaole Dong (Fudan University): Computer Vision
Yongjian Zhang (Shenzhen Campus, Sun Yat-sen University)
Yanwei Fu (Fudan University): Computer Vision, Machine Learning, Multimedia
Tin Lun Lam (The Chinese University of Hong Kong, Shenzhen)
Dapeng Wu (Chongqing University of Posts and Telecommunications): Wireless Network, Social Computing