🤖 AI Summary
Local stereo matching methods lack global consistency, while global methods suffer from excessive computational cost. To address both limitations, this paper proposes an efficient depth estimation model that generalizes across resolutions and disparity ranges without dataset-specific fine-tuning. The method introduces a multi-resolution Transformer architecture with sparse attention that significantly reduces memory consumption, together with a loss function that concentrates probability on feasible matches and enables joint estimation of disparity, occlusion, and confidence maps; it eliminates both cost volume filtering and deep refinement networks. Evaluated on the Middlebury v3 and ETH3D benchmarks, the model achieves state-of-the-art accuracy, substantially outperforming existing approaches, while maintaining high computational efficiency and superior detail reconstruction.
📝 Abstract
The pursuit of a generalizable stereo matching model, capable of performing across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. On the other hand, global matching architectures, while theoretically more robust, have been historically rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with $S^2M^2$: a global matching architecture that achieves both state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. $S^2M^2$ establishes a new state of the art on the Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods across most metrics while reconstructing high-quality details with competitive efficiency.
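The abstract describes a loss that "concentrates probability on feasible matches" over disparity candidates. The paper's exact formulation is not given here; as a minimal illustrative sketch under that assumption, one can treat the network output as a softmax distribution over discrete disparity candidates and penalize the probability mass falling outside a small window around the ground-truth disparity. The function name `feasible_match_loss` and the `radius` parameter are hypothetical, not from the paper.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the candidate axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feasible_match_loss(logits, gt_disp, radius=1.0):
    """Hypothetical probability-concentration loss (not the paper's actual form).

    logits  : (N, D) matching scores over D integer disparity candidates per pixel.
    gt_disp : (N,) ground-truth disparities.
    radius  : candidates within this distance of the ground truth count as feasible.

    Returns the mean negative log of the probability mass assigned to
    feasible candidates, so the loss is minimized when the predicted
    distribution concentrates near the true disparity.
    """
    n_pixels, n_cands = logits.shape
    cands = np.arange(n_cands)  # candidate disparities 0..D-1
    # Boolean mask of feasible candidates per pixel.
    feasible = np.abs(cands[None, :] - gt_disp[:, None]) <= radius
    p = softmax(logits)
    # Total probability mass on feasible candidates; clip to avoid log(0).
    p_feasible = (p * feasible).sum(axis=1)
    return -np.log(np.clip(p_feasible, 1e-12, None)).mean()
```

A distribution sharply peaked at the true disparity yields a near-zero loss, while a flat distribution is penalized in proportion to how little mass lands inside the feasible window; occlusion and confidence terms would be additional heads in the joint objective.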