AI Summary
Conventional decoders for dense prediction tasks suffer from outdated architectural designs and insufficient cross-layer contextual sharing, limiting feature propagation efficiency and spatial consistency. Method: We propose a novel decoder architecture centered on a learnable, shared "bank": a parameterized module dynamically resampled and fused across multiple scales to enable explicit cross-layer contextual reuse during decoding, thereby departing from traditional serial, layer-wise independent decoding paradigms. Built upon a Transformer backbone, the bank is jointly optimized end-to-end. Contribution/Results: Our approach significantly improves decoding efficiency and spatial coherence. On both natural and synthetic image depth estimation benchmarks, it substantially outperforms state-of-the-art methods, achieving superior accuracy and generalization under large-scale training. To our knowledge, this work presents the first systematic design and empirical validation of a universal, decoder-level contextual sharing mechanism.
Abstract
Encoder architectures for dense prediction tasks have grown steadily more complex; decoders, however, have remained largely the same. They rely on individual blocks that decode intermediate feature maps sequentially. We introduce banks, shared structures that each decoding block uses to obtain additional context during decoding. Applied via resampling and feature fusion, these structures improve depth estimation performance for state-of-the-art transformer-based architectures on natural and synthetic images when trained on large-scale datasets.
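To make the idea of a shared bank concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single-channel bank stored as a 2D grid, nearest-neighbour resampling, and element-wise addition as the fusion operator; the abstract specifies none of these choices, and the names `resample_nearest`, `fuse_add`, and `decode_with_bank` are hypothetical. The bank here is a fixed grid, whereas in the described method it would be learned jointly with the network.

```python
from typing import List

Grid = List[List[float]]  # single-channel feature map, rows x cols


def resample_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resampling of a 2D grid to (out_h, out_w).

    Assumption: the abstract only says "resampling"; nearest-neighbour is
    chosen here purely for simplicity.
    """
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def fuse_add(features: Grid, context: Grid) -> Grid:
    """Fuse two equally sized grids by element-wise addition.

    Assumption: addition stands in for the unspecified feature fusion.
    """
    return [
        [f + b for f, b in zip(f_row, b_row)]
        for f_row, b_row in zip(features, context)
    ]


def decode_with_bank(stage_features: List[Grid], bank: Grid) -> List[Grid]:
    """Give every decoding stage the same bank, resampled to that stage's size."""
    outputs = []
    for feats in stage_features:
        h, w = len(feats), len(feats[0])
        outputs.append(fuse_add(feats, resample_nearest(bank, h, w)))
    return outputs


# Toy example: one 2x2 bank shared by a 2x2 stage and a 4x4 stage.
bank = [[1.0, 2.0], [3.0, 4.0]]
coarse = [[0.0, 0.0], [0.0, 0.0]]
fine = [[0.0] * 4 for _ in range(4)]
coarse_out, fine_out = decode_with_bank([coarse, fine], bank)
```

The point of the sketch is that both stages read from the same `bank` object rather than holding separate context, which is the sharing the abstract describes. A real decoder would operate on multi-channel tensors and pass each stage's output on to the next.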