🤖 AI Summary
Existing image restoration Transformers suffer from degraded training convergence and generalization because their attention mechanisms break translation equivariance. To address this, we propose the Translation Equivariance Adaptive Transformer (TEAFormer), which introduces an adaptive sliding-index mechanism that dynamically selects key-value pairs for each query and concatenates them with globally aggregated key-value pairs. This preserves strict translation equivariance while avoiding the trade-off between the high computational cost of self-attention and the fixed receptive field of sliding-window attention. TEAFormer further combines sliding-window attention with stackable translation-equivariant components to form an efficient, scalable equivariant architecture. Experiments across diverse image restoration tasks show that TEAFormer accelerates training convergence (1.8× faster on average) and achieves state-of-the-art generalization under cross-dataset evaluation, underscoring the importance of explicit translation-equivariant modeling for low-level vision.
📝 Abstract
Translation equivariance is a fundamental inductive bias in image restoration, ensuring that translated inputs produce translated outputs. Attention mechanisms in modern restoration transformers undermine this property, adversely impacting both training convergence and generalization. To alleviate this issue, we propose two key strategies for incorporating translation equivariance: slide indexing and component stacking. Slide indexing maintains operator responses at fixed positions, with sliding window attention being a notable example, while component stacking enables the arrangement of translation-equivariant operators in parallel or sequentially, thereby building complex architectures while preserving translation equivariance. However, these strategies still create a dilemma in model design between the high computational cost of self-attention and the fixed receptive field associated with sliding window attention. To address this, we develop an adaptive sliding indexing mechanism to efficiently select key-value pairs for each query, which are then concatenated in parallel with globally aggregated key-value pairs. The designed network, called the Translation Equivariance Adaptive Transformer (TEAFormer), is assessed across a variety of image restoration tasks. The results highlight its superiority in terms of effectiveness, training convergence, and generalization.
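The slide-indexing idea can be illustrated with a minimal sketch. The snippet below is a simplified 1-D stand-in for the mechanism described above, not the paper's implementation: each query attends to a sliding window of key-value pairs (slide indexing), concatenated with a single mean-pooled key-value pair standing in for the globally aggregated branch. All function names and the mean-pooling choice are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sliding_window_attention(q, k, v, radius=1):
    """Toy 1-D sliding-window attention with a global key-value pair.

    q, k, v: arrays of shape (L, d). Each query attends to keys within
    +/- radius of its own position (slide indexing keeps the operator's
    response tied to a fixed relative neighborhood), plus one globally
    mean-pooled key-value pair (a crude stand-in for the globally
    aggregated branch described in the abstract).
    """
    L, d = q.shape
    gk, gv = k.mean(axis=0), v.mean(axis=0)  # global aggregate (assumed)
    out = np.empty_like(v)
    for i in range(L):
        lo, hi = max(0, i - radius), min(L, i + radius + 1)
        ks = np.vstack([k[lo:hi], gk])  # windowed keys + global key
        vs = np.vstack([v[lo:hi], gv])  # windowed values + global value
        w = softmax(q[i] @ ks.T / np.sqrt(d))
        out[i] = w @ vs
    return out
```

Away from the boundaries, shifting the input sequence shifts the output by the same amount (the windowed responses depend only on relative positions, and mean pooling is invariant under a circular shift), which is the translation-equivariance property the abstract argues vanilla global attention lacks.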