🤖 AI Summary
Existing salient object detection (SOD) methods rely on complex multi-stage architectures that suffer from feature redundancy and cross-module interference, hindering performance. In contrast, the human visual system achieves efficient saliency perception through elegant, biologically grounded mechanisms. To bridge this gap, we propose DualGazeNet, the first pure Transformer-based SOD model inspired by the dual visual pathways (magnocellular and parvocellular) of biological vision. DualGazeNet eliminates dedicated fusion modules and multi-stage designs, instead achieving multi-scale feature integration and precise boundary localization via dual attention queries and cortical attention modulation. Evaluated on five mainstream RGB benchmarks, it outperforms 25 state-of-the-art methods while delivering about 60% higher inference speed and 53.4% fewer FLOPs than comparable Transformer baselines, and it generalizes strongly across domains, e.g., to camouflaged and underwater scenes. DualGazeNet thus delivers superior accuracy, efficiency, and interpretability in a unified, biologically plausible framework.
📝 Abstract
Recent salient object detection (SOD) methods aim to improve performance in four key directions: semantic enhancement, boundary refinement, auxiliary task supervision, and multi-modal fusion. In pursuit of continuous gains, these approaches have evolved toward increasingly sophisticated architectures with multi-stage pipelines, specialized fusion modules, edge-guided learning, and elaborate attention mechanisms. However, this complexity paradoxically introduces feature redundancy and cross-component interference that obscure salient cues, ultimately reaching performance bottlenecks. In contrast, human vision achieves efficient salient object identification without such architectural complexity. This contrast raises a fundamental question: can we design a biologically grounded yet architecturally simple SOD framework that dispenses with most of this engineering complexity, while achieving state-of-the-art accuracy, computational efficiency, and interpretability? In this work, we answer this question affirmatively by introducing DualGazeNet, a biologically inspired pure Transformer framework that models the dual biological principles of robust representation learning and magnocellular-parvocellular dual-pathway processing with cortical attention modulation in the human visual system. Extensive experiments on five RGB SOD benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods. On average, DualGazeNet achieves about 60% higher inference speed and 53.4% fewer FLOPs than four Transformer-based baselines of similar capacity (VST++, MDSAM, Sam2unet, and BiRefNet). Moreover, DualGazeNet exhibits strong cross-domain generalization, achieving leading or highly competitive performance on camouflaged and underwater SOD benchmarks without relying on additional modalities.
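The core mechanism described above — two pathway-specific attention queries (magnocellular and parvocellular) pooling a shared feature representation in place of dedicated fusion modules — can be illustrated with a minimal sketch. The paper does not publish this exact formulation here, so the function names, shapes, and scaled dot-product pooling below are illustrative assumptions, not DualGazeNet's actual implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(features, query):
    """Pool a list of d-dim feature vectors by scaled dot-product with one query."""
    d = len(query)
    scores = [sum(f[i] * query[i] for i in range(d)) / math.sqrt(d) for f in features]
    w = softmax(scores)
    return [sum(w[n] * features[n][i] for n in range(len(features))) for i in range(d)]

def dual_query_attention(features, q_magno, q_parvo):
    """Two pathway queries (hypothetical names) read the same shared tokens:
    one query can specialize toward coarse/global structure, the other toward
    fine detail, without any explicit cross-pathway fusion module."""
    return attend(features, q_magno), attend(features, q_parvo)

# Toy usage: each query emphasizes different tokens of the shared features.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
magno, parvo = dual_query_attention(feats, [2.0, 0.0], [0.0, 2.0])
print(magno, parvo)
```

The design point this sketches is the one the abstract makes: both pathways operate on one shared representation, so "fusion" reduces to how each query weights the same tokens rather than a separate multi-stage merging module.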