Rapid Salient Object Detection with Difference Convolutional Neural Networks

📅 2025-07-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-time salient object detection (SOD) on resource-constrained devices faces a fundamental trade-off between accuracy and efficiency. To address this, we propose two lightweight networks: SDNet for images and STDNet for videos. Our key innovations include (1) pixel-wise differential convolution (PDC) and differential convolution reparameterization (DCR), which integrate multi-scale contrastive priors at zero inference overhead; and (2) spatio-temporal differential convolution (STDC), which explicitly models inter-frame temporal consistency in video sequences. Both models contain fewer than 1 million parameters and achieve 46 FPS (image) and 150 FPS (video) on a Jetson Orin platform—more than twice the speed of state-of-the-art lightweight alternatives—while maintaining superior accuracy. To our knowledge, this is the first work to jointly optimize high frame rate, high accuracy, and low power consumption for real-time SOD on edge devices.

📝 Abstract
This paper addresses the challenge of deploying salient object detection (SOD) on resource-constrained devices with real-time performance. While recent advances in deep neural networks have improved SOD, existing top-performing models are computationally expensive. We propose an efficient network design that combines traditional wisdom on SOD with the representation power of modern CNNs. Like classical biologically inspired SOD methods, which rely on contrast cues to determine the saliency of image regions, our model leverages Pixel Difference Convolutions (PDCs) to encode feature contrasts. Unlike those methods, our PDCs are incorporated into a CNN architecture so that the valuable contrast cues are extracted from rich feature maps. For efficiency, we introduce a difference convolution reparameterization (DCR) strategy that embeds PDCs into standard convolutions, eliminating their computation and parameters at inference. Additionally, we introduce SpatioTemporal Difference Convolution (STDC) for video SOD, enhancing the standard 3D convolution with spatiotemporal contrast capture. Our models, SDNet for image SOD and STDNet for video SOD, achieve significant improvements in the efficiency-accuracy trade-off. On a Jetson Orin device, our models with $<$ 1M parameters operate at 46 FPS and 150 FPS on streamed images and videos, surpassing the second-best lightweight models in our experiments by more than $2\times$ and $3\times$ in speed with superior accuracy. Code will be available at https://github.com/hellozhuo/stdnet.git.
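The core trick behind DCR can be illustrated concretely. A central pixel-difference convolution applies each kernel weight to the difference between a neighbor pixel and the patch center; algebraically, this equals an ordinary convolution whose center tap has been shifted by the kernel sum, so the PDC can be folded into a standard kernel at inference time with no extra cost. The sketch below is a minimal single-channel NumPy illustration of this identity, not the paper's implementation; the function names (`central_pdc`, `reparameterize`) are our own.

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation of an image with a 3x3 kernel."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

def central_pdc(x, k):
    """Central pixel-difference convolution: each weight acts on the
    difference between a neighbor pixel and the patch center."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = x[i:i + 3, j:j + 3]
            out[i, j] = np.sum((patch - patch[1, 1]) * k)
    return out

def reparameterize(k):
    """Fold the central PDC into an ordinary kernel by subtracting the
    kernel sum from the center tap (the DCR idea, in its simplest form)."""
    k2 = k.copy()
    k2[1, 1] -= k.sum()
    return k2

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))

# The reparameterized standard conv reproduces the PDC output exactly:
# sum_i k_i * (x_i - x_c) = sum_i k_i * x_i - x_c * sum(k).
assert np.allclose(central_pdc(x, k), conv2d(x, reparameterize(k)))
```

Because the fold happens purely in the weights, training can use the difference form while deployment runs a plain convolution, which is how the inference-time overhead goes to zero.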
Problem

Research questions and friction points this paper is trying to address.

Efficient salient object detection on resource-limited devices
Reducing computational cost while maintaining accuracy
Real-time performance for both image and video SOD
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Pixel Difference Convolutions for feature contrasts
Employs Difference Convolution Reparameterization for efficiency
Introduces SpatioTemporal Difference Convolution for video
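The paper's exact STDC formulation is not reproduced here; as a rough illustration, assuming it generalizes the central-difference idea to a 3D kernel spanning time as well as space, each weight would act on the difference between a voxel and the spatiotemporal center of the patch, capturing contrast across frames as well as within them. The same reparameterization identity then carries over. All names below (`stdc`, `reparam3d`) are hypothetical.

```python
import numpy as np

def conv3d(x, k):
    """Valid-mode 3D cross-correlation of a (T, H, W) clip with a 3D kernel."""
    T, H, W = x.shape
    t, h, w = k.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[a, b, c] = np.sum(x[a:a + t, b:b + h, c:c + w] * k)
    return out

def stdc(x, k):
    """Sketch of a spatio-temporal difference convolution: weights act on
    differences from the spatiotemporal center voxel of each patch."""
    T, H, W = x.shape
    t, h, w = k.shape
    ct, ch, cw = t // 2, h // 2, w // 2
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for a in range(out.shape[0]):
        for b in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = x[a:a + t, b:b + h, c:c + w]
                out[a, b, c] = np.sum((patch - patch[ct, ch, cw]) * k)
    return out

def reparam3d(k):
    """Fold the 3D difference form into an ordinary 3D kernel, as in DCR."""
    t, h, w = k.shape
    k2 = k.copy()
    k2[t // 2, h // 2, w // 2] -= k.sum()
    return k2

rng = np.random.default_rng(1)
v = rng.standard_normal((5, 8, 8))   # a short single-channel video clip
k3 = rng.standard_normal((3, 3, 3))
assert np.allclose(stdc(v, k3), conv3d(v, reparam3d(k3)))
```

Under this reading, temporal consistency comes from the kernel seeing frame-to-frame differences directly rather than only absolute intensities.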