🤖 AI Summary
To address poor occlusion robustness, semantic ambiguity, and class imbalance caused by static fusion and class-agnostic attention in dynamic scenes, this paper proposes the Dynamic Class-Aware Fusion Network (DyCAF-Net). Methodologically, DyCAF-Net introduces: (1) a feature-enhancement neck employing input-conditioned equilibrium modeling and implicit fixed-point iterative optimization; (2) a dual-path channel-spatial dynamic attention mechanism that jointly models the input distribution and class semantics; and (3) class-aware adaptive feature modulation that explicitly strengthens representations of rare classes. The architecture is plug-and-play compatible with mainstream detectors (e.g., YOLOv8) and supports end-to-end training. Evaluated on 13 benchmark datasets, DyCAF-Net surpasses nine state-of-the-art methods, achieving substantial gains in mAP@50 and mAP@50-95 while maintaining only 11.1M parameters, demonstrating a strong balance among accuracy, computational efficiency, and ease of deployment.
📝 Abstract
Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce the Dynamic Class-Aware Fusion Network (DyCAF-Net), which addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency (~11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.
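To make the first two ideas concrete, the following is a minimal NumPy sketch, not the paper's implementation: a dual channel-spatial attention pass (global-pooled channel gating followed by a per-pixel saliency map) and a DEQ-style fixed-point refinement loop `z <- f(z, x)` as used by equilibrium necks. All function names, weight shapes, and the toy update `f` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_dynamic_attention(feat, w_ch, w_sp):
    """Sketch of input-conditioned channel + spatial recalibration.

    feat : (C, H, W) feature map
    w_ch : (C, C) channel-gating weights (stand-in for a learned MLP)
    w_sp : (1, C) spatial-gating weights (stand-in for a learned 1x1 conv)
    """
    C, H, W = feat.shape
    # Channel path: global average pooling -> gate -> per-channel rescale.
    ch_desc = feat.mean(axis=(1, 2))                # (C,)
    ch_gate = sigmoid(w_ch @ ch_desc)               # (C,)
    out = feat * ch_gate[:, None, None]
    # Spatial path: collapse channels into one saliency map -> per-pixel rescale.
    sp_map = sigmoid((w_sp @ out.reshape(C, H * W)).reshape(H, W))
    return out * sp_map[None, :, :]

def fixed_point_neck(x, f, iters=50, tol=1e-6):
    """Iterate z <- f(z, x) toward an equilibrium (implicit fixed-point style)."""
    z = np.zeros_like(x)
    for _ in range(iters):
        z_next = f(z, x)
        if np.max(np.abs(z_next - z)) < tol:
            break
        z = z_next
    return z

# Toy usage: a contraction f(z, x) = 0.5 * (z + x) has the fixed point z* = x.
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
w_ch = 0.1 * rng.standard_normal((4, 4))
w_sp = 0.1 * rng.standard_normal((1, 4))
attended = dual_dynamic_attention(feat, w_ch, w_sp)
refined = fixed_point_neck(feat, lambda z, x: 0.5 * (z + x))
```

In the paper's full design these blocks would be learned end-to-end and conditioned on class cues as well; the sketch only shows the data flow (pool, gate, rescale; iterate to equilibrium) that the abstract describes.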