AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

📅 2026-05-10

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Conventional object detectors have a fixed depth, which ties each model to a single accuracy–efficiency operating point. The authors propose an any-depth detection framework in which a single network covers a continuous range of accuracy–efficiency trade-offs by adjusting its depth at inference time, with no retraining. Each backbone and neck stage is decomposed into an essential path that always executes and a skippable refinement path, so the full multi-scale feature hierarchy is preserved at every depth configuration. Training uses self-distillation between only the two extreme sub-networks, with prediction-level and feature-level alignment losses that enforce stage-wise modularity. Implemented on RT-DETR and YOLOv12, the full-depth configurations match or surpass their respective state-of-the-art baselines with negligible parameter overhead, while the most efficient configurations reach up to a 1.82× speedup at a cost of only 2.0 AP.
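The split into an always-executed path and a skippable refinement path can be illustrated with a toy sketch. This is not the authors' implementation: `AnyDepthStage`, `run_network`, and `scale_layer` are hypothetical names, the "layers" are plain functions on lists of floats, and adding the refinement output residually is an assumption made so that skipping it leaves the feature shape unchanged.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def scale_layer(factor: float) -> Layer:
    """Toy stand-in for a learned block: multiplies every feature by `factor`."""
    return lambda x: [factor * v for v in x]


class AnyDepthStage:
    """One stage with an always-run path and a skippable refinement path (hypothetical)."""

    def __init__(self, essential: Layer, refinement: Sequence[Layer]):
        self.essential = essential
        self.refinement = list(refinement)

    def forward(self, x: Vector, use_refinement: bool) -> Vector:
        # The essential path runs at every depth configuration.
        out = self.essential(x)
        if use_refinement:
            for block in self.refinement:
                # Assumption: refinement is added residually, so the output
                # keeps the same shape whether or not it is skipped.
                out = [o + r for o, r in zip(out, block(out))]
        return out


def run_network(stages: Sequence[AnyDepthStage], x: Vector,
                depth_config: Sequence[bool]) -> List[Vector]:
    """Run all stages; depth_config[i] says whether stage i uses its refinement path."""
    if len(depth_config) != len(stages):
        raise ValueError("depth_config needs one flag per stage")
    features = []
    for stage, flag in zip(stages, depth_config):
        x = stage.forward(x, flag)
        features.append(x)  # every stage still emits a feature level
    return features


stages = [
    AnyDepthStage(scale_layer(2.0), [scale_layer(0.5)]),
    AnyDepthStage(scale_layer(1.0), [scale_layer(1.0)]),
]
shallow = run_network(stages, [1.0, 2.0], [False, False])
full = run_network(stages, [1.0, 2.0], [True, True])
```

In this sketch the shallowest and the full-depth configurations both return one feature per stage, which mirrors the claim that no stage is discarded when depth is reduced; only the amount of refinement changes.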

📝 Abstract

Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy–efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. Training such a network is difficult because jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to 1.82× speedup at a cost of only 2.0 AP, all from a single set of weights.
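The abstract does not give the form of the alignment losses, so the following is only a rough sketch of what self-distillation between the two extreme sub-networks could look like. It assumes the full-depth output acts as the teacher, uses a KL divergence on softmax scores as the prediction-level term, and uses a mean squared error per stage as the feature-level term. All function names and the weights `w_pred` and `w_feat` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def self_distillation_loss(full_logits: Sequence[float],
                           shallow_logits: Sequence[float],
                           full_feats: Sequence[Sequence[float]],
                           shallow_feats: Sequence[Sequence[float]],
                           w_pred: float = 1.0,
                           w_feat: float = 1.0) -> float:
    """Align the shallowest sub-network with the full-depth one (assumed teacher)."""
    # Prediction-level alignment (assumed form: KL on class distributions).
    pred_term = kl_divergence(softmax(full_logits), softmax(shallow_logits))
    # Feature-level alignment (assumed form: MSE averaged over stages).
    feat_term = sum(mse(f, s) for f, s in zip(full_feats, shallow_feats)) / len(full_feats)
    return w_pred * pred_term + w_feat * feat_term


# Identical outputs give zero loss; diverging outputs give a positive loss.
same = self_distillation_loss([1.0, 2.0], [1.0, 2.0], [[1.0, 2.0]], [[1.0, 2.0]])
diff = self_distillation_loss([2.0, 0.0], [0.0, 2.0], [[1.0, 2.0]], [[1.0, 4.0]])
```

Because only the two extremes are compared, the number of loss terms stays fixed no matter how many intermediate depth configurations the network supports. In a real training loop the teacher outputs would typically be detached from the gradient, which this plain-Python sketch does not model.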

Problem

Research questions and friction points this paper is trying to address.

any-depth

object detection

single network

accuracy-efficiency trade-off

dynamic depth

Innovation

Methods, ideas, or system contributions that make the work stand out.

any-depth detection

dynamic inference

self-distillation