🤖 AI Summary
To address poor feature consistency and low computational efficiency in multi-scale object detection for autonomous driving, this paper proposes Butter, a detection framework. Butter introduces (1) a frequency-adaptive filtering module that refines the structural and boundary consistency of multi-scale features in the frequency domain, and (2) a progressive hierarchical feature fusion network that narrows semantic gaps between feature levels and strengthens cross-level representation. The architecture combines CNN and Transformer components and is trained end-to-end on BDD100K, KITTI, and Cityscapes. Experiments show that Butter improves accuracy (+3.2% mAP) while retaining real-time inference, with lower model complexity (18% fewer FLOPs) and a smaller parameter count. The source code is publicly available.
📝 Abstract
Hierarchical feature representations play a pivotal role in computer vision, particularly in object detection for autonomous driving. Multi-level semantic understanding is crucial for accurately identifying pedestrians, vehicles, and traffic signs in dynamic environments. However, existing architectures, such as YOLO and DETR, struggle to maintain feature consistency across different scales while balancing detection precision and computational efficiency. To address these challenges, we propose Butter, a novel object detection framework designed to enhance hierarchical feature representations and thereby improve detection robustness. Specifically, Butter introduces two key innovations: the Frequency-Adaptive Feature Consistency Enhancement (FAFCE) component, which refines multi-scale feature consistency by using adaptive frequency filtering to improve structural and boundary precision, and the Progressive Hierarchical Feature Fusion Network (PHFFNet) module, which progressively integrates multi-level features to mitigate semantic gaps and strengthen hierarchical feature learning. In extensive experiments on BDD100K, KITTI, and Cityscapes, Butter demonstrates superior feature representation capabilities, yielding notable improvements in detection accuracy while reducing model complexity. By focusing on hierarchical feature refinement and integration, Butter balances accuracy, deployability, and computational efficiency in real-time autonomous driving scenarios. Our model and implementation are publicly available at https://github.com/Aveiro-Lin/Butter, facilitating further research and validation within the autonomous driving community.