🤖 AI Summary
Traditional object detection is constrained by predefined categories, limiting generalization to unknown objects. While open-world object detection (OWOD) and open-vocabulary object detection (OVOD) improve flexibility, OWOD lacks semantic labels for unknown classes, and OVOD relies on handcrafted prompts, compromising autonomy. This paper proposes LAOD, the first framework to decouple object localization from semantic naming: a large language model (LLM) autonomously generates scene-aware zero-shot class names, which are then used by an open-vocabulary detector for localization. To evaluate this paradigm, the authors introduce two novel metrics—Class-Agnostic Average Precision (CAAP) for localization accuracy and Semantic Naming Average Precision (SNAP) for naming fidelity. Experiments on LVIS, COCO, and COCO-OOD demonstrate that LAOD significantly improves end-to-end detection and interpretable naming of unknown objects, enhancing autonomous adaptability in open-world environments.
📝 Abstract
Object detection traditionally relies on fixed category sets, requiring costly re-training to handle novel objects. While Open-World and Open-Vocabulary Object Detection (OWOD and OVOD) improve flexibility, OWOD lacks semantic labels for unknowns, and OVOD depends on user prompts, limiting autonomy. We propose an LLM-guided agentic object detection (LAOD) framework that enables fully label-free, zero-shot detection by prompting a Large Language Model (LLM) to generate scene-specific object names. These are passed to an open-vocabulary detector for localization, allowing the system to adapt its goals dynamically. We introduce two new metrics, Class-Agnostic Average Precision (CAAP) and Semantic Naming Average Precision (SNAP), to separately evaluate localization and naming. Experiments on LVIS, COCO, and COCO-OOD validate our approach, showing strong performance in detecting and naming novel objects. Our method offers enhanced autonomy and adaptability for open-world understanding.
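The two-stage pipeline the abstract describes (an LLM proposes scene-specific class names, then an open-vocabulary detector localizes them) can be sketched as below. This is a minimal illustration, not the paper's implementation: `propose_class_names` and `open_vocab_detect` are hypothetical stand-in stubs for the LLM and OVOD components, and all names and outputs here are assumptions for illustration only.

```python
def propose_class_names(scene_context: str) -> list[str]:
    """Stub for the LLM step: autonomously propose zero-shot class
    names for the scene, with no user-supplied prompt or fixed label set.
    A real system would query an instruction-tuned LLM here."""
    return ["dog", "frisbee", "tree"]  # placeholder output

def open_vocab_detect(image, class_names: list[str]) -> list[dict]:
    """Stub for the open-vocabulary detector: localize boxes for the
    proposed names. A real detector would score region proposals against
    text embeddings of the class names."""
    return [{"label": name, "box": (0, 0, 10, 10), "score": 0.9}
            for name in class_names]

def laod_detect(image, scene_context: str) -> list[dict]:
    """Label-free detection: naming is decoupled from localization,
    so the system adapts its detection goals to the scene on its own."""
    names = propose_class_names(scene_context)
    return open_vocab_detect(image, names)

detections = laod_detect(image=None, scene_context="a dog catching a frisbee")
print([d["label"] for d in detections])
```

Under this decoupling, CAAP would score the quality of the returned boxes irrespective of their labels, while SNAP would additionally score whether each box's label names the object correctly.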