🤖 AI Summary
Problem: Manual architecture design for object detection is inefficient, neural architecture search (NAS) incurs prohibitive computational overhead, and existing large language model (LLM)-based methods act merely as iterative optimizers without capturing intrinsic data characteristics. Method: This paper proposes the first end-to-end, LLM-driven architecture generation framework grounded in the "first principles" of the data. It extracts meta-features, such as object scale distribution and scene density, and leverages retrieval-augmented generation (RAG)-enhanced LLM reasoning to directly synthesize executable Neural Architecture Description Language (NADL) code, which a dedicated compiler turns into deployable models. Contribution/Results: The framework eliminates traditional search loops and black-box tuning, establishing a closed-loop pipeline: meta-feature analysis → LLM-based generation → compilation. Evaluated on five mainstream detection benchmarks, the generated architectures outperform strong baselines (e.g., YOLOv8/v10) with significantly fewer parameters. Ablation studies confirm that data-driven LLM reasoning is the key driver of the performance gains.
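The closed-loop pipeline above (meta-feature analysis → LLM-based generation → compilation) can be sketched as three chained functions. This is a minimal illustrative skeleton, not the paper's interface: the function names (`generate_nadl`, `compile_nadl`), the dict-based NADL grammar, and the stride heuristic are all hypothetical, and the RAG-augmented LLM call is stubbed with a fixed rule for demonstration.

```python
def analyze(meta_features):
    # Stage 1 (meta-feature analysis) is assumed to have produced this dict;
    # it is passed through unchanged in this sketch.
    return meta_features

def generate_nadl(meta):
    """Stub for the RAG-augmented LLM: emits NADL-like layer descriptions."""
    layers = [{"op": "backbone", "depth": 3}]
    # Hypothetical rule of thumb: a small-object-heavy dataset gets a
    # finer-stride detection head; otherwise a coarser one suffices.
    if meta["scale_distribution"]["small"] > 0.5:
        layers.append({"op": "head", "stride": 4})
    else:
        layers.append({"op": "head", "stride": 8})
    return layers

def compile_nadl(nadl):
    """Stub compiler: instantiates each NADL node into a 'module' label."""
    return [f"{n['op']}@{n.get('stride', n.get('depth'))}" for n in nadl]

# A dataset dominated by small objects yields a stride-4 head.
meta = {"scale_distribution": {"small": 0.7, "medium": 0.2, "large": 0.1}}
model = compile_nadl(generate_nadl(analyze(meta)))
```

The point of the skeleton is the absence of a search loop: the architecture is produced in one forward pass over the data description, rather than by iterative trial-and-evaluate cycles.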
📝 Abstract
Designing high-performance object detection architectures is a complex task: traditional manual design is time-consuming and labor-intensive, and Neural Architecture Search (NAS) is computationally prohibitive. While recent approaches using Large Language Models (LLMs) show promise, they often function as iterative optimizers within a search loop rather than generating architectures directly from a holistic understanding of the data. To address this gap, we propose Cognitive-YOLO, a novel framework for LLM-driven architecture synthesis that generates network configurations directly from the intrinsic characteristics of the dataset. Our method consists of three stages: first, an analysis module extracts key meta-features (e.g., object scale distribution and scene density) from the target dataset; second, the LLM reasons over these features, augmented with state-of-the-art components retrieved via Retrieval-Augmented Generation (RAG), to synthesize the architecture in a structured Neural Architecture Description Language (NADL); finally, a compiler instantiates this description into a deployable model. Extensive experiments on five diverse object detection datasets demonstrate that Cognitive-YOLO consistently generates superior architectures, achieving highly competitive accuracy and a better performance-per-parameter trade-off than strong baseline models across multiple benchmarks. Crucially, our ablation studies show that the LLM's data-driven reasoning is the primary driver of performance: a deep understanding of the data's "first principles" matters more for producing a superior architecture than simply retrieving SOTA components.
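The first stage, meta-feature extraction, can be made concrete with a short sketch. The paper's exact feature set and thresholds are not specified here, so this example assumes COCO-style small/medium/large area cutoffs (32² and 96² pixels) for the scale distribution and defines scene density as the mean number of boxes per image; the annotation format shown is a simplified placeholder.

```python
from collections import Counter

def scale_bucket(w, h):
    """Bucket a box by area using COCO-style thresholds (an assumption here)."""
    area = w * h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

def extract_meta_features(annotations):
    """annotations: {image_id: [(w, h), ...]} with box sizes in pixels."""
    buckets = Counter()
    n_boxes = 0
    for boxes in annotations.values():
        n_boxes += len(boxes)
        for w, h in boxes:
            buckets[scale_bucket(w, h)] += 1
    return {
        # Fraction of boxes in each scale bucket.
        "scale_distribution": {
            k: buckets[k] / n_boxes for k in ("small", "medium", "large")
        },
        # Mean objects per image, a simple proxy for scene density.
        "scene_density": n_boxes / len(annotations),
    }

# Toy dataset: two images, three boxes total (one per scale bucket).
anns = {0: [(20, 20), (50, 50)], 1: [(120, 100)]}
feats = extract_meta_features(anns)
```

Features of this kind are what the LLM conditions on in the second stage, e.g. a high small-object fraction arguing for finer-resolution feature maps in the generated architecture.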