Large Models in Dialogue for Active Perception and Anomaly Detection

📅 2025-01-27
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address autonomous aerial monitoring in remote, unknown environments inaccessible to humans, this paper proposes a novel LLM-VQA-driven active perception paradigm. Methodologically, it establishes a dual-model dialogue framework: a large language model (LLM) performs embodied reasoning and generates dynamic flight control policies, while a visual question answering (VQA) model conducts real-time multimodal question answering and anomaly localization. The two models close the perception-decision loop via natural language interaction, and NL2Code translation converts linguistic instructions into executable flight code. The key contribution lies in the first deep integration of LLMs' linguistic capabilities into embodied drone exploration, overcoming the limitations of conventional static visual analysis. Evaluated in a high-fidelity simulation environment, the framework achieves significant improvements in cross-scenario anomaly detection accuracy (+18.7%) and fine-grained semantic understanding, enabling real-time hazardous object identification and alerting.

๐Ÿ“ Abstract
Autonomous aerial monitoring is an important task aimed at gathering information from areas that may not be easily accessible by humans. At the same time, this task often requires recognizing anomalies from a significant distance, or anomalies not previously encountered. In this paper, we propose a novel framework that leverages the advanced capabilities provided by Large Language Models (LLMs) to actively collect information and perform anomaly detection in novel scenes. To this end, we propose an LLM-based model dialogue approach, in which two deep learning models engage in a dialogue to actively control a drone to increase perception and anomaly detection accuracy. We conduct our experiments in a high-fidelity simulation environment where an LLM is provided with a predetermined set of natural language movement commands mapped into executable code functions. Additionally, we deploy a multimodal Visual Question Answering (VQA) model charged with the task of visual question answering and captioning. By engaging the two models in conversation, the LLM asks exploratory questions while simultaneously flying a drone into different parts of the scene, providing a novel way to implement active perception. By leveraging LLMs' reasoning ability, we produce an improved, detailed description of the scene, going beyond existing static perception approaches. In addition to information gathering, our approach is utilized for anomaly detection, and our results demonstrate the proposed method's effectiveness in informing and alerting about potential hazards.
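The dual-model dialogue described in the abstract can be illustrated with a minimal sketch: a fixed map from natural-language movement commands to executable functions (the NL2Code step), a stubbed LLM that picks a command, and a stubbed VQA model that answers a question about the current view. All names here (`COMMAND_MAP`, `mock_llm`, `mock_vqa`, `dialogue_loop`) are hypothetical stand-ins for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the LLM-VQA dialogue loop; both models are stubbed.

def move_forward(state):
    """Executable flight primitive: advance one unit."""
    state["position"] += 1
    return state

def rotate_left(state):
    """Executable flight primitive: turn 90 degrees counter-clockwise."""
    state["heading"] = (state["heading"] - 90) % 360
    return state

# NL2Code mapping: predetermined natural-language commands -> code functions.
COMMAND_MAP = {
    "move forward": move_forward,
    "rotate left": rotate_left,
}

def mock_llm(vqa_answer):
    """Stand-in for the LLM policy: choose a command from the VQA answer."""
    return "rotate left" if "obstacle" in vqa_answer else "move forward"

def mock_vqa(state):
    """Stand-in for the VQA model: answer a question about the current view."""
    return "obstacle ahead" if state["position"] >= 2 else "clear path"

def dialogue_loop(steps=4):
    """Close the perception-decision loop: ask, decide, act, repeat."""
    state = {"position": 0, "heading": 0}
    log = []
    for _ in range(steps):
        answer = mock_vqa(state)            # perception (VQA)
        command = mock_llm(answer)          # reasoning (LLM)
        state = COMMAND_MAP[command](state) # NL2Code: language -> execution
        log.append((answer, command))
    return state, log
```

The point of the sketch is the structure, not the stubs: the two models exchange only natural language, and the command map is the sole bridge from language to executable control.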
Problem

Research questions and friction points this paper is trying to address.

Autonomous Aerial Surveillance
Information Collection
Anomaly Detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Proactive Information Seeking
Anomaly Detection
Tzoulio Chamiti
Computational Intelligence and Deep Learning Group, AIIA Lab., Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki 541 24, Greece
N. Passalis
Dept. of Chemical Engineering, Aristotle University of Thessaloniki, Thessaloniki 541 24, Greece
Anastasios Tefas
Professor, Department of Informatics, Aristotle University of Thessaloniki
Computational Intelligence
Deep Learning
Robotics
Machine Learning
Computer Vision