Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study applies large language-vision models to fine-grained recognition of military vehicles in synthetic aperture radar (SAR) imagery, which the authors describe as a unique effort in SAR automatic target recognition (ATR). From the MSTAR Public Dataset, the authors build a work-in-progress training and evaluation benchmark, extended with descriptive text captions and question-answer pairs, that covers image captioning and visual question answering (VQA). They examine Transformer-based architectures, including CLIP and LLaVA, and train a model with parameter-efficient fine-tuning. The fine-tuned model identifies fine-grained target qualities at 98% accuracy, and the authors discuss pitfalls in the data setup and experiments that could lead to misleading conclusions. The work is aimed at machine-assisted remote sensing ATR for military and intelligence contexts.
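The summary mentions parameter-efficient fine-tuning without naming a specific technique. As a rough illustration only, the sketch below assumes a low-rank adapter (LoRA-style) and compares trainable parameter counts for one weight matrix. The layer width and rank are hypothetical values chosen for the arithmetic, not settings reported by the authors.

```python
# Illustrative arithmetic only: the paper does not state which
# parameter-efficient method it uses. A LoRA-style low-rank update is
# assumed here, and the sizes below are hypothetical.

def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_out * d_in


def low_rank_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r update W + B @ A,
    where B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d = 4096   # hypothetical layer width
    r = 8      # hypothetical adapter rank
    full = full_params(d, d)
    low = low_rank_params(d, d, r)
    print(f"full fine-tuning: {full} parameters")
    print(f"rank-{r} adapter:  {low} parameters ({low / full:.4%} of full)")
```

Under these assumed sizes the adapter trains well under one percent of the parameters of the full matrix, which is the general motivation for this family of methods.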

📝 Abstract

Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.
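The abstract describes extending MSTAR with text captions and question-answer pairs but does not give the templates. The sketch below shows one way such annotations might be generated from a chip's class label and depression angle, and how answers could be scored by exact match. The `Chip` record, the question wording, the file name, and the scoring rule are all assumptions for illustration; only the ten target class names come from the public MSTAR ten-class set.

```python
from dataclasses import dataclass

# Coarse vehicle category for each of the ten public MSTAR target classes.
CATEGORY = {
    "2S1": "self-propelled howitzer",
    "BMP2": "infantry fighting vehicle",
    "BRDM2": "armored reconnaissance vehicle",
    "BTR60": "armored personnel carrier",
    "BTR70": "armored personnel carrier",
    "D7": "bulldozer",
    "T62": "tank",
    "T72": "tank",
    "ZIL131": "truck",
    "ZSU23-4": "self-propelled anti-aircraft gun",
}


@dataclass(frozen=True)
class Chip:
    """Hypothetical metadata record for one SAR image chip."""
    path: str
    target: str
    depression_deg: int


def make_caption(chip: Chip) -> str:
    """Build a template caption (hypothetical wording)."""
    return (
        f"A SAR image chip of a {chip.target} {CATEGORY[chip.target]} "
        f"collected at a {chip.depression_deg} degree depression angle."
    )


def make_qa_pairs(chip: Chip) -> list[tuple[str, str]]:
    """Build template question-answer pairs (hypothetical wording)."""
    category = CATEGORY[chip.target]
    return [
        ("What type of vehicle is shown?", category),
        ("Which target class is this?", chip.target),
        ("Is the vehicle a tank?", "yes" if category == "tank" else "no"),
    ]


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to the reference, ignoring case and outer spaces."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


if __name__ == "__main__":
    chip = Chip(path="chip_0001.png", target="T72", depression_deg=17)  # hypothetical file
    print(make_caption(chip))
    for question, answer in make_qa_pairs(chip):
        print(f"Q: {question}  A: {answer}")
```

Exact-match scoring is a simplification: free-form model answers would usually need normalization or a more tolerant metric, and the abstract does not say how the reported 98% figure was computed.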

Problem

Research questions and friction points this paper is trying to address.

Automatic Target Recognition

Synthetic Aperture Radar

Military Vehicle Classification

Remote Sensing

Visual Question Answering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language-Vision Models

Synthetic Aperture Radar

Visual Question Answering