Synthesizing the Kill Chain: A Zero-Shot Framework for Target Verification and Tactical Reasoning on the Edge

📅 2026-02-10
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This study addresses reliable target verification and tactical reasoning for edge-based autonomous robots in dynamic military environments, where limited training data and computational resources hinder performance. The authors propose a hierarchical zero-shot framework that decouples perception from reasoning via a "controlled input" mechanism: Grounding DINO generates high-recall region proposals, which are then semantically validated by lightweight vision-language models (Qwen/Gemma, 4B-12B parameters) in a Scout-Commander agent workflow. This work presents the first systematic analysis of failure modes across VLM scales in safety-critical tasks, reporting 100% accuracy in false-positive filtering, 97.5% in battle damage assessment, 55-90% in fine-grained vehicle classification, 100% correctness in asset deployment, a tactical reasoning score of 9.8/10, and end-to-end latency under 75 seconds on 55 synthetic video sequences.

๐Ÿ“ Abstract
Deploying autonomous edge robotics in dynamic military environments is constrained by both scarce domain-specific training data and the computational limits of edge hardware. This paper introduces a hierarchical, zero-shot framework that cascades lightweight object detection with compact Vision-Language Models (VLMs) from the Qwen and Gemma families (4B-12B parameters). Grounding DINO serves as a high-recall, text-promptable region proposer, and frames with high detection confidence are passed to edge-class VLMs for semantic verification. We evaluate this pipeline on 55 high-fidelity synthetic videos from Battlefield 6 across three tasks: false-positive filtering (up to 100% accuracy), damage assessment (up to 97.5%), and fine-grained vehicle classification (55-90%). We further extend the pipeline into an agentic Scout-Commander workflow, achieving 100% correct asset deployment and a 9.8/10 reasoning score (graded by GPT-4o) with sub-75-second latency. A novel "Controlled Input" methodology decouples perception from reasoning, revealing distinct failure phenotypes: Gemma3-12B excels at tactical logic but fails in visual perception, while Gemma3-4B exhibits reasoning collapse even with accurate inputs. These findings validate hierarchical zero-shot architectures for edge autonomy and provide a diagnostic framework for certifying VLM suitability in safety-critical applications.
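The two-stage cascade in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the detector and VLM calls are stand-in stubs for Grounding DINO and an edge-class Qwen/Gemma model, and the `Proposal` type, `cascade` function, and 0.35 confidence threshold are all hypothetical names and values chosen for the sketch.

```python
# Sketch of a hierarchical zero-shot cascade: a high-recall detector
# proposes regions, and only confident detections reach the (costlier)
# VLM verifier, which filters semantic false positives.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Proposal:
    frame_id: int
    label: str         # text prompt the detector matched, e.g. "tank"
    confidence: float  # detector confidence in [0, 1]


def cascade(
    proposals: List[Proposal],
    verify: Callable[[Proposal], bool],   # VLM semantic check (stubbed here)
    threshold: float = 0.35,              # hypothetical gating threshold
) -> List[Proposal]:
    # Stage 1: confidence gate -- cheap, keeps recall high.
    candidates = [p for p in proposals if p.confidence >= threshold]
    # Stage 2: semantic verification -- the VLM rejects false positives.
    return [p for p in candidates if verify(p)]


# Toy verifier standing in for the VLM: accepts only vehicle labels.
def toy_vlm_verify(p: Proposal) -> bool:
    return p.label in {"tank", "apc"}


detections = [
    Proposal(0, "tank", 0.91),
    Proposal(1, "rock", 0.88),   # confident detector false positive
    Proposal(2, "tank", 0.12),   # below threshold, never reaches the VLM
]
verified = cascade(detections, toy_vlm_verify)
print([p.frame_id for p in verified])  # -> [0]
```

The design point the paper's "Controlled Input" methodology exploits is visible here: because stage 1 and stage 2 are decoupled, the verifier can be fed hand-curated proposals to test reasoning independently of perception.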
Problem

Research questions and friction points this paper is trying to address.

edge robotics
zero-shot learning
target verification
tactical reasoning
Vision-Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot reasoning
edge autonomy
Vision-Language Models (VLMs)
Controlled Input methodology
hierarchical perception-reasoning