PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses reasoning segmentation in unmanned aerial vehicle (UAV) imagery, which is complicated by oblique viewpoints, ultra-high resolution, and extreme scale variation. The authors formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three reasoning dimensions: spatial, attribute, and scene-level. They introduce DRSeg, a large-scale benchmark of 10,000 high-resolution aerial images paired with Chain-of-Thought question-answer supervision across all three reasoning types, and propose PixDLM, a simple yet effective dual-path pixel-level multimodal language model that serves as a unified baseline. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a foundation for future research on both data and methods.

📝 Abstract
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
Problem

Research questions and friction points this paper is trying to address.

UAV Reasoning Segmentation
Oblique Viewpoints
Ultra-high Resolution
Scale Variations
Multimodal Language Model
Innovation

Methods, ideas, or system contributions that make the work stand out.

UAV Reasoning Segmentation
Multimodal Language Model
Chain-of-Thought Supervision
Pixel-level Reasoning
Remote Sensing Benchmark
Shuyan Ke
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Yifan Mei
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Changli Wu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; Shanghai Innovation Institute
Yonghan Zheng
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Jiayi Ji
Rutgers University
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China