CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

📅 2026-05-10
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language models degrade significantly in cross-view object detection, for example between ground-level and aerial perspectives, because of geometric discrepancies and varying scene complexity. This work proposes CrossVL, a framework that brings scene complexity estimation into the task. CrossVL combines Complexity-Aware Pathway Aggregation (CPA), which adaptively fuses features across multiple pathways, with Paired Curriculum Learning (PCL), which exploits the semantic consistency of synchronized ground-aerial pairs to stabilize early training. On the MAVREC dataset, CrossVL raises the aerial mAP of Florence-2 from 58.66% to 61.03%, narrows the ground-aerial performance gap from 8.63 to 6.65 percentage points, and reduces variance across random seeds by a factor of 3.3.
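The headline numbers can be cross-checked with a few lines of arithmetic. The sketch below uses the figures reported in the abstract (58.66% and 61.03% aerial mAP, 8.63pp and 6.65pp gap). The implied ground-view mAP values are not stated in the source; they follow only under the assumption, labeled in the code, that the gap is defined as ground mAP minus aerial mAP.

```python
# Figures reported in the abstract (MAVREC, Florence-2 baseline vs. CrossVL).
baseline_aerial, crossvl_aerial = 58.66, 61.03   # aerial mAP, in percent
baseline_gap, crossvl_gap = 8.63, 6.65           # ground-aerial gap, in pp

# Direct differences between the reported values.
aerial_gain = round(crossvl_aerial - baseline_aerial, 2)
gap_reduction = round(baseline_gap - crossvl_gap, 2)

# ASSUMPTION (not stated in the source): gap = ground mAP - aerial mAP.
# Under that reading, the ground-view mAP can be back-computed.
implied_ground_baseline = round(baseline_aerial + baseline_gap, 2)
implied_ground_crossvl = round(crossvl_aerial + crossvl_gap, 2)
implied_ground_gain = round(implied_ground_crossvl - implied_ground_baseline, 2)

print(f"aerial gain:          {aerial_gain} pp")
print(f"gap reduction:        {gap_reduction} pp")
print(f"implied ground gain:  {implied_ground_gain} pp (under the assumption above)")
```

If the assumption holds, most of the gap reduction would come from the aerial side improving rather than from the ground side dropping.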
📝 Abstract
Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground-view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to improve cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages the semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.
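The two ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how complexity is computed, how many pathways exist, or what schedule PCL follows. The function names, the softmax gate over hypothetical pathway "centers", and the linear paired-to-random schedule are all assumptions made for illustration.

```python
import math
import random

def pathway_weights(complexity, centers=(0.2, 0.5, 0.8), temperature=0.1):
    """Hypothetical CPA-style gate: a softmax over pathways, where the pathway
    whose assumed center is closest to the complexity score in [0, 1] gets
    the largest weight."""
    logits = [-((complexity - c) ** 2) / temperature for c in centers]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(path_features, weights):
    """Weighted sum of per-pathway feature vectors (plain lists of floats)."""
    dim = len(path_features[0])
    return [sum(w * f[i] for w, f in zip(weights, path_features)) for i in range(dim)]

def paired_probability(step, total_steps):
    """Assumed linear schedule: 1.0 (always paired) at step 0, decaying to
    0.0 (fully randomized sampling) at the final step."""
    return max(0.0, 1.0 - step / total_steps)

def sample_batch(pairs, step, total_steps, batch_size, rng):
    """Hypothetical PCL-style sampler. `pairs` is a list of synchronized
    (ground_item, aerial_item) tuples."""
    p = paired_probability(step, total_steps)
    batch = []
    for _ in range(batch_size):
        if rng.random() < p:
            batch.append(rng.choice(pairs))   # keep the synchronized pair together
        else:
            ground = rng.choice(pairs)[0]     # draw each view independently
            aerial = rng.choice(pairs)[1]
            batch.append((ground, aerial))
    return batch

# Usage: a sparse (low-complexity) scene leans on the first pathway,
# and early training steps draw only synchronized pairs.
weights = pathway_weights(complexity=0.2)
fused = aggregate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], weights)
pairs = [(f"ground_{i}", f"aerial_{i}") for i in range(50)]
early = sample_batch(pairs, step=0, total_steps=1000, batch_size=8, rng=random.Random(0))
late = sample_batch(pairs, step=1000, total_steps=1000, batch_size=8, rng=random.Random(0))
```

A soft gate is used here instead of hard routing so that every pathway receives gradient signal. Whether CrossVL routes softly or discretely is not stated in the abstract.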
Problem

Research questions and friction points this paper is trying to address.

cross-view detection
vision-language models
scene complexity
feature fusion
geometric discrepancy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Complexity-Aware Feature Routing
Paired Curriculum Learning
Cross-View Detection
Vision-Language Models
Pathway Aggregation