AgroNVILA: Perception-Reasoning Decoupling for Multi-view Agricultural Multimodal Large Language Models

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing agricultural multimodal large language models: a "terrestrial-centric" (ground-level) bias that causes scale confusion and logic drift when interpreting cross-scale imagery (ground, drone, and satellite), which hinders complex agricultural planning. The paper introduces AgroOmni, described as the first large-scale multi-view agricultural multimodal dataset (288K samples), and proposes AgroNVILA, a model built on a Perception-Reasoning Decoupling (PRD) architecture. On the perception side, a View-Conditioned Meta-Net (VCMN) injects macro-scale spatial context into visual tokens to resolve scale ambiguity. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO), a reinforcement learning mechanism, discourages reliance on statistical shortcuts. Experiments report a 15.18% improvement over state-of-the-art models on multi-altitude agricultural reasoning tasks, indicating stronger holistic agricultural spatial planning.

📝 Abstract
Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multimodal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logic drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture the diverse spatial topologies and scales of modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that uses a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) uses reinforcement learning to align the model's decision-making with expert agricultural logic and to prevent statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, with a significant improvement (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
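The abstract does not specify how the VCMN conditions visual tokens on the view. The sketch below is one plausible reading, not the authors' implementation: it assumes a discrete view label (ground, UAV, satellite), a small meta-network that maps a per-view embedding to a per-channel scale and shift, and FiLM-style modulation applied identically to every visual token. The names `VIEWS`, `init_meta_net`, and `condition_tokens`, and all dimensions, are hypothetical.

```python
import random

VIEWS = ("ground", "uav", "satellite")  # assumed view labels, not from the paper


def init_meta_net(token_dim, embed_dim=4, seed=0):
    """Randomly initialise a toy meta-net (hypothetical parameter layout)."""
    rng = random.Random(seed)
    small = lambda: rng.uniform(-0.1, 0.1)
    return {
        # One embedding vector per view.
        "view_embed": {v: [small() for _ in range(embed_dim)] for v in VIEWS},
        # Linear map: the first token_dim rows give scales, the rest give shifts.
        "weight": [[small() for _ in range(embed_dim)] for _ in range(2 * token_dim)],
    }


def condition_tokens(tokens, view, net):
    """Apply the same view-dependent scale and shift to every visual token."""
    embed = net["view_embed"][view]
    out = [sum(w * e for w, e in zip(row, embed)) for row in net["weight"]]
    dim = len(tokens[0])
    scale, shift = out[:dim], out[dim:]
    # Residual-style modulation: zero weights leave the tokens unchanged.
    return [[x * (1.0 + s) + b for x, s, b in zip(tok, scale, shift)]
            for tok in tokens]


# Usage: the same visual tokens are modulated differently per view.
net = init_meta_net(token_dim=3)
tokens = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
ground_tokens = condition_tokens(tokens, "ground", net)
satellite_tokens = condition_tokens(tokens, "satellite", net)
```

Because the modulation adds only one small linear map per forward pass, a design like this would be consistent with the "minimal computational overhead" claim, though the actual VCMN may differ.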
Problem

Research questions and friction points this paper is trying to address.

agricultural multimodal reasoning
scale confusion
terrestrial-centric bias
multi-view perception
spatial understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception-Reasoning Decoupling
View-Conditioned Meta-Net
Agriculture-aware Relative Policy Optimization
Multi-view Agricultural MLLM
AgroOmni
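The page does not describe how ARPO computes its training signal. The name suggests a group-relative scheme in the style of GRPO, and the sketch below illustrates only that general idea under that assumption: several answers are sampled for one question, each receives a scalar reward, and rewards are normalised within the group to obtain advantages. The function name, the reward values, and the `eps` constant are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one sampled group (GRPO-style assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled answers to one question, rewarded 1.0 if judged correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that score above the group mean get a positive advantage and those below get a negative one, so the policy is pushed toward responses that are better than its own typical output for that question.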
Jiarui Zhang
Sun Yat-sen University
Junqi Hu
Sun Yat-sen University
Zurong Mai
Sun Yat-sen University
Yuhang Chen
Sun Yat-sen University
Shuohong Lou
Sun Yat-sen University
Henglian Huang
Sun Yat-sen University
Lingyuan Zhao
HuanTian Wisdom Technology Co., Ltd.
Jianxi Huang
Professor at China Agricultural University
Data assimilation, Climate change, Agricultural remote sensing, Crop modeling with remote sensing data assimilation, Crop yield
Yutong Lu
Sun Yat-sen University
Haohuan Fu
Tsinghua University
Juepeng Zheng
Sun Yat-sen University