🤖 AI Summary
This work addresses the limitations of existing agricultural multimodal large models, which suffer from a "ground-centric" bias that causes scale confusion and logical drift when interpreting cross-scale imagery (ground, drone, and satellite), hindering complex agricultural planning. To overcome this, we introduce AgroOmni, the first large-scale multi-view agricultural multimodal dataset comprising 288K samples, and propose AgroNVILA, a novel model built on a perception-reasoning decoupled architecture. AgroNVILA incorporates a View-Conditioned Meta-Net (VCMN) that injects macro-scale spatial context to resolve scale ambiguity, and an Agriculture-aware Relative Policy Optimization (ARPO) reinforcement learning mechanism that prevents reliance on statistical shortcuts. Experiments demonstrate that our approach improves accuracy by 15.18% over state-of-the-art models on multi-altitude agricultural reasoning tasks, significantly enhancing holistic agricultural spatial planning.
📝 Abstract
Agricultural multimodal reasoning requires robust spatial understanding across varying scales, from ground-level close-ups to top-down UAV and satellite imagery. Existing Multimodal Large Language Models (MLLMs) suffer from a significant "terrestrial-centric" bias, causing scale confusion and logical drift during complex agricultural planning. To address this, we introduce AgroOmni (288K), the first large-scale multi-view training corpus designed to capture the diverse spatial topologies and scales of modern precision agriculture. Built on this dataset, we propose AgroNVILA, an MLLM that utilizes a novel Perception-Reasoning Decoupling (PRD) architecture. On the perception side, we incorporate a View-Conditioned Meta-Net (VCMN), which injects macroscopic spatial context into visual tokens, resolving scale ambiguities with minimal computational overhead. On the reasoning side, Agriculture-aware Relative Policy Optimization (ARPO) leverages reinforcement learning to align the model's decision-making with expert agricultural logic, preventing statistical shortcuts. Extensive experiments demonstrate that AgroNVILA outperforms state-of-the-art MLLMs, achieving significant improvements (+15.18%) in multi-altitude agricultural reasoning, reflecting its robust capability for holistic agricultural spatial planning.
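The abstract does not specify how the VCMN injects macroscopic spatial context into visual tokens. As a purely hypothetical sketch (the sizes, weights, and FiLM-style formulation below are our illustrative assumptions, not the paper's actual design), a view-conditioned meta-net could map an embedding of the capture viewpoint (ground, UAV, or satellite) through a small network to per-channel scale and shift parameters applied to every visual token:

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, HIDDEN, NUM_VIEWS = 32, 16, 3  # illustrative sizes, not the paper's

# learned parameters (random stand-ins here)
view_embed = rng.normal(size=(NUM_VIEWS, HIDDEN))  # one embedding per viewpoint
W1 = rng.normal(size=(HIDDEN, HIDDEN)) * 0.1
W2 = rng.normal(size=(HIDDEN, 2 * DIM)) * 0.1      # maps to (scale, shift)

def vcmn(tokens: np.ndarray, view_id: int) -> np.ndarray:
    """FiLM-style modulation of visual tokens by a viewpoint embedding."""
    h = np.tanh(view_embed[view_id] @ W1)           # small meta-network
    scale, shift = np.split(h @ W2, 2)              # per-channel parameters
    return tokens * (1.0 + scale) + shift           # broadcast over all tokens

tokens = rng.normal(size=(16, DIM))                 # 16 visual tokens
ground = vcmn(tokens, view_id=0)                    # ground-level context
satellite = vcmn(tokens, view_id=2)                 # satellite context
print(ground.shape)                                 # (16, 32)
```

Because the modulation depends only on the view embedding, such a design would add one small forward pass per image rather than per token, which is consistent with the abstract's claim of minimal computational overhead.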
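ARPO's name suggests a group-relative policy optimization scheme. As a hedged illustration only (the reward values and normalization below are generic GRPO-style assumptions, not the paper's agriculture-aware reward design), the core relative-advantage step compares each sampled response against its sibling responses to the same prompt:

```python
import numpy as np

def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Standardize each response's reward against its group mean and std.

    For a group of candidate answers to one prompt, a response earns a
    positive advantage only by beating its siblings, so the policy update
    favors relatively better reasoning rather than absolute reward scale.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 sampled answers scored by some task-specific (here made-up) reward
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print(adv)  # the highest-reward answer receives the largest advantage
```

Under this kind of scheme, a perception-grounded reward (rather than answer-matching alone) would be one way to discourage the statistical shortcuts the abstract mentions, though the actual ARPO reward is not described here.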