🤖 AI Summary
This work addresses the limited high-level reasoning capabilities of existing agricultural robots, which hinder their effectiveness in complex crop monitoring tasks. The authors propose a modular task planning framework that uses a vision-language model (VLM) to guide horticultural robots through interleaved visual-language queries and action primitives for intelligent decision-making. They establish the first benchmark for short- and long-horizon crop monitoring in both monoculture and polyculture environments. Experimental results show that the approach achieves near-human performance on short-horizon tasks but degrades significantly on long-horizon tasks that depend on noisy semantic maps, highlighting the need for robust semantic mapping in VLM-driven agricultural automation.
📝 Abstract
Crop monitoring is essential for precision agriculture, but current systems lack high-level reasoning. We introduce a novel, modular framework that uses a Visual Language Model (VLM) to guide robotic task planning, interleaving input queries with action primitives. We contribute a comprehensive benchmark for short- and long-horizon crop monitoring tasks in monoculture and polyculture environments. Our main results show that VLMs perform robustly for short-horizon tasks (comparable to human success), but exhibit significant performance degradation in challenging long-horizon tasks. Critically, the system fails when relying on noisy semantic maps, demonstrating a key limitation in current VLM context grounding for sustained robotic operations. This work offers a deployable framework and critical insights into VLM capabilities and shortcomings for complex agricultural robotics.
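To make the interleaving of visual-language queries and action primitives concrete, below is a minimal Python sketch of such a planning loop. The `PlanStep` type and the `vlm.decide`/`vlm.answer` and `robot.capture`/`robot.execute` interfaces are illustrative assumptions, not the paper's actual API.

```python
# Minimal sketch (hypothetical interfaces) of a VLM-guided planning loop that
# interleaves visual-language queries with robot action primitives.
from dataclasses import dataclass


@dataclass
class PlanStep:
    kind: str     # "query" (ask about the scene), "action" (execute a primitive), or "done"
    payload: str  # question text or primitive name, e.g. "move_to(plant_3)"


def monitor_crops(vlm, robot, task: str, max_steps: int = 20) -> list[str]:
    """Drive the robot through a crop-monitoring task via interleaved VLM calls.

    `vlm` and `robot` are assumed objects exposing the hypothetical methods
    used below; they stand in for whatever VLM backend and robot stack is used.
    """
    history: list[str] = []
    for _ in range(max_steps):
        image = robot.capture()                             # current camera view
        step: PlanStep = vlm.decide(image, task, history)   # VLM picks the next query or action
        if step.kind == "query":
            answer = vlm.answer(image, step.payload)         # visual-language query about the scene
            history.append(f"Q: {step.payload} -> A: {answer}")
        elif step.kind == "action":
            robot.execute(step.payload)                      # e.g. navigate, inspect, log observation
            history.append(f"ACT: {step.payload}")
        else:                                                # "done" or unrecognized -> stop
            break
    return history
```

In this reading of the framework, the VLM alternates between gathering information (queries grounded in the current image) and committing to action primitives, with the accumulated history serving as context for long-horizon decisions; in long-term settings that history would additionally draw on a semantic map, which is where the reported degradation arises.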