🤖 AI Summary
Agricultural robots frequently struggle to identify what they are contacting (e.g., leaves, twigs, trunks, or ambient background) during contact-intensive tasks such as pruning and harvesting, because severe visual occlusion and unstructured field environments degrade perception and compromise operational safety. To address this, we propose the first audio-visual multimodal contact classification method tailored to agricultural scenarios. Our approach jointly models vibration-induced audio signals and RGB visual features via time-frequency analysis and cross-modal feature alignment. Crucially, it achieves zero-shot cross-embodiment generalization, transferring from handheld sensor data to a robot-mounted end-effector without retraining. Evaluated on a four-class material identification task in real-world orchards, the method attains an F1 score of 0.82, improving the robustness and practicality of contact perception under unstructured, dynamic field conditions.
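To make the fusion idea concrete, here is a minimal sketch of a late-fusion audio-visual contact classifier. The layer sizes, class names, and overall architecture are illustrative assumptions, not the paper's actual model: an audio branch encodes a log-mel spectrogram of the contact-induced vibration, a visual branch encodes an RGB crop, and the concatenated embeddings are mapped to the four contact classes.

```python
# Sketch of a late-fusion audio-visual contact classifier (assumed architecture,
# not the paper's implementation).
import torch
import torch.nn as nn

CLASSES = ["leaf", "twig", "trunk", "ambient"]  # four contact classes

class AudioVisualContactClassifier(nn.Module):
    def __init__(self, embed_dim=128, num_classes=len(CLASSES)):
        super().__init__()
        # Audio branch: small CNN over a (1, n_mels, time) log-mel spectrogram.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Visual branch: small CNN over a (3, H, W) RGB image or crop.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Fusion head: concatenate modality embeddings, predict the contact class.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, spectrogram, image):
        a = self.audio_encoder(spectrogram)          # (B, embed_dim)
        v = self.visual_encoder(image)               # (B, embed_dim)
        return self.head(torch.cat([a, v], dim=1))   # (B, num_classes) logits

# Example forward pass with dummy inputs.
model = AudioVisualContactClassifier()
logits = model(torch.randn(2, 1, 64, 100), torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 4])
```

Late fusion is only one plausible design; the cross-modal feature alignment described above could equally be realized with attention over the two embeddings.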
📝 Abstract
Contact-rich manipulation tasks in agriculture, such as pruning and harvesting, require robots to physically interact with tree structures to maneuver through cluttered foliage. Identifying whether the robot is contacting rigid or soft materials is critical for safe downstream manipulation, yet vision alone is often insufficient due to occlusion and limited viewpoints in such unstructured environments. To address this, we propose a multi-modal classification framework that fuses vibrotactile (audio) and visual inputs to identify the contact class: leaf, twig, trunk, or ambient. Our key insight is that contact-induced vibrations carry material-specific signals, making audio effective for detecting contact events and distinguishing material types, while visual features add complementary semantic cues that support more fine-grained classification. We collect training data using a hand-held sensor probe and demonstrate zero-shot generalization to a robot-mounted probe embodiment, achieving an F1 score of 0.82. These results underscore the potential of audio-visual learning for manipulation in unstructured, contact-rich environments.
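The time-frequency representation that feeds the audio branch can be as simple as a log-mel spectrogram of the vibration recording. The sketch below shows one way to compute it with torchaudio; the sample rate, FFT size, and hop length are assumptions for illustration and may differ from the paper's pipeline.

```python
# Sketch: turning a contact-induced vibration clip into a log-mel spectrogram
# (assumed parameters, not the paper's exact preprocessing).
import torch
import torchaudio

SAMPLE_RATE = 16_000  # assumed sample rate of the probe's contact microphone

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

def audio_to_spectrogram(waveform: torch.Tensor) -> torch.Tensor:
    """Map a (1, num_samples) waveform to a (1, n_mels, time) log-mel spectrogram."""
    return to_db(to_mel(waveform))

# Example with a synthetic 0.5 s clip; a real clip would be loaded from the
# probe recording (e.g. via torchaudio.load).
clip = torch.randn(1, SAMPLE_RATE // 2)
spec = audio_to_spectrogram(clip)
print(spec.shape)  # torch.Size([1, 64, T]) where T depends on hop_length
```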