🤖 AI Summary
This work addresses aerial grasping in cluttered environments, where occlusions and collision risks make reliable end-to-end execution difficult. The authors propose an integrated pipeline that combines language instruction understanding, active multi-view exploration, 6-DoF grasp generation, and collision-aware feasibility assessment with standard trajectory planning and control, closing the loop from task instruction to grasp execution. Experiments in cluttered real-world scenarios demonstrate robust and reliable grasp execution, supporting the value of combining active perception with feasibility-aware grasp selection.
📝 Abstract
Reliable aerial grasping in cluttered environments remains challenging due to occlusions and collision risks. Existing aerial manipulation pipelines largely rely on centroid-based grasping and do not integrate grasp pose generation models, active exploration, and language-level task specification, so a complete end-to-end system is still missing. In this work, we present an integrated pipeline for reliable aerial grasping in cluttered environments. Given a scene and a language instruction, the system identifies the target object and actively explores it to obtain better views. During exploration, a grasp generation network predicts multiple 6-DoF grasp candidates for each view. Each candidate is evaluated using a collision-aware feasibility framework, and the best grasp across all views is selected and executed using standard trajectory generation and control methods. Experiments in cluttered real-world scenarios demonstrate robust and reliable grasp execution, highlighting the effectiveness of combining active perception with feasibility-aware grasp selection for aerial manipulation.
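The selection step described in the abstract (score candidates from every view, discard infeasible ones, keep the overall best) can be illustrated with a minimal Python sketch. The abstract does not specify the feasibility framework, so everything here is an assumption made for illustration: the `GraspCandidate` type, the `score` field standing in for a network confidence, the obstacle point list, and the spherical `safety_radius` clearance test are hypothetical placeholders, not the authors' method. Grasp orientation is omitted for brevity.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical grasp candidate; orientation is omitted for brevity."""
    position: Point  # grasp position in a common world frame (metres)
    score: float     # assumed network confidence in [0, 1]
    view_id: int     # index of the view that produced this candidate


def min_clearance(position: Point, obstacles: Sequence[Point]) -> float:
    """Smallest Euclidean distance from the grasp position to any obstacle point."""
    if not obstacles:
        return math.inf
    return min(math.dist(position, o) for o in obstacles)


def feasibility(candidate: GraspCandidate,
                obstacles: Sequence[Point],
                safety_radius: float = 0.15) -> float:
    """Assumed stand-in for a collision-aware check.

    Returns 0.0 if any obstacle lies inside the safety sphere,
    otherwise the candidate's own score.
    """
    if min_clearance(candidate.position, obstacles) < safety_radius:
        return 0.0
    return candidate.score


def select_best_grasp(candidates_per_view: Sequence[Sequence[GraspCandidate]],
                      obstacles: Sequence[Point],
                      safety_radius: float = 0.15) -> Optional[GraspCandidate]:
    """Return the highest-valued feasible candidate over all views, or None."""
    best: Optional[GraspCandidate] = None
    best_value = 0.0
    for view in candidates_per_view:
        for candidate in view:
            value = feasibility(candidate, obstacles, safety_radius)
            if value > best_value:
                best, best_value = candidate, value
    return best


if __name__ == "__main__":
    views = [
        [GraspCandidate((0.0, 0.0, 1.0), 0.9, 0),   # high score, but too close to an obstacle
         GraspCandidate((1.0, 0.0, 1.0), 0.6, 0)],
        [GraspCandidate((1.0, 1.0, 1.0), 0.8, 1)],  # candidate seen from a second view
    ]
    obstacles = [(0.05, 0.0, 1.0)]
    print(select_best_grasp(views, obstacles))
```

In this toy run the highest-scoring candidate is rejected because an obstacle sits inside its safety sphere, so the candidate from the second view is chosen. This mirrors the abstract's point that additional views can supply grasps that remain feasible in clutter. A real system would replace the sphere test with a check against the vehicle and gripper geometry.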