🤖 AI Summary
This work addresses the challenge of grounding language instructions to actions in cluttered environments, where occlusions and viewpoint variations hinder consistent 3D spatial understanding in existing vision-language-action models. The authors propose a geometry-aware multi-view fusion approach that constructs geometrically consistent representations by predicting depth distributions for visual tokens, performing differentiable 3D lifting, and aggregating local features across views. A Perceiver-style text-aware readout mechanism enables fine-grained instruction grounding on top of frozen CLIP features. During training, depth distillation is introduced to enhance geometric priors without increasing inference cost. Evaluated on RoboTwin 2.0 with domain randomization, the method achieves a 23.0 percentage point improvement in success rate over the strongest baseline, and real-robot experiments demonstrate effective sim-to-real transfer and consistent performance gains from depth distillation.
📝 Abstract
Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint changes, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To cope with noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors. On RoboTwin 2.0 under a domain-randomized setting, PEAfowl improves over the strongest baseline by 23.0 percentage points in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation. Project website: https://peafowlvla.github.io/.
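The geometry-aware front-end described above (per-token depth distributions, differentiable 3D lifting, and training-only depth distillation) can be sketched numerically. This is a minimal illustration, not the paper's implementation: all shapes, the linear depth head, the bin range, and the intrinsics `K` below are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TOKENS, FEAT, N_BINS = 4, 8, 16
depth_bins = np.linspace(0.2, 2.0, N_BINS)           # candidate depths (m), assumed range

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Per-token depth distribution: here a single linear head over token features.
W = rng.normal(size=(FEAT, N_BINS)) * 0.1
tokens = rng.normal(size=(N_TOKENS, FEAT))           # stand-in for visual tokens
p_depth = softmax(tokens @ W)                        # (N_TOKENS, N_BINS)

# 2) Differentiable 3D lifting: backproject each token's pixel center through
#    the camera intrinsics K, scaled by the expected depth under p_depth.
K = np.array([[200., 0., 64.], [0., 200., 64.], [0., 0., 1.]])  # illustrative intrinsics
uv = rng.uniform(0, 128, size=(N_TOKENS, 2))         # token pixel centers
rays = np.linalg.inv(K) @ np.concatenate([uv, np.ones((N_TOKENS, 1))], axis=1).T
z_expected = p_depth @ depth_bins                    # (N_TOKENS,) soft depth estimate
points_3d = (rays * z_expected).T                    # (N_TOKENS, 3) lifted token positions

# 3) Training-only depth distillation: cross-entropy between the predicted
#    distribution and teacher depth discretized into the same bins.
teacher_depth = rng.uniform(0.2, 2.0, size=N_TOKENS) # stand-in for a depth teacher
target_bin = np.abs(teacher_depth[:, None] - depth_bins[None]).argmin(axis=1)
distill_loss = -np.log(p_depth[np.arange(N_TOKENS), target_bin] + 1e-9).mean()
```

In a full model the lifted points would feed the local cross-view aggregation step, and `distill_loss` would be added to the policy objective during training only, so inference cost is unchanged.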