🤖 AI Summary
General-purpose vision-language-action (VLA) policies often struggle with complex tasks that require fine-grained spatial understanding or high-precision manipulation. This work proposes OmniGuide, a framework that represents heterogeneous guidance sources, such as 3D foundation models, semantic-reasoning vision-language models (VLMs), and human pose estimators, as differentiable energy fields in 3D space. Attractors and repellers in these fields steer VLA policy sampling toward better actions, without a task-specific integration scheme for each guidance modality. Applied to state-of-the-art models such as π₀.₅ and GR00T N1.6, the approach significantly improves the success and safety rates of general-purpose policies in both simulated and real-world environments, and matches or exceeds prior methods designed around a specific guidance source.
📝 Abstract
Vision-language-action (VLA) models have shown great promise as generalist policies for a wide range of relatively simple tasks. However, they show limited performance on more complex tasks, such as those requiring sophisticated spatial or semantic understanding, manipulation in clutter, or precise manipulation. We propose OMNIGUIDE, a flexible framework that improves VLA performance on such tasks by leveraging arbitrary sources of guidance, such as 3D foundation models, semantic-reasoning VLMs, and human pose models. We show how many kinds of guidance can be naturally expressed as differentiable energy functions with task-specific attractors and repellers located in 3D space that influence the sampling of VLA actions. In this way, OMNIGUIDE enables guidance sources with complementary task-relevant strengths to improve a VLA model's performance on challenging tasks. Extensive experiments in both simulation and real-world environments, across diverse sources of guidance, demonstrate that OMNIGUIDE significantly improves both the success and safety rates of state-of-the-art generalist policies (e.g., $\pi_{0.5}$, GR00T N1.6). Critically, our unified framework matches or surpasses the performance of prior methods designed to incorporate specific sources of guidance into VLA policies. Project Page: https://omniguide.github.io/
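To make the attractor/repeller idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the quadratic attractor term, the hinge-squared repeller term, the weights, the radius, and the function names `energy`, `energy_grad`, and `guide` are all illustrative assumptions. It also applies plain gradient descent to a single 3D point, whereas the abstract describes the energy influencing the sampling of VLA actions.

```python
import math

def energy(p, attractor, repeller, radius=0.2, w_att=1.0, w_rep=1.0):
    """Toy energy (assumed form): quadratic pull toward the attractor plus a
    hinge-squared penalty that is nonzero only within `radius` of the repeller."""
    d_att_sq = sum((pi - ai) ** 2 for pi, ai in zip(p, attractor))
    d_rep = math.dist(p, repeller)
    penalty = max(0.0, radius - d_rep) ** 2
    return 0.5 * w_att * d_att_sq + 0.5 * w_rep * penalty

def energy_grad(p, attractor, repeller, radius=0.2, w_att=1.0, w_rep=1.0):
    """Analytic gradient of `energy` with respect to the 3D point p."""
    grad = [w_att * (pi - ai) for pi, ai in zip(p, attractor)]
    d_rep = math.dist(p, repeller)
    if 0.0 < d_rep < radius:
        # d/dp [0.5 * (radius - d)^2] = -(radius - d) * (p - repeller) / d
        scale = -w_rep * (radius - d_rep) / d_rep
        grad = [g + scale * (pi - ri) for g, pi, ri in zip(grad, p, repeller)]
    return grad

def guide(p, attractor, repeller, steps=100, step_size=0.1, **kwargs):
    """Move p downhill on the energy: toward the attractor, away from the repeller."""
    p = list(p)
    for _ in range(steps):
        g = energy_grad(p, attractor, repeller, **kwargs)
        p = [pi - step_size * gi for pi, gi in zip(p, g)]
    return p

# Hypothetical scene: a goal at x = 1 and an obstacle close to the straight path.
start = [0.0, 0.0, 0.0]
goal = [1.0, 0.0, 0.0]
obstacle = [0.5, 0.05, 0.0]
final = guide(start, goal, obstacle)
```

In this toy setup the point ends near the goal after being pushed sideways while it passes within the obstacle's radius. Because both terms are differentiable almost everywhere, the same gradient could in principle be combined with other guidance terms by summing their energies, which is the kind of composition the abstract alludes to.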