Geometry Guided Self-Consistency for Physical AI

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Diffusion-based physical AI models generate action trajectories through a stochastic sampling process, so committing to a single sample per inference round makes decision-making fragile and lets errors accumulate over an episode. To address this, the work proposes KeyStone, a training-free inference-time method that samples K action trajectories in parallel, clusters them using the geometric structure of the continuous action space, and returns the medoid of the largest cluster as the final trajectory. KeyStone exploits action-space geometry for self-consistent trajectory selection, so it needs neither a learned judge nor any additional training. Experiments show that KeyStone improves task success rates by up to 13.3% over single-trajectory sampling across multiple physical AI models, including vision-language-action models (VLAs) and world-action models (WAMs), with negligible latency overhead and accuracy on par with model-based selectors.

📝 Abstract

State-of-the-art physical AI models generate a chunk of actions per inference through diffusion or flow matching, iteratively refining an initial noise sample into an action trajectory. Because this inference process is inherently stochastic, committing to a single trajectory per round is brittle, and this brittleness compounds across the many sequential rounds that comprise a complete episode. We introduce KeyStone, an inference-time self-consistency method for diffusion-based action generation that draws K candidate action chunks in parallel from a shared model context, clusters them in continuous action space, and returns the medoid of the largest cluster, with no additional model required. Two properties make this practical. First, the compact nature of action trajectories makes diffusion inference memory-bandwidth bound, leaving spare compute capacity to run K chains in parallel with no additional wall-clock latency. Second, unlike token or pixel spaces, where distance carries no semantic meaning and selection requires a learned judge, action chunks are geometrically structured such that Euclidean distance directly reflects physical similarity, making selection principled and judge-free. Across diverse vision-language-action models (VLAs) and world-action models (WAMs), KeyStone improves task success rates by up to 13.3% over single-trajectory sampling with negligible latency overhead, while matching the accuracy of model-based selectors at no training cost. We open source KeyStone at https://github.com/dywsjtu/keystone.
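
The selection step described in the abstract (cluster K candidates by Euclidean distance, then return the medoid of the largest cluster) can be illustrated with a minimal NumPy sketch. The abstract does not say which clustering algorithm KeyStone uses, so this sketch assumes a simple threshold-based grouping (connected components under a distance cutoff). The function name `select_medoid_of_largest_cluster`, the `eps` threshold, and the synthetic candidates are all hypothetical; the released code at the repository above may differ.

```python
import numpy as np


def select_medoid_of_largest_cluster(chunks: np.ndarray, eps: float) -> int:
    """Return the index of the candidate chosen by a cluster-then-medoid rule.

    chunks: array of shape (K, H, D) holding K candidate action chunks,
            each with horizon H and action dimension D.
    eps:    hypothetical distance threshold; two candidates are linked
            when their Euclidean distance is at most eps (an assumption,
            not the paper's stated clustering rule).
    """
    k = chunks.shape[0]
    flat = chunks.reshape(k, -1)

    # Pairwise Euclidean distances between flattened action chunks.
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)

    # Assumed clustering: connected components of the "distance <= eps" graph.
    labels = np.full(k, -1)
    n_clusters = 0
    for seed in range(k):
        if labels[seed] != -1:
            continue
        labels[seed] = n_clusters
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero((dist[i] <= eps) & (labels == -1)):
                labels[j] = n_clusters
                stack.append(int(j))
        n_clusters += 1

    # Largest cluster, then its medoid: the member with the smallest
    # total distance to the other members of the same cluster.
    sizes = np.bincount(labels, minlength=n_clusters)
    members = np.flatnonzero(labels == np.argmax(sizes))
    within = dist[np.ix_(members, members)].sum(axis=1)
    return int(members[np.argmin(within)])


# Synthetic stand-in for K = 8 sampled chunks: five near one mode, three near another.
rng = np.random.default_rng(0)
mode_a = rng.normal(0.0, 0.01, size=(5, 8, 7))
mode_b = rng.normal(1.0, 0.01, size=(3, 8, 7))
candidates = np.concatenate([mode_a, mode_b])

chosen = select_medoid_of_largest_cluster(candidates, eps=0.5)
selected_chunk = candidates[chosen]  # an actual sample from the majority mode
```

Returning a medoid rather than a mean keeps the output one of the sampled trajectories instead of an average that the model never produced. In a real pipeline the candidates would come from K parallel denoising chains that share the model context, which is not shown here.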

Problem

Research questions and friction points this paper is trying to address.

physical AI

diffusion models

action trajectory

stochastic inference

trajectory selection

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-consistency

diffusion-based action generation

geometric clustering