🤖 AI Summary
Vision-Language-Action (VLA) models in robotic manipulation face a fundamental trade-off between long-horizon foresight and short-horizon control precision due to fixed action horizons. To address this, we propose a hybrid-horizon mechanism: within a shared action Transformer architecture, multi-scale action segments are processed in parallel and fused via a lightweight linear gate, with adaptive dynamic inference ensuring cross-horizon consistency. This is the first approach that lets a single model jointly leverage the complementary strengths of short- and long-horizon policies while remaining plug-and-play and broadly compatible with both flow-based and regression-based policy paradigms. On LIBERO, our method achieves a 99% average success rate with only 30k training steps. It significantly outperforms baselines in generalization to complex tasks and in inference throughput, achieving a 2.5× speedup without compromising control accuracy or planning coherence.
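The adaptive dynamic inference idea above can be illustrated with a minimal sketch. The code below is a hypothetical illustration, not the paper's implementation: it assumes three horizon heads that predict actions over a shared 4-step prefix, measures cross-horizon disagreement as the per-step standard deviation across heads, and executes the longest stable prefix open-loop, so the policy replans less often and throughput rises. The threshold value and all shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictions from three horizon heads over their shared
# 4-step prefix: (n_horizons, steps, action_dim).
preds = rng.normal(size=(3, 4, 7))

# Cross-horizon consensus: average per-step disagreement across heads.
disagreement = preds.std(axis=0).mean(axis=-1)   # shape (4,)

# Execute the longest prefix whose disagreement stays under a tolerance
# (assumed value); stable segments run open-loop without replanning.
THRESH = 1.0
stable = np.cumprod(disagreement < THRESH).astype(bool)
n_exec = int(stable.sum())

# Fuse the heads (here: a plain mean) over the accepted prefix.
actions_to_execute = preds.mean(axis=0)[:n_exec]
```

Longer accepted prefixes mean fewer forward passes per episode, which is where the reported throughput gain would come from.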
📝 Abstract
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the **action chunk length** used during training, termed the **horizon**. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that a fixed single horizon is suboptimal. To mitigate this trade-off, we propose a **mixture of horizons (MoH)** strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5× higher throughput than baselines while preserving superior performance. Extensive experiments over the flow-based policies $\pi_0$ and $\pi_{0.5}$ and the one-step regression policy $\pi_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulation and real-world tasks. Notably, under a mixed-task setting, $\pi_{0.5}$ with MoH reaches a new state-of-the-art 99% average success rate on LIBERO after only 30k training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
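The MoH pipeline described in the abstract (rearrange the chunk into multi-horizon segments, process them with one shared module, fuse with a linear gate) can be sketched as follows. This is a toy illustration under assumed shapes, not the released code: the shared action Transformer is stood in for by a single linear map, the horizon set {4, 8, 16} is an assumption, and the gate is a softmax over learned logits applied on the overlapping prefix.

```python
import numpy as np

rng = np.random.default_rng(0)

A_DIM, CHUNK = 7, 16            # assumed action dim and full chunk length
horizons = [4, 8, 16]           # assumed multi-scale horizon set

def shared_transformer(x, W):
    """Stand-in for the shared action Transformer: one linear layer."""
    return x @ W

W = rng.normal(size=(A_DIM, A_DIM))          # shared weights across horizons
gate_logits = rng.normal(size=len(horizons))  # lightweight linear gate

action_tokens = rng.normal(size=(CHUNK, A_DIM))  # placeholder chunk tokens

# Process each horizon-limited segment in parallel with the *shared* weights.
preds = [shared_transformer(action_tokens[:h], W) for h in horizons]

# Fuse the overlapping prefix (the shortest horizon) with softmax gates.
h_min = min(horizons)
gates = np.exp(gate_logits) / np.exp(gate_logits).sum()
fused = sum(g * p[:h_min] for g, p in zip(gates, preds))  # (h_min, A_DIM)
```

Because all segments reuse one set of transformer weights and the gate is a single linear layer, the overhead over a fixed-horizon policy stays small, which is the plug-and-play property the abstract claims.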