Mixture of Horizons in Action Chunking

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Vision-Language-Action (VLA) models for robotic manipulation face a fundamental trade-off between long-horizon foresight and short-horizon control precision imposed by a fixed action horizon. To address this, we propose a mixture-of-horizons mechanism: within a shared action Transformer, multi-scale action segments are processed in parallel and fused via lightweight linear gating, while adaptive dynamic inference selects stable actions through cross-horizon consensus. This is the first approach that lets a single model jointly exploit the complementary strengths of short- and long-horizon policies while remaining plug-and-play and broadly compatible with both flow-based and regression-based policy paradigms. On LIBERO, the method achieves a 99% average success rate after only 30k training steps, significantly outperforms baselines in generalization to complex tasks, and delivers a 2.5× inference speedup without compromising control accuracy or planning coherence.
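The fusion step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `gated_fusion`, the assumption that all horizon branches are aligned to the same future steps, and the use of a single shared feature vector for the gate are all choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(branch_actions, features, W, b):
    """Fuse per-horizon action predictions with a lightweight linear gate.

    branch_actions: (K, T, D) actions from K horizon branches, assumed
                    aligned to the same T future steps (shorter horizons
                    re-queried or tiled to cover T in this sketch).
    features:       (F,) shared backbone feature used for the gate logits.
    W, b:           linear gate parameters, shapes (K, F) and (K,).
    """
    logits = W @ features + b          # one logit per horizon branch
    weights = softmax(logits)          # convex combination over branches
    fused = np.einsum('k,ktd->td', weights, branch_actions)
    return fused, weights

K, T, D, F = 3, 8, 7, 16               # 3 horizons, 8 steps, 7-DoF actions
branches = rng.normal(size=(K, T, D))
feat = rng.normal(size=F)
W, b = rng.normal(size=(K, F)), np.zeros(K)
fused, w = gated_fusion(branches, feat, W, b)
```

Because the gate is a single linear layer followed by a softmax, it adds negligible parameters and compute on top of the shared action Transformer, which is what makes the mechanism plug-and-play.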

📝 Abstract
Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the **action chunk length** used during training, termed the **horizon**. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that a fixed single horizon is suboptimal. To mitigate this trade-off, we propose a **mixture of horizons (MoH)** strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses the outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5× higher throughput than baselines while preserving superior performance. Extensive experiments over the flow-based policies π_0 and π_0.5 and the one-step regression policy π_reg demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, π_0.5 with MoH reaches a new state of the art with a 99% average success rate on LIBERO after only 30k training iterations. Project page: https://github.com/Timsty1/MixtureOfHorizons
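The dynamic-inference idea in benefit 3 amounts to executing more steps open-loop when the horizon branches agree, and re-querying the policy sooner when they diverge. A minimal sketch of one possible consensus rule follows; the function `consensus_steps` and its max-deviation threshold are hypothetical, and the paper's actual criterion may differ.

```python
import numpy as np

def consensus_steps(branch_actions, tol=0.05):
    """Count the leading action steps on which all horizon branches agree.

    branch_actions: (K, T, D) per-branch predictions for the same T steps.
    Returns the number of prefix steps whose maximum deviation from the
    branch mean stays below `tol`; those steps can be executed open-loop
    before re-querying the policy (an illustrative criterion).
    """
    mean = branch_actions.mean(axis=0)                    # (T, D)
    dev = np.abs(branch_actions - mean).max(axis=(0, 2))  # (T,) worst gap
    n = 0
    while n < len(dev) and dev[n] < tol:
        n += 1
    return n

# Two branches that agree on the first 3 steps, then diverge.
branches = np.zeros((2, 5, 1))
branches[1, 3:, 0] = 1.0
n_agree = consensus_steps(branches)
```

Under this rule, full cross-horizon agreement lets the controller consume an entire chunk per model call, which is where the reported throughput gain over fixed-horizon re-planning would come from.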
Problem

Research questions and friction points this paper is trying to address.

Optimizing action chunk length for robotic manipulation tasks
Balancing long-term foresight with fine-grained control precision
Enabling adaptive horizon selection for dynamic inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of horizons segments action chunks with different lengths
Parallel processing with shared transformer and linear gate fusion
Enables dynamic inference through cross-horizon consensus selection
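The first bullet, rearranging one action chunk into segments of different lengths, can be illustrated directly. This is one plausible reading (non-overlapping segments whose lengths divide the chunk); the helper name `mixture_of_horizons_views` and the divisibility assumption are choices of this sketch, not the paper's exact rearrangement.

```python
def mixture_of_horizons_views(chunk, horizons):
    """Rearrange a single action chunk of length H into parallel views,
    one per horizon, each tiling the chunk with segments of that length.
    Assumes every horizon divides H (illustrative simplification)."""
    H = len(chunk)
    views = []
    for h in horizons:
        assert H % h == 0, "horizons assumed to divide the chunk length"
        views.append([chunk[i:i + h] for i in range(0, H, h)])
    return views

chunk = list(range(8))                       # an 8-step action chunk
views = mixture_of_horizons_views(chunk, [2, 4, 8])
# views[0]: four 2-step segments; views[2]: one 8-step segment
```

Each view is then processed in parallel by the shared action transformer, so the per-horizon branches see the same underlying trajectory at different temporal granularities.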