Habilis-$β$: A Fast-Motion and Long-Lasting On-Device Vision-Language-Action Model

📅 2026-02-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of current vision-language-action (VLA) evaluation, which relies largely on single-trial success rates under curated resets and therefore does not capture the throughput and long-term reliability required for real-world deployment. The authors propose Habilis-$β$, an on-device VLA model, together with the Productivity-Reliability Plane (PRP), an evaluation framework that measures Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol. Key components include language-free pre-training on large-scale play data, post-training on cyclic task demonstrations, phase-adaptive motion shaping (ESPADA), rectified-flow distillation, and classifier-free guidance. In 1-hour continuous runs, the model achieves 572.6 TPH with an MTBI of 39.2 s in simulation (vs. 120.5 TPH and 30.5 s for $π_{0.5}$) and 124 TPH with an MTBI of 137.4 s on a real-world humanoid logistics workflow (vs. 19 TPH and 46.1 s for $π_{0.5}$). It also achieves the highest reported performance on the RoboTwin 2.0 leaderboard across representative tasks.
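The two PRP axes can be illustrated with a short sketch. The summary does not give the exact formulas, so the definitions below are assumptions: TPH is taken as completed tasks scaled to one hour, and MTBI as run duration divided by the number of human interventions. The `RunLog` type, its field names, and the example numbers are hypothetical and are not taken from the paper's logs.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    """Hypothetical record of one continuous run (not the paper's format)."""
    duration_s: float      # wall-clock length of the run, in seconds
    tasks_completed: int   # number of tasks finished during the run
    interventions: int     # number of human interventions during the run


def tasks_per_hour(log: RunLog) -> float:
    # Assumed definition: completed tasks normalized to a one-hour window.
    return log.tasks_completed * 3600.0 / log.duration_s


def mtbi_seconds(log: RunLog) -> float:
    # Assumed definition: run duration divided by intervention count.
    # A run with no interventions has no finite MTBI under this definition.
    if log.interventions == 0:
        return float("inf")
    return log.duration_s / log.interventions


# Illustrative numbers only: a 30-minute run with 90 tasks and 12 interventions.
example = RunLog(duration_s=1800.0, tasks_completed=90, interventions=12)
print(tasks_per_hour(example), mtbi_seconds(example))
```

Under these assumed definitions, a policy can raise TPH by moving faster while lowering MTBI if the faster motion triggers more interventions, which is why the two values are reported together as a plane instead of a single score.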

📝 Abstract

We introduce Habilis-$β$, a fast-motion and long-lasting on-device vision-language-action (VLA) model designed for real-world deployment. Current VLA evaluation remains largely confined to single-trial success rates under curated resets, which fails to capture the fast-motion and long-lasting capabilities essential for practical operation. To address this, we introduce the Productivity-Reliability Plane (PRP), which evaluates performance through Tasks per Hour (TPH) and Mean Time Between Intervention (MTBI) under a continuous-run protocol that demands both high-speed execution and sustained robustness. Habilis-$β$ achieves high performance by integrating language-free pre-training on large-scale play data for robust interaction priors with post-training on cyclic task demonstrations that capture state drift across consecutive task iterations. The system further employs ESPADA for phase-adaptive motion shaping to accelerate free-space transit, utilizes rectified-flow distillation to enable high-frequency control on edge devices, and incorporates classifier-free guidance (CFG) as a deployment-time knob to dynamically balance instruction adherence and learned interaction priors. In 1-hour continuous-run evaluations, Habilis-$β$ outperforms $π_{0.5}$ on both PRP metrics in simulation and real-world environments. In simulation, Habilis-$β$ achieves 572.6 TPH and 39.2 s MTBI (vs. 120.5 TPH and 30.5 s for $π_{0.5}$), while in a real-world humanoid logistics workflow it achieves 124 TPH and 137.4 s MTBI (vs. 19 TPH and 46.1 s for $π_{0.5}$). Finally, Habilis-$β$ achieves the highest reported performance on the standard RoboTwin 2.0 leaderboard across representative tasks, validating its effectiveness in complex manipulation scenarios.
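The role of CFG as a deployment-time knob can be sketched with the standard classifier-free guidance combination. The abstract does not state the paper's exact formulation, so this is an assumption: the guided output is the unconditional prediction plus a scaled difference between the instruction-conditioned and unconditional predictions. The function name `guided_velocity` and the plain-list representation of the predicted velocity are hypothetical.

```python
from typing import List


def guided_velocity(v_cond: List[float],
                    v_uncond: List[float],
                    scale: float) -> List[float]:
    """Standard CFG blend (assumed form, not confirmed by the abstract).

    v_cond:   prediction conditioned on the language instruction
    v_uncond: prediction without the instruction (learned interaction prior)
    scale:    guidance weight chosen at deployment time
    """
    if len(v_cond) != len(v_uncond):
        raise ValueError("predictions must have the same length")
    # scale = 0 returns the unconditional prior, scale = 1 returns the
    # conditional prediction, scale > 1 extrapolates toward the instruction.
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


# Illustrative values only.
v_c = [1.0, 2.0]
v_u = [0.0, 1.0]
for w in (0.0, 1.0, 2.0):
    print(w, guided_velocity(v_c, v_u, w))
```

Because the weight is applied only when combining the two predictions, it can be changed at inference time without retraining, which matches the abstract's description of CFG as a knob for trading instruction adherence against learned priors.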

Problem

Research questions and friction points this paper is trying to address.

vision-language-action

fast-motion

long-lasting

on-device

continuous-run evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language-action

on-device inference

continuous-run evaluation