🤖 AI Summary
Existing vision-language-action (VLA) approaches struggle to balance task generality with fine-grained manipulation accuracy in dynamic environments. This work proposes DAM-VLA, a framework that integrates a vision-language model with diffusion-based action models, enabling dynamic coordination between high-level semantic instructions and low-level visual features during action generation. The integration is achieved through an action routing mechanism, a dynamic action model, and a dual-scale action weighting strategy. Evaluated on simulated benchmarks (SIMPLER and FurnitureBench) as well as real-world tasks, the proposed method significantly outperforms current VLA approaches, successfully executing tasks ranging from basic grasping to complex, long-horizon, contact-rich manipulation. These results demonstrate strong generalization and precise control across diverse and challenging scenarios.
📝 Abstract
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulation to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism that uses task-specific visual and linguistic cues to select the appropriate action model (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that dynamically coordinates the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves higher success rates than state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
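To make the three components concrete, here is a minimal PyTorch sketch of how routing and dual-scale weighting *could* compose two specialist action heads. It is based only on the abstract: every class name (`DamVlaSketch`, `ActionRouter`, `SpecialistHead`), the feature shapes, the MLP stand-ins for the diffusion-based specialists, and the convex-combination weighting are assumptions, not the paper's actual design.

```python
# Hypothetical sketch of DAM-VLA-style routing and dual-scale weighting.
# Diffusion action models are replaced by simple MLP stand-ins.
import torch
import torch.nn as nn


class ActionRouter(nn.Module):
    """(i) Scores which specialist (arm vs. gripper) the current step needs,
    from visual and linguistic cues (assumed to be pooled embeddings)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(),
            nn.Linear(dim, 2),  # logits for [arm, gripper]
        )

    def forward(self, vis_feat, lang_feat):
        return self.score(torch.cat([vis_feat, lang_feat], dim=-1)).softmax(-1)


class SpecialistHead(nn.Module):
    """(ii) Stand-in for a diffusion-based action model: maps high-level VLM
    cognition fused with low-level visual features to an action chunk."""
    def __init__(self, dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(),
            nn.Linear(dim, action_dim),
        )

    def forward(self, vlm_feat, vis_feat):
        return self.net(torch.cat([vlm_feat, vis_feat], dim=-1))


class DamVlaSketch(nn.Module):
    def __init__(self, dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.router = ActionRouter(dim)
        self.arm_head = SpecialistHead(dim, action_dim)
        self.grip_head = SpecialistHead(dim, action_dim)

    def forward(self, vlm_feat, vis_feat, lang_feat):
        w = self.router(vis_feat, lang_feat)     # (B, 2) soft routing weights
        arm = self.arm_head(vlm_feat, vis_feat)  # (B, action_dim)
        grip = self.grip_head(vlm_feat, vis_feat)
        # (iii) one plausible reading of "dual-scale weighting": a per-step
        # convex combination of the two specialists' predictions.
        return w[:, :1] * arm + w[:, 1:] * grip


model = DamVlaSketch()
vlm, vis, lang = (torch.randn(1, 256) for _ in range(3))
action = model(vlm, vis, lang)  # (1, 7): e.g., end-effector pose + gripper
```

The soft routing weights above blend both specialists; a hard `argmax` over the router logits would instead select a single model per step, and the abstract alone does not say which variant the paper uses.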