DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation

📅 2026-03-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language-action (VLA) approaches struggle to balance task generality with fine-grained manipulation accuracy in dynamic environments. This work proposes DAM-VLA, a framework that integrates a vision-language model with diffusion-based action models, enabling dynamic coordination between high-level semantic instructions and low-level visual features during action generation. The integration is achieved through an action routing mechanism, a dynamic action model, and a dual-scale action weighting strategy. Evaluated on simulated and real-world benchmarks, including SIMPLER and FurnitureBench, the method outperforms current VLA approaches, successfully executing tasks ranging from basic grasping to complex, long-horizon, contact-rich manipulation, and demonstrates strong generalization and precise control across diverse and challenging scenarios.

📝 Abstract
In dynamic environments such as warehouses, hospitals, and homes, robots must seamlessly transition between gross motion and precise manipulation to complete complex tasks. However, current Vision-Language-Action (VLA) frameworks, largely adapted from pre-trained Vision-Language Models (VLMs), often struggle to reconcile general task adaptability with the specialized precision required for intricate manipulation. To address this challenge, we propose DAM-VLA, a dynamic action model-based VLA framework. DAM-VLA integrates VLM reasoning with diffusion-based action models specialized for arm and gripper control. Specifically, it introduces (i) an action routing mechanism, using task-specific visual and linguistic cues to select appropriate action models (e.g., arm movement or gripper manipulation), (ii) a dynamic action model that fuses high-level VLM cognition with low-level visual features to predict actions, and (iii) a dual-scale action weighting mechanism that enables dynamic coordination between the arm-movement and gripper-manipulation models. Across extensive evaluations, DAM-VLA achieves superior success rates compared to state-of-the-art VLA methods in simulated (SIMPLER, FurnitureBench) and real-world settings, showing robust generalization from standard pick-and-place to demanding long-horizon and contact-rich tasks.
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
robot manipulation
dynamic environments
action precision
task adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action
Dynamic Action Model
Action Routing
Diffusion-based Control
Robotic Manipulation
Xiongfeng Peng
Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China
Jiaqian Yu
Samsung R&D Institute China - Beijing
Machine Learning · Computer Vision
Dingzhe Li
Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China
Yixiang Jin
Samsung R&D Institute China - Beijing
Robotics · Robot Learning · Robot Simulator
Lu Xu
Postdoc, RIKEN AIP
deep learning · machine learning · computer vision
Yamin Mao
Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China
Chao Zhang
Advanced Research Lab, Samsung R&D Institute China-Beijing (SRCB), China
Weiming Li
Principal Engineer, Samsung Electronics
Computer Vision · Augmented Reality · Computational Imaging and Display
Sujin Jang
Principal Researcher, Samsung AI Center (DS Division)
Machine Learning · Robotics · Computer Vision · Human-Computer Interaction
Dongwook Lee
SAIT, Samsung Electronics
Deep learning · Signal Processing · Generative model · GAN · Computer Vision
Daehyun Ji
Samsung AI Center, DS Division, South Korea