Mag-VLA: Vision-Language-Action Model for Bimanual Magnetically Actuated Microrobot Manipulation

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses bimanual cooperative control of magnetically actuated microrobots, which is difficult because of indirect actuation, limited sensing, and nonlinear magnetic interactions. The authors propose Mag-VLA, a hierarchical vision-language-action (VLA) framework that adapts a Qwen2.5-VL-7B backbone with LoRA fine-tuning to map visual observations and natural-language instructions to actions. A motion-aware phase classifier and a phase-conditioned ACT decoder provide temporally coherent multi-step control. On a newly collected teleoperated dataset covering three task configurations, ablations show the ACT-based decoder substantially outperforming alternative generative action heads. In real-robot experiments, the system achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. Bimanual coordination also enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm.

📝 Abstract

Magnetically actuated microrobots have been used as wireless, non-contact manipulation tools at microscales, making them promising for minimally invasive applications. However, their control remains challenging due to indirect actuation, limited sensing, and nonlinear magnetic interactions. In this work, we propose Mag-VLA, a vision-language-action (VLA) model for dexterous magnetic microrobot manipulation using two robotic arms with mounted magnets for dynamic magnetic-field construction. Bimanual coordination enables capabilities such as microrobot reorientation that are difficult or infeasible with a single arm, but it also introduces coupled control challenges, as the policy must generate coordinated trajectories for both actuators within a shared workspace. Our framework adapts a Qwen2.5-VL-7B backbone using Low-Rank Adaptation (LoRA) to process visual observations and language instructions for action prediction. To capture task progression, we introduce a motion-aware phase classifier and a phase-conditioned Action Chunking Transformer (ACT) decoder for temporally coherent multi-step control. We further construct a teleoperated magnetic microrobot manipulation dataset covering three task configurations. Ablation studies show that the ACT-based decoder substantially outperforms alternative generative action heads. In real-robot experiments, Mag-VLA achieves a 90% approach success rate across all tasks and transport success rates of 80%, 70%, and 50% as task difficulty increases. These results demonstrate that hierarchical VLA modeling provides a promising framework for magnetic microrobot manipulation.
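As a rough illustration of the Low-Rank Adaptation idea mentioned in the abstract, the sketch below freezes a dense weight matrix and adds a trainable low-rank update scaled by `alpha / rank`. It is a generic NumPy toy, not the authors' implementation: the class name `LoRALinear`, the layer sizes, and the values of `rank` and `alpha` are all assumptions chosen for illustration, and the abstract does not state which layers of Qwen2.5-VL-7B are adapted.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer plus a trainable low-rank update (hypothetical toy class)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight  # frozen pretrained weight, shape (d_out, d_in)
        # Trainable down-projection, small random initialisation.
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))
        # Trainable up-projection, zero-initialised so the update starts at zero.
        self.B = np.zeros((d_out, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # x has shape (batch, d_in); the result has shape (batch, d_out).
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        # Only A and B would be updated during fine-tuning.
        return self.A.size + self.B.size


# Illustrative sizes only; real backbone layers are far larger.
frozen = np.ones((6, 5))
layer = LoRALinear(frozen, rank=2)
features = np.ones((3, 5))
outputs = layer(features)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `rank * (d_in + d_out)` parameters are trained instead of `d_in * d_out`.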

Problem

Research questions and friction points this paper is trying to address.

magnetically actuated microrobots

bimanual manipulation

vision-language-action

coordinated control

microscale manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)

bimanual magnetic manipulation

Low-Rank Adaptation (LoRA)

Action Chunking Transformer (ACT)

microrobot control

🔎 Similar Papers

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

2024-05-22, arXiv.org, Citations: 1

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey

2024-04-28, arXiv.org, Citations: 15