ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This study addresses a limitation of existing vision-language manipulation research: it primarily targets rigid robotic arms, whose fixed morphology limits adaptability in confined or cluttered environments. Soft continuum robots adapt better through deformation, but they suffer from unreliable proprioception and distributed actuation. To bridge this gap, the work introduces ManiSoft, a vision-language manipulation benchmark for soft continuum robots. Its custom simulator uses an elastic force constraint to couple soft-body dynamics with contact interactions. An automated pipeline generates 6,300 diverse scenes with expert trajectories by combining a high-level planner, which decomposes each task into waypoints, with a low-level reinforcement learning policy that outputs torque commands to track them. Experiments on three representative policy models show relatively promising results in clean scenes but a substantial drop under randomization, primarily due to inaccurate visual estimation of proprioceptive state and limited exploitation of deformation for obstacle avoidance.
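As a rough illustration of what an elastic force constraint can look like, the sketch below computes a Hooke's-law spring force that pulls a point on the arm toward a contact anchor. This is an assumption made for illustration only: the summary does not give ManiSoft's actual formulation, and the function name `elastic_force` and the `stiffness` and `rest_length` parameters are hypothetical.

```python
import numpy as np

def elastic_force(point, anchor, stiffness=50.0, rest_length=0.0):
    """Spring force on `point` directed toward `anchor` (Hooke's law).

    Hypothetical helper, not ManiSoft's API: a generic elastic coupling term.
    """
    offset = np.asarray(anchor, dtype=float) - np.asarray(point, dtype=float)
    distance = np.linalg.norm(offset)
    if distance == 0.0:
        # No well-defined direction when the two points coincide.
        return np.zeros_like(offset)
    stretch = distance - rest_length
    # Magnitude grows linearly with stretch; direction is the unit offset.
    return stiffness * stretch * offset / distance

force = elastic_force([0.0, 0.0, 0.0], [0.1, 0.0, 0.0])
```

A soft penalty of this kind lets a simulator keep the body and the contacted object coupled without enforcing a rigid constraint, which is one plausible reason to prefer an elastic term.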
πŸ“ Abstract
Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but they face challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates 6,300 diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track the waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate that ManiSoft will serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.
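The two-level trajectory generation described in the abstract can be sketched as follows. This is a toy under stated assumptions, not the authors' pipeline: the planner is replaced by straight-line interpolation, the learned reinforcement learning policy is replaced by a hand-tuned PD controller, and the arm is reduced to a unit-inertia state vector. All names (`plan_waypoints`, `pd_policy`, `track`) and gains are hypothetical.

```python
import numpy as np

def plan_waypoints(start, goal, n_waypoints=5):
    """High-level stand-in: evenly spaced waypoints on a line, ending at goal."""
    start, goal = np.asarray(start, float), np.asarray(goal, float)
    fractions = np.linspace(0.0, 1.0, n_waypoints + 1)[1:]
    return [start + f * (goal - start) for f in fractions]

def pd_policy(q, dq, target, kp=40.0, kd=12.0):
    """Low-level stand-in for the learned policy: PD torque toward a waypoint."""
    return kp * (target - q) - kd * dq

def track(start, goal, dt=0.01, tol=0.02, max_steps=2000):
    """Visit each waypoint in order, stepping toy unit-inertia dynamics."""
    q = np.asarray(start, float).copy()
    dq = np.zeros_like(q)
    for waypoint in plan_waypoints(start, goal):
        for _ in range(max_steps):
            if np.linalg.norm(waypoint - q) < tol:
                break  # waypoint reached; move on to the next one
            torque = pd_policy(q, dq, waypoint)
            dq = dq + dt * torque  # semi-implicit Euler, unit inertia
            q = q + dt * dq
    return q

final_state = track([0.0, 0.0], [1.0, 0.5])
```

Splitting the problem this way keeps the low-level controller's job local (reach the next waypoint), which is a common motivation for hierarchical designs; in a real soft arm the dynamics and torque mapping would be far more complex than this toy.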
Problem

Research questions and friction points this paper is trying to address.

soft robotics
vision-language manipulation
continuum robots
proprioception
deformable control
Innovation

Methods, ideas, or system contributions that make the work stand out.

soft continuum robotics
vision-language manipulation
deformable control
simulation benchmark
reinforcement learning