TA-VLA: Elucidating the Design Space of Torque-aware Vision-Language-Action Models

📅 2025-09-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to model the torque signals inherent in physical interaction, leaving them without closed-loop force control for high-contact robotic manipulation tasks. To address this, we propose TorqueAdapter: a lightweight torque adapter embedded in the VLA decoder, coupled with torque prediction as an auxiliary supervision objective that encourages the model to learn physically grounded internal representations. Unlike conventional encoder-side multimodal fusion, this decoder-centric design sidesteps the difficulty of cross-modal feature alignment. Inspired by joint prediction-and-planning paradigms in autonomous driving, we unify torque-aware perception and action generation within a single modeling framework. Evaluated on multiple high-contact manipulation benchmarks, including TORQUE-Bench, TorqueAdapter significantly improves task success rates and force-control robustness. Our results demonstrate that explicit torque modeling yields a critical gain in the physical understanding capability of VLA models.
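The decoder-side adapter plus auxiliary torque-prediction objective described above can be sketched as a toy computation. This is an illustrative NumPy sketch, not the authors' implementation; all dimensions, the weight names (`W_in`, `W_out`, `W_pred`), the tanh projection, and the MSE auxiliary loss are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def torque_adapter(h, torque, W_in, W_out):
    """Hypothetical decoder-side adapter: project raw torque readings
    into the hidden dimension and fuse them residually into the
    decoder state h (decoder-centric fusion, not encoder fusion)."""
    z = np.tanh(torque @ W_in)   # embed the torque signal
    return h + z @ W_out         # residual add inside the decoder

# Toy dimensions, chosen only for illustration.
d_hidden, d_torque, d_adapter = 8, 3, 4
W_in = rng.normal(size=(d_torque, d_adapter)) * 0.1
W_out = rng.normal(size=(d_adapter, d_hidden)) * 0.1

h = rng.normal(size=(1, d_hidden))       # decoder hidden state
torque = rng.normal(size=(1, d_torque))  # joint torque readings

h_fused = torque_adapter(h, torque, W_in, W_out)

# Auxiliary supervision: predict the torque signal back from the
# fused state, pushing the representation to stay physically grounded.
W_pred = rng.normal(size=(d_hidden, d_torque)) * 0.1
torque_pred = h_fused @ W_pred
aux_loss = np.mean((torque_pred - torque) ** 2)
```

In training, `aux_loss` would be added to the usual action-generation loss, so the same decoder states serve both action output and torque prediction, mirroring the joint prediction-and-planning framing borrowed from autonomous driving.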

📝 Abstract
Many robotic manipulation tasks require sensing and responding to force signals such as torque, both to assess whether the task has been completed successfully and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore Torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. Third, inspired by joint prediction and planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.
Problem

Research questions and friction points this paper is trying to address.

Integrating torque signals into Vision-Language-Action models
Enabling force feedback for robotic manipulation tasks
Exploring design strategies for torque-aware model architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Torque adapters integrated into decoder
Predicting torque as auxiliary output
Physically grounded internal representation of interaction dynamics
Zongzheng Zhang
Beijing Academy of Artificial Intelligence, BAAI
Haobo Xu
Institute for AI Industry Research (AIR), Tsinghua University
Zhuo Yang
Xidian University & Shanghai AI Laboratory
Large Language Model · AI for Science
Chenghao Yue
Beijing Academy of Artificial Intelligence, BAAI
Zehao Lin
Beijing Academy of Artificial Intelligence, BAAI
Huan-ang Gao
Ph.D. student, Tsinghua University
Agent · Vision & Robotics
Ziwei Wang
Nanyang Technological University
Hao Zhao
Beijing Academy of Artificial Intelligence, BAAI; Institute for AI Industry Research (AIR), Tsinghua University