🤖 AI Summary
Existing vision-language-action (VLA) models struggle to model the torque signals inherent in physical interaction, leaving them without closed-loop force control for contact-rich robotic manipulation tasks. To address this, we propose TorqueAdapter: a lightweight torque adapter embedded within the VLA decoder, coupled with torque prediction as an auxiliary supervision objective that encourages the model to learn physically grounded internal representations. Unlike conventional encoder-side multimodal fusion, this decoder-centric design sidesteps the challenging problem of cross-modal feature alignment. Inspired by joint prediction-and-planning paradigms in autonomous driving, we unify torque-aware perception and action generation within a single modeling framework. Evaluated on multiple contact-rich manipulation benchmarks—including TORQUE-Bench—TorqueAdapter significantly improves task success rates and force-control robustness. Our results demonstrate that explicit torque modeling yields a critical gain in the physical understanding capability of VLA models.
📝 Abstract
Many robotic manipulation tasks require sensing and responding to force signals such as torque, both to assess whether a task has been completed successfully and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. Third, inspired by joint prediction-and-planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.
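To make the auxiliary-supervision idea concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of how a shared decoder state could feed both an action head and an auxiliary torque-prediction head, with the two losses combined during training. All names, dimensions, and the weighting scheme (`lam`) are illustrative assumptions; the point is only that torque supervision backpropagates through the same features used for action generation.

```python
import numpy as np

# Hypothetical sketch, not the paper's implementation: a decoder hidden
# state h is mapped to an action and, via a lightweight auxiliary head,
# to a torque prediction. Both losses share gradients through h, so
# torque supervision shapes the action features.

rng = np.random.default_rng(0)

d_hidden, d_action, d_torque = 8, 4, 7   # e.g. 7 joint torques (assumed)
W_act = rng.normal(size=(d_hidden, d_action)) * 0.1   # action head
b_act = np.zeros(d_action)
W_tau = rng.normal(size=(d_hidden, d_torque)) * 0.1   # torque adapter head
b_tau = np.zeros(d_torque)

def joint_loss(h, action_gt, torque_gt, lam=0.1):
    """MSE action loss plus a weighted auxiliary torque-prediction loss."""
    action_pred = h @ W_act + b_act
    torque_pred = h @ W_tau + b_tau
    l_action = np.mean((action_pred - action_gt) ** 2)
    l_torque = np.mean((torque_pred - torque_gt) ** 2)
    return l_action + lam * l_torque

# Dummy batch of decoder states and targets.
h = rng.normal(size=(2, d_hidden))
action_gt = rng.normal(size=(2, d_action))
torque_gt = rng.normal(size=(2, d_torque))
print(joint_loss(h, action_gt, torque_gt))
```

With `lam=0` the objective reduces to the standard action loss, so the auxiliary term can be annealed or ablated without changing the architecture.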