🤖 AI Summary
Existing vision-language-action (VLA) models struggle to model the torque signals inherent in physical interaction, leaving them without closed-loop force control for contact-rich robotic manipulation tasks. To address this, we propose TorqueAdapter: a lightweight torque adapter embedded within the VLA decoder, coupled with torque prediction as an auxiliary supervision objective that encourages the model to learn physically grounded internal representations. Unlike conventional encoder-side multimodal fusion, this decoder-centric design sidesteps the challenging problem of cross-modal feature alignment. Inspired by joint prediction-and-planning paradigms in autonomous driving, we unify torque-aware perception and action generation within a single modeling framework. Evaluated on multiple contact-rich manipulation benchmarks—including TORQUE-Bench—TorqueAdapter significantly improves task success rates and force-control robustness. Our results demonstrate that explicit torque modeling yields a critical gain in the physical understanding capability of VLA models.
📝 Abstract
Many robotic manipulation tasks require sensing and responding to force signals such as torque, both to assess whether a task has been completed successfully and to enable closed-loop control. However, current Vision-Language-Action (VLA) models lack the ability to integrate such subtle physical feedback. In this work, we explore torque-aware VLA models, aiming to bridge this gap by systematically studying the design space for incorporating torque signals into existing VLA architectures. We identify and evaluate several strategies, leading to three key findings. First, introducing torque adapters into the decoder consistently outperforms inserting them into the encoder. Third, inspired by joint prediction-and-planning paradigms in autonomous driving, we propose predicting torque as an auxiliary output, which further improves performance. This strategy encourages the model to build a physically grounded internal representation of interaction dynamics. Extensive quantitative and qualitative experiments across contact-rich manipulation benchmarks validate our findings.
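To make the auxiliary-supervision idea concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of how a shared decoder state could feed both an action head and an auxiliary torque-prediction head, with the two losses combined during training. All names, dimensions, and the weighting scheme (`lam`) are illustrative assumptions; the point is only that torque supervision backpropagates through the same features used for action generation.

```python
import numpy as np

# Hypothetical sketch, not the paper's implementation: a decoder hidden
# state h is mapped to an action and, via a lightweight auxiliary head,
# to a torque prediction. Both losses share gradients through h, so
# torque supervision shapes the action features.

rng = np.random.default_rng(0)

d_hidden, d_action, d_torque = 8, 4, 7   # e.g. 7 joint torques (assumed)
W_act = rng.normal(size=(d_hidden, d_action)) * 0.1   # action head
b_act = np.zeros(d_action)
W_tau = rng.normal(size=(d_hidden, d_torque)) * 0.1   # torque adapter head
b_tau = np.zeros(d_torque)

def joint_loss(h, action_gt, torque_gt, lam=0.1):
    """MSE action loss plus a weighted auxiliary torque-prediction loss."""
    action_pred = h @ W_act + b_act
    torque_pred = h @ W_tau + b_tau
    l_action = np.mean((action_pred - action_gt) ** 2)
    l_torque = np.mean((torque_pred - torque_gt) ** 2)
    return l_action + lam * l_torque

# Dummy batch of decoder states and targets.
h = rng.normal(size=(2, d_hidden))
action_gt = rng.normal(size=(2, d_action))
torque_gt = rng.normal(size=(2, d_torque))
print(joint_loss(h, action_gt, torque_gt))
```

With `lam=0` the objective reduces to the standard action loss, so the auxiliary term can be annealed or ablated without changing the architecture.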