🤖 AI Summary
This work proposes a task-driven active perception framework that overcomes the limitations of traditional Real2Sim approaches, which rely on manual measurements or fixed exploration protocols and struggle to adapt to diverse tasks and user intents. By integrating vision-language models (VLMs) with behavior trees for the first time, the method leverages multimodal reasoning to translate high-level user instructions into executable compliant interaction policies. Deployed on a Franka robotic arm, the system autonomously explores objects to estimate key physical parameters—such as mass, friction coefficient, and surface height—without requiring predefined templates or expert intervention. The approach is interpretable, intent-driven, and demonstrates strong robustness and generalization in challenging scenarios involving occlusions or absent prior models, thereby enabling the construction of high-fidelity simulation environments.
📝 Abstract
Constructing an accurate simulation model of real-world environments requires reliable estimation of physical parameters such as mass, geometry, friction, and contact surfaces. Traditional real-to-simulation (Real2Sim) pipelines rely on manual measurements or fixed, pre-programmed exploration routines, which limit their adaptability to varying tasks and user intents. This paper presents a Real2Sim framework that autonomously generates and executes Behavior Trees for task-specific physical interactions to acquire only the parameters required for a given simulation objective, without relying on pre-defined task templates or expert-designed exploration routines. Given a high-level user request, an incomplete simulation description, and an RGB observation of the scene, a vision-language model performs multi-modal reasoning to identify relevant objects, infer required physical parameters, and generate a structured Behavior Tree composed of elementary robotic actions. The resulting behavior is executed on a torque-controlled Franka Emika Panda, enabling compliant, contact-rich interactions for parameter estimation. The acquired measurements are used to automatically construct a physics-aware simulation. Experimental results on the real manipulator demonstrate estimation of object mass, surface height, and friction-related quantities across multiple scenarios, including occluded objects and incomplete prior models. The proposed approach enables interpretable, intent-driven, and autonomous Real2Sim pipelines, bridging high-level reasoning with physically grounded robotic interaction.
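To make the pipeline concrete, here is a minimal sketch of the kind of structured Behavior Tree the VLM might emit: a Sequence of elementary actions, each writing an estimated parameter to a shared blackboard. All class, action, and parameter names below (`Blackboard`, `lift_and_weigh`, `push_and_measure`, etc.) are hypothetical illustrations, not the paper's actual interface; a real deployment would replace the stub bodies with compliant, force-controlled motions on the manipulator.

```python
# Hypothetical sketch of a VLM-generated Behavior Tree for parameter
# estimation; names and values are illustrative, not from the paper.
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared store for physical parameters estimated during exploration."""
    params: dict = field(default_factory=dict)


class Action:
    """Leaf node wrapping one elementary robotic action."""
    def __init__(self, name, fn):
        self.name, self.fn = name, fn

    def tick(self, bb):
        return self.fn(bb)  # returns True on success


class Sequence:
    """Composite node: succeeds only if every child succeeds in order."""
    def __init__(self, children):
        self.children = children

    def tick(self, bb):
        return all(child.tick(bb) for child in self.children)


# Stub elementary actions; a real system would command the torque-controlled
# arm and compute the estimate from sensed wrenches.
def lift_and_weigh(bb):
    bb.params["mass_kg"] = 0.42        # stand-in for a wrist-torque estimate
    return True


def push_and_measure(bb):
    bb.params["friction_mu"] = 0.31    # stand-in for a force-ratio estimate
    return True


# A tree the VLM could generate for "estimate the mug's mass and friction".
tree = Sequence([
    Action("lift_and_weigh", lift_and_weigh),
    Action("push_and_measure", push_and_measure),
])
bb = Blackboard()
assert tree.tick(bb)
print(bb.params)  # measurements then used to instantiate the simulation
```

The Sequence/Action decomposition mirrors why Behavior Trees suit this setting: the VLM only has to compose a small vocabulary of interpretable leaf actions, and the resulting tree can be inspected before execution.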