Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision

πŸ“… 2026-04-03
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing end-to-end visuomotor policies are difficult to deploy in safety-critical robotic tasks because they lack interpretability and modularity. This work proposes a framework that integrates vision-language models with neuro-symbolic methods to automatically generate structured behavior-tree policies from multimodal inputs. By leveraging domain randomization and synthetically generated multimodal data, the approach enables large-scale training without manual annotation. It presents the first deep integration of vision-language models with behavior trees, producing policies that are interpretable, modular, and transferable through synthetic neuro-symbolic supervised learning. Experiments demonstrate that policies trained exclusively on synthetic data can execute complex tasks on two real-world robotic-arm platforms, validating the method's cross-domain transferability and practical utility.
πŸ“ Abstract
Vision-language models (VLMs) have recently demonstrated strong capabilities in mapping multimodal observations to robot behaviors. However, most current approaches rely on end-to-end visuomotor policies that remain opaque and difficult to analyze, limiting their use in safety-critical robotic applications. In contrast, classical robotic systems often rely on structured policy representations that provide interpretability, modularity, and reactive execution. This work investigates how foundation models can be specialized to generate structured robot policies grounded in multimodal perception, bridging high-dimensional learning and symbolic control. We propose a neuro-symbolic approach in which a VLM synthesizes executable Behavior Tree policies from visual observations, natural language instructions, and structured system specifications. To enable scalable supervision without manual annotation, we introduce an automated pipeline that generates a synthetic multimodal dataset of domain-randomized scenes paired with instruction-policy examples produced by a foundation model. Real-world experiments on two robotic manipulators show that structured policies learned entirely from synthetic supervision transfer successfully to physical systems. The results indicate that foundation models can be adapted to produce interpretable and structured robot policies, providing an alternative to opaque end-to-end approaches for multimodal robot decision making.
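The abstract describes a VLM emitting executable Behavior Tree policies rather than raw motor commands. The paper does not publish its policy representation here, but the general idea can be sketched with a minimal, self-contained Behavior Tree interpreter. Everything below (the `Sequence`/`Fallback` node semantics, the blackboard dictionary, and the example pick-and-place policy) is an illustrative assumption, not the authors' implementation:

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Condition:
    """Leaf node: evaluates a predicate against the shared blackboard."""
    def __init__(self, predicate):
        self.predicate = predicate
    def tick(self, blackboard):
        return Status.SUCCESS if self.predicate(blackboard) else Status.FAILURE

class Action:
    """Leaf node: applies an effect to the blackboard and reports success."""
    def __init__(self, effect):
        self.effect = effect
    def tick(self, blackboard):
        self.effect(blackboard)
        return Status.SUCCESS

class Sequence:
    """Ticks children left to right; stops at the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Ticks children left to right; stops at the first non-FAILURE child."""
    def __init__(self, children):
        self.children = children
    def tick(self, blackboard):
        for child in self.children:
            status = child.tick(blackboard)
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

# Hypothetical pick-and-place policy of the kind a VLM might synthesize
# from an instruction: skip if the object is already placed, else grasp
# the object and then place it.
policy = Fallback([
    Condition(lambda bb: bb.get("object_placed", False)),
    Sequence([
        Action(lambda bb: bb.__setitem__("grasped", True)),
        Action(lambda bb: bb.update(object_placed=True)),
    ]),
])

blackboard = {}
result = policy.tick(blackboard)  # ticking mutates the blackboard
```

The reactive execution mentioned in the abstract comes from this tick semantics: the tree is re-ticked at each control step, so a condition that flips (e.g. the object slips) automatically re-routes execution without replanning.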
Problem

Research questions and friction points this paper is trying to address.

structured robot policies
vision-language models
neuro-symbolic supervision
interpretable control
multimodal robot decision making
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic
behavior trees
vision-language models
synthetic supervision
structured robot policies
πŸ”Ž Similar Papers
No similar papers found.
Alessandro Adami
University of Padova, Dept. of Information Engineering, Italy.
Tommaso Tubaldo
Fraunhofer Italia Research, 39100, Bozen, Italy.
Marco Todescato
Fraunhofer Italia Research, 39100, Bozen, Italy.
Ruggero Carli
Associate Professor at University of Padova
Control Theory
Pietro Falco
University of Padova, Italy
Robotics · Machine Learning · Control Theory