TriRelVLA: Triadic Relational Structure for Generalizable Embodied Manipulation

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language-action (VLA) models generalize poorly to unseen scenes, objects, and tasks because their visual representations implicitly entangle appearance, background, and spatial layout. This work proposes TriRelVLA, a framework that treats the object–hand–task triadic relationship as the core structural prior for embodied manipulation. It fuses multimodal inputs into relational representations, builds a task-anchored relational graph whose nodes are formed by task-guided cross-attention, and models their interactions with a relation-aware graph transformer. A relational bottleneck then compresses this structure and injects it into a large language model for action generation, which decouples action prediction from visual appearance. After fine-tuning on a newly collected real-robot dataset, TriRelVLA is reported to generalize better in cross-scene, cross-object, and cross-task settings.
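To make the idea of task-guided cross-attention concrete, here is a minimal sketch in plain Python. It assumes a single attention head, no learned projections, and toy two-dimensional features. The names `task_guided_node`, `task_query`, `object_tokens`, and `hand_tokens` are hypothetical and do not come from the paper, whose actual layer design is not given on this page.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def task_guided_node(task_query, tokens):
    """Pool feature tokens into one node vector.

    Each token is weighted by its scaled dot-product similarity to the
    task query, so task-relevant tokens dominate the resulting node.
    (Assumption: single head, identity key/value projections.)
    """
    scale = math.sqrt(len(task_query))
    weights = softmax([dot(task_query, tok) / scale for tok in tokens])
    dim = len(tokens[0])
    node = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return node, weights

# Toy inputs (hypothetical values, chosen only for illustration).
task_query = [1.0, 0.0]
object_tokens = [[4.0, 0.0], [0.0, 4.0]]  # first token aligns with the task
hand_tokens = [[1.0, 1.0], [1.0, 1.0]]    # identical tokens get equal weight

object_node, object_weights = task_guided_node(task_query, object_tokens)
hand_node, hand_weights = task_guided_node(task_query, hand_tokens)
```

In this toy case the object token aligned with the task query receives most of the attention weight, while the two identical hand tokens are averaged equally. A real model would add learned query, key, and value projections and multiple heads.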

📝 Abstract

Vision-language-action (VLA) models perform well on robotic tasks seen during training but struggle to generalize to unseen scenes and objects. A key limitation lies in their implicit visual representations, which entangle object appearance, background, and scene layout. This makes policies sensitive to visual variations. Prior work improves transferability through structured intermediate representations that objectify visual content. However, these representations mainly capture scene semantics instead of action-relevant relations. As a result, action prediction remains tied to appearance statistics. We observe that manipulation actions depend on the object-hand-task relational structure, which governs interactions among task requirements, robot states, and object properties. Based on this observation, we propose TriRelVLA, a triadic relational VLA framework for generalizable embodied manipulation. Our approach consists of three components: 1) We construct explicit object-hand-task triadic representations from multimodal inputs as relational primitives. 2) We build a task-grounded relational graph. Task-guided cross-attention forms the nodes, and a relation-aware graph transformer models interactions among them. 3) We perform relation-conditioned action generation. The relational structure is compressed into a bottleneck space and projected into the LLM for action prediction. This triadic relational bottleneck reduces reliance on appearance statistics and enables transfer across scenes, objects, and task compositions. We further introduce a real-world robotic dataset for fine-tuning. Experiments show strong performance on fine-tuned tasks and clear gains in cross-scene, cross-object, and cross-task generalization.
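Components 2 and 3 can be illustrated with a small sketch: one round of relation-aware attention among the object, hand, and task nodes, followed by a linear bottleneck. This is one plausible reading under stated assumptions, not the authors' implementation. The per-relation scalar bias, the residual update, and the names `relation_aware_attention`, `bottleneck`, `RELATION_BIAS`, and `projection` are all hypothetical, and the numeric values are toy data.

```python
import math

# Hypothetical bias per undirected relation type (stands in for a learned value).
RELATION_BIAS = {
    frozenset(("object", "hand")): 0.5,
    frozenset(("hand", "task")): 0.0,
    frozenset(("object", "task")): 0.0,
}

def relation_aware_attention(features, relation_bias):
    """One attention round over the triadic graph.

    The score between two nodes is their scaled dot product plus a bias
    that depends on the relation type; each node then adds the weighted
    sum of its neighbours' features to its own (residual update).
    """
    updated = {}
    for src, src_vec in features.items():
        others = [name for name in features if name != src]
        scale = math.sqrt(len(src_vec))
        scores = [
            sum(a * b for a, b in zip(src_vec, features[dst])) / scale
            + relation_bias[frozenset((src, dst))]
            for dst in others
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        message = [
            sum(w * features[dst][i] for w, dst in zip(weights, others))
            for i in range(len(src_vec))
        ]
        updated[src] = [x + m for x, m in zip(src_vec, message)]
    return updated

def bottleneck(features, projection):
    """Compress the concatenated node features with a linear map."""
    flat = [x for name in ("object", "hand", "task") for x in features[name]]
    return [sum(w * x for w, x in zip(row, flat)) for row in projection]

# Toy node features (hypothetical), two dimensions per node.
features = {"object": [0.0, 0.0], "hand": [1.0, 0.0], "task": [0.0, 1.0]}
updated = relation_aware_attention(features, RELATION_BIAS)

# A 2x6 projection compresses the six concatenated values into two.
projection = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
]
z = bottleneck(updated, projection)
```

The vector `z` stands in for the compressed relational code that the abstract says is projected into the LLM. Because only node features and relation types enter the computation, raw pixel appearance has no direct path to the action head in this sketch, which is the intuition behind the bottleneck.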

Problem

Research questions and friction points this paper is trying to address.

generalizable manipulation

vision-language-action models

relational structure

embodied AI

visual generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

triadic relational structure

vision-language-action models

relational graph transformer