🤖 AI Summary
This work addresses the challenge of generalizing dexterous skill learning for bimanual robots from human demonstration videos, bypassing low-level trajectory imitation to enhance adaptability across diverse objects, spatial configurations, and robotic arm geometries. We propose GF-VLA, a framework that (1) leverages Shannon entropy to identify salient hand–object interactions and construct temporal scene graphs; (2) employs a language-conditioned Transformer to generate interpretable behavior trees and executable motion primitives; and (3) introduces a cross-hand selection strategy to optimize coordinated gripper allocation. The method integrates information-theoretic feature extraction, structured scene modeling, language-guided hierarchical behavior generation, and bimanual closed-loop control. Evaluated on four structured dual-arm block assembly tasks, GF-VLA achieves over 95% scene-graph accuracy, 93% subtask segmentation precision, 94% grasp success rate, 89% placement accuracy, and 90% overall task completion rate.
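The entropy-based saliency cue in step (1) can be illustrated with a minimal sketch. The helper names (`shannon_entropy`, `rank_interactions`) and the discrete contact-state labels are hypothetical, not the paper's actual feature set; the sketch only shows the information-theoretic idea of ranking hand–object tracks by the diversity of their interaction states:

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a discrete label sequence."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_interactions(tracks):
    """Rank hand-object tracks by the entropy of their contact-state
    sequence. Higher entropy means richer interaction dynamics, which
    this sketch treats as a proxy for task relevance.

    tracks: dict mapping (hand, object) pairs to per-frame contact
    states, e.g. 'free', 'approach', 'grasp', 'move', 'release'.
    """
    scores = {pair: shannon_entropy(states) for pair, states in tracks.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical tracks: an active manipulation vs. a static contact.
tracks = {
    ("right_hand", "red_block"): ["free", "approach", "grasp", "move", "release"],
    ("left_hand", "table"): ["free"] * 5,
}
ranking = rank_interactions(tracks)
# the red_block interaction outranks the static table contact
```

A static track collapses to zero entropy, so salient manipulations naturally rise to the top of the ranking.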
📝 Abstract
Teaching robots dexterous skills from human videos remains challenging due to the reliance on low-level trajectory imitation, which fails to generalize across object types, spatial layouts, and manipulator configurations. We propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables dual-arm robotic systems to perform task-level reasoning and execution directly from RGB-D human demonstrations. GF-VLA first extracts Shannon-information-based cues to identify the hands and objects with the highest task relevance, then encodes these cues into temporally ordered scene graphs that capture both hand–object and object–object interactions. These graphs are fused with a language-conditioned transformer that generates hierarchical behavior trees and interpretable Cartesian motion commands. To improve execution efficiency in bimanual settings, we further introduce a cross-hand selection policy that infers optimal gripper assignment without explicit geometric reasoning. We evaluate GF-VLA on four structured dual-arm block assembly tasks involving symbolic shape construction and spatial generalization. Experimental results show that the information-theoretic scene representation achieves over 95% scene-graph accuracy and 93% subtask segmentation precision, supporting the LLM planner in generating reliable and human-readable task policies. When executed by the dual-arm robot, these policies yield 94% grasp success, 89% placement accuracy, and 90% overall task success across stacking, letter-building, and geometric reconfiguration scenarios, demonstrating strong generalization and robustness across diverse spatial and semantic variations.
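The path from temporally ordered scene graphs to executable primitives can be sketched as follows. This is a simplified stand-in, not the paper's architecture: the `SceneGraph` snapshots, the relation vocabulary (`holds`, `on`), and the flat primitive list (in place of a full behavior tree) are all illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    """One temporal snapshot: edges are (subject, relation, object) triples
    over hands and objects."""
    t: int
    edges: list

def graphs_to_primitives(graphs):
    """Translate relation changes between consecutive snapshots into
    motion primitives -- a flat sketch of behavior-tree generation.
    A newly appearing 'holds' edge maps to a grasp, a new 'on' edge
    to a placement."""
    primitives = []
    for prev, curr in zip(graphs, graphs[1:]):
        new_edges = set(curr.edges) - set(prev.edges)
        for subj, rel, obj in sorted(new_edges):
            if rel == "holds":
                primitives.append(("grasp", subj, obj))
            elif rel == "on":
                primitives.append(("place", subj, obj))
    return primitives

# Hypothetical demonstration: pick up the red block, stack it on the blue one.
graphs = [
    SceneGraph(t=0, edges=[("red_block", "on", "table")]),
    SceneGraph(t=1, edges=[("right_hand", "holds", "red_block")]),
    SceneGraph(t=2, edges=[("red_block", "on", "blue_block")]),
]
plan = graphs_to_primitives(graphs)
# [('grasp', 'right_hand', 'red_block'), ('place', 'red_block', 'blue_block')]
```

In the actual framework, such primitives would be organized hierarchically and conditioned on language; the sketch only shows how edge changes in the temporal scene graphs ground into discrete, interpretable actions.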