GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

📅 2026-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models struggle to effectively model functional spatial relationships in complex scenes for procedural planning, often overlooking the structured semantics embedded in multimodal inputs. To address this limitation, this work proposes GaLa, a novel framework that introduces hypergraph representations by treating object instances as nodes and aggregating region-level hyperedges based on attribute and functional semantics, thereby explicitly capturing implicit inter-object relationships and hierarchical functional structures. The framework employs a TriView HyperGraph encoder that performs contrastive learning from three complementary perspectives—nodes, regions, and their associations—to enhance structured semantic understanding. Experimental results demonstrate that GaLa significantly outperforms existing methods on the ActPlan1K and ALFRED benchmarks, achieving notable improvements in execution success rate, LCS score, and planning accuracy.
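The hypergraph idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the object list, the attribute tags, the `region:`/`attr:` naming, and the `build_hypergraph` helper with its `min_size` parameter are all hypothetical, chosen only to show how object instances can act as nodes while shared attributes and functional regions define hyperedges.

```python
from collections import defaultdict

# Hypothetical detections: (object name, attribute tags, functional region).
# None of these values come from the paper; they are illustrative only.
objects = [
    ("knife", {"sharp", "metal"}, "food_prep"),
    ("cutting_board", {"flat", "wooden"}, "food_prep"),
    ("pan", {"metal"}, "cooking"),
    ("stove", {"hot"}, "cooking"),
]


def build_hypergraph(objects, min_size=2):
    """Group objects into hyperedges by shared attribute or functional region.

    Returns (node_names, hyperedge_names, incidence), where
    incidence[i][j] == 1 if node i belongs to hyperedge j.
    Hyperedges with fewer than `min_size` members are dropped.
    """
    groups = defaultdict(set)
    for idx, (_, attrs, region) in enumerate(objects):
        groups[f"region:{region}"].add(idx)
        for attr in attrs:
            groups[f"attr:{attr}"].add(idx)

    # Sort hyperedge names so the column order is deterministic.
    edge_names = sorted(name for name, members in groups.items()
                        if len(members) >= min_size)
    node_names = [name for name, _, _ in objects]
    incidence = [
        [1 if i in groups[edge] else 0 for edge in edge_names]
        for i in range(len(objects))
    ]
    return node_names, edge_names, incidence


nodes, edges, H = build_hypergraph(objects)
for name, row in zip(nodes, H):
    print(f"{name:14s} {row}")
```

Unlike an ordinary graph edge, each column of the incidence matrix can connect any number of objects, which is what lets a single hyperedge stand for a whole functional region.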

📝 Abstract

Implicit spatial relations and deep semantic structures encoded in object attributes are crucial for procedural planning in embodied AI systems. However, existing approaches often over-rely on the reasoning capabilities of vision-language models (VLMs) themselves, while overlooking the rich structured semantic information that can be mined from multimodal inputs. As a result, models struggle to effectively understand functional spatial relationships in complex scenes. To fully exploit implicit spatial relations and deep semantic structures in multimodal data, we propose GaLa, a vision-language framework for multimodal procedural planning. GaLa introduces a hypergraph-based representation, where object instances in the image are modeled as nodes, and region-level hyperedges are constructed by aggregating objects according to their attributes and functional semantics. This design explicitly captures implicit semantic relations among objects as well as the hierarchical organization of functional regions. Furthermore, we design a TriView HyperGraph Encoder that enforces semantic consistency across the node view, the area view, and the node-area association view via contrastive learning, enabling hypergraph semantics to be more effectively injected into downstream VLM reasoning. Extensive experiments on the ActPlan1K and ALFRED benchmarks demonstrate that GaLa significantly outperforms existing methods in terms of execution success rate, LCS, and planning correctness.
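The abstract does not specify the contrastive objective, so the following is only a sketch of one common choice: a one-directional InfoNCE loss averaged over the three view pairs. It assumes, hypothetically, that each view has been pooled into one embedding per scene and that embeddings of the same scene across views form the positive pairs. The function names `info_nce` and `triview_loss` and the temperature value are illustrative, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """One-directional InfoNCE: row i of view_a should match row i of view_b."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / n


def triview_loss(node_view, area_view, assoc_view, temperature=0.1):
    """Average InfoNCE over the three view pairs (an assumed formulation)."""
    pairs = [
        (node_view, area_view),
        (node_view, assoc_view),
        (area_view, assoc_view),
    ]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / len(pairs)


# Toy embeddings for two scenes; values are made up for illustration.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_consistent = triview_loss(aligned, aligned, aligned)
loss_inconsistent = triview_loss(aligned, swapped, aligned)
print(loss_consistent, loss_inconsistent)
```

In this toy setup the loss is small when all three views agree on which scene is which, and grows when one view is misaligned, which is the sense in which a contrastive term can enforce cross-view consistency.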

Problem

Research questions and friction points this paper is trying to address.

procedural planning

spatial relations

semantic structures

multimodal inputs

embodied AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

hypergraph

vision-language models

procedural planning