ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

πŸ“… 2025-10-11
πŸ€– AI Summary
Current multimodal large language models (MLLMs) lack fine-grained, structured alignment between pixel-level visual features and textual semantics, which limits their scene understanding and interactive capabilities in embodied settings. To address this, the paper proposes ESCA, a framework for contextualizing embodied agents, built around SGClip: an open-domain, annotation-free video scene graph generation model. Based on the CLIP architecture and a neurosymbolic learning framework, SGClip is trained with model-driven self-supervision on video-caption pairs to achieve spatiotemporally structured perception and semantic alignment, and it supports both prompt-driven inference and downstream task fine-tuning. SGClip achieves state-of-the-art results on multiple scene graph generation and action localization benchmarks, and experiments show that ESCA significantly reduces perception errors, enabling open-source MLLMs to surpass closed-source baselines in two embodied environments.

πŸ“ Abstract
Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, current training pipelines primarily rely on high-level vision-sound-text pairs and lack fine-grained, structured alignment between pixel-level visual content and textual semantics. To overcome this challenge, we propose ESCA, a new framework for contextualizing embodied agents through structured spatial-temporal understanding. At its core is SGClip, a novel CLIP-based, open-domain, and promptable model for generating scene graphs. SGClip is trained on 87K+ open-domain videos via a neurosymbolic learning pipeline, which harnesses model-driven self-supervision from video-caption pairs and structured reasoning, thereby eliminating the need for human-labeled scene graph annotations. We demonstrate that SGClip supports both prompt-based inference and task-specific fine-tuning, excelling in scene graph generation and action localization benchmarks. ESCA with SGClip consistently improves both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, it significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of fine-grained visual-textual alignment in MLLMs
Proposes scene graph generation for embodied agent contextualization
Reduces perception errors in embodied agents through structured understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

SGClip generates scene graphs without human annotations
ESCA framework uses structured spatial-temporal understanding
Neurosymbolic learning trains model on video-caption pairs
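The prompt-based inference idea behind these bullets can be illustrated with a toy CLIP-style scorer. This is a hedged sketch under stated assumptions, not the paper's implementation: the embeddings below are random stand-ins for learned region and text features, and `score_triplets` is a hypothetical helper; the actual SGClip learns its video-region and relation-prompt embeddings through the neurosymbolic pipeline described above.

```python
# Toy sketch (not the authors' code): scoring candidate scene-graph
# triplets against a visual region embedding, CLIP-style. Random vectors
# stand in for learned image/text features.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_triplets(region_emb, text_embs):
    """Rank candidate (subject, predicate, object) prompts against a
    region embedding by cosine similarity, highest first."""
    scores = {prompt: cosine(region_emb, emb) for prompt, emb in text_embs.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

rng = np.random.default_rng(0)
dim = 8
# Hypothetical embeddings for one detected region and a few relation prompts.
region = rng.normal(size=dim)
prompts = ["person holding cup", "person next to table", "cup on table"]
text_embs = {p: rng.normal(size=dim) for p in prompts}

ranked = score_triplets(region, text_embs)
print(ranked[0][0])  # best-matching relation prompt for this region
```

In SGClip the top-ranked triplets over all region pairs and frames would form the scene graph handed to the MLLM; the sketch only shows the ranking step for a single region.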
πŸ‘₯ Authors
Jiani Huang, The Hong Kong Polytechnic University (LLM, Recommender System)
Amish Sethi, University of Pennsylvania
Matthew Kuo, University of Pennsylvania
Mayank Keoliya, University of Pennsylvania
Neelay Velingker, University of Pennsylvania
JungHo Jung, University of Pennsylvania
Ser-Nam Lim, University of Central Florida
Ziyang Li, Johns Hopkins University (Programming Languages, Machine Learning)
Mayur Naik, Misra Family Professor of Computer Science, University of Pennsylvania (Programming Languages, Software Engineering, Machine Learning)