LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision

📅 2023-04-15
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the high annotation cost of video spatio-temporal scene graphs (STSGs) by proposing a weakly supervised learning framework that relies solely on video-caption pairs. Methodologically, it introduces a differentiable symbolic reasoning module, jointly optimized with contrastive, temporal, and semantic losses, to generate logic-guided STSGs; it further leverages large language models (LLMs) to automatically distill spatio-temporal logic specifications from captions, forming a neuro-symbolic architecture. Contributions include: (1) the first end-to-end weakly supervised paradigm for STSG generation that requires no manual STSG annotations; (2) an LLM-driven mechanism for automatically inducing spatio-temporal logic specifications; and (3) state-of-the-art performance on 20BN-Something-Something V2, MUGEN, and OpenPVSG, demonstrating substantial improvements in fine-grained video semantic representation.
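The summary names a joint objective over contrastive, temporal, and semantic losses. Below is a minimal sketch of how such a combined objective could be wired up in PyTorch; the weighting scheme and the concrete forms of the temporal and semantic terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def combined_loss(video_emb, spec_emb, frame_logits, w_c=1.0, w_t=0.5, w_s=0.5):
    """Illustrative combination of the three loss terms named above.

    video_emb:    (B, D) embeddings of the predicted STSGs, one per video
    spec_emb:     (B, D) embeddings of the matching logic specifications
    frame_logits: (B, T, P) per-frame predicate scores from the perception model
    """
    # Contrastive term: matched video/spec pairs lie on the diagonal
    # of the cosine-similarity matrix (an InfoNCE-style objective).
    video_emb = F.normalize(video_emb, dim=-1)
    spec_emb = F.normalize(spec_emb, dim=-1)
    logits = video_emb @ spec_emb.t()                       # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_contrastive = F.cross_entropy(logits, targets)

    # Temporal term (assumed form): penalize abrupt changes in
    # predicate scores between adjacent frames.
    loss_temporal = (frame_logits[:, 1:] - frame_logits[:, :-1]).abs().mean()

    # Semantic term (assumed form): encourage confident, low-entropy
    # per-frame predicate distributions.
    probs = frame_logits.softmax(dim=-1)
    loss_semantic = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()

    return w_c * loss_contrastive + w_t * loss_temporal + w_s * loss_semantic
```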
📝 Abstract
We propose LASER, a neuro-symbolic approach to learn semantic video representations that capture rich spatial and temporal properties in video data by leveraging high-level logic specifications. In particular, we formulate the problem in terms of alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. It effectively and efficiently trains low-level perception models to extract a fine-grained video representation in the form of a spatio-temporal scene graph that conforms to the desired high-level specification. To practically reduce the manual effort of obtaining ground truth labels, we derive logic specifications from captions by employing a large language model with a generic prompting template. In doing so, we explore a novel methodology that weakly supervises the learning of spatio-temporal scene graphs with widely accessible video-caption data. We evaluate our method on three datasets with rich spatial and temporal specifications: 20BN-Something-Something, MUGEN, and OpenPVSG. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
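The abstract describes deriving logic specifications from captions with an LLM and a generic prompting template. The sketch below shows what such a caption-to-specification step might look like; the prompt wording, the specification syntax, and the `llm.complete` client interface are all hypothetical placeholders, not the paper's actual template.

```python
# The prompt wording, spec syntax, and `llm.complete` interface below are
# hypothetical placeholders, not the paper's actual template.
PROMPT_TEMPLATE = """\
Translate the video caption into a spatio-temporal logic specification.
Use unary predicates for objects/actions and binary predicates for spatial
relations; combine them with `and`, `then` (sequencing), and `until`.

Caption: {caption}
Specification:"""

def caption_to_spec(caption: str, llm) -> str:
    """Ask any completion-style LLM client for a logic specification."""
    return llm.complete(PROMPT_TEMPLATE.format(caption=caption)).strip()

# For example, "pushing a cup off the table" might come back as:
#   touching(hand, cup) then (moving(cup) until not on(cup, table))
```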
Problem

Research questions and friction points this paper is trying to address.

Learning spatio-temporal scene graphs without annotated videos
Using video captions as weak supervision for training
Aligning predicted graphs with logical specifications from captions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses large language models to extract logic specifications from captions
Trains STSG generators using video captions as weak supervision
Employs a differentiable symbolic reasoner to align predictions with specifications (see the sketch after this list)
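As a toy illustration of the differentiable alignment idea in the last bullet, the sketch below soft-evaluates a two-phase temporal specification `p then q` over per-frame predicate probabilities, using max as a soft existential quantifier and product as a soft conjunction. These relaxation choices are assumptions made for illustration; LASER's actual probabilistic semantics may differ.

```python
import torch

def soft_then(p_probs: torch.Tensor, q_probs: torch.Tensor) -> torch.Tensor:
    """Soft score that predicate p holds at some frame i and q at some j > i.

    p_probs, q_probs: (T,) per-frame probabilities from the perception model.
    Uses max as a soft existential and product as a soft conjunction, so the
    score is differentiable w.r.t. the frame probabilities.
    """
    T = p_probs.size(0)
    # pair[i, j] = P(p at frame i) * P(q at frame j)
    pair = p_probs.unsqueeze(1) * q_probs.unsqueeze(0)      # (T, T)
    mask = torch.triu(torch.ones(T, T), diagonal=1)         # keep only j > i
    return (pair * mask).max()

# Example: spec "holding(cup) then on(cup, table)"
p = torch.tensor([0.9, 0.8, 0.2, 0.1])   # holding(cup) per frame
q = torch.tensor([0.0, 0.1, 0.7, 0.9])   # on(cup, table) per frame
alignment_score = soft_then(p, q)        # high when p precedes q
```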
👥 Authors
Jiani Huang · The Hong Kong Polytechnic University · LLM, Recommender System
Ziyang Li · Johns Hopkins University · Programming Languages, Machine Learning
David Jacobs
M. Naik · University of Pennsylvania
S. Lim · University of Central Florida