🤖 AI Summary
This work addresses the high annotation cost of video spatio-temporal scene graphs (STSGs) by proposing a weakly supervised learning framework that relies solely on video-caption pairs. Methodologically, it introduces the first differentiable symbolic reasoning module jointly optimized with contrastive, temporal, and semantic losses to generate logic-guided STSGs; additionally, it leverages large language models (LLMs) to automatically distill spatio-temporal logical rules, forming a neuro-symbolic architecture. Contributions include: (1) the first end-to-end weakly supervised paradigm for STSG generation without manual STSG annotations; (2) an LLM-driven mechanism for automatic spatio-temporal logical rule induction; and (3) state-of-the-art performance on 20BN-Something-Something, MUGEN, and OpenPVSG, demonstrating substantial improvements in fine-grained video semantic representation.
📝 Abstract
We propose LASER, a neuro-symbolic approach to learn semantic video representations that capture rich spatial and temporal properties in video data by leveraging high-level logic specifications. In particular, we formulate the problem in terms of alignment between raw videos and spatio-temporal logic specifications. The alignment algorithm leverages a differentiable symbolic reasoner and a combination of contrastive, temporal, and semantic losses. It effectively and efficiently trains low-level perception models to extract a fine-grained video representation in the form of a spatio-temporal scene graph that conforms to the desired high-level specification. To practically reduce the manual effort of obtaining ground-truth labels, we derive logic specifications from captions by employing a large language model with a generic prompting template. In doing so, we explore a novel methodology that weakly supervises the learning of spatio-temporal scene graphs with widely accessible video-caption data. We evaluate our method on three datasets with rich spatial and temporal specifications: 20BN-Something-Something, MUGEN, and OpenPVSG. We demonstrate that our method learns better fine-grained video semantics than existing baselines.
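The abstract describes a differentiable symbolic reasoner that scores how well a video's per-frame predictions align with a spatio-temporal logic specification. As a minimal sketch of the general idea (not the paper's actual implementation), the snippet below evaluates soft temporal operators such as "eventually", "always", and "until" over per-frame predicate probabilities produced by a perception model; the operator names, the noisy-OR relaxation, and the independence assumption are all our own illustrative choices:

```python
# Hedged sketch: soft (probabilistic) semantics for temporal-logic operators
# applied to per-frame predicate probabilities, e.g. the probability that
# "holding(hand, cup)" is detected in each frame. Differentiable in spirit:
# every operation is a product/sum, so it would backpropagate if the inputs
# came from a neural perception model.

def prob_or(probs):
    """Noisy-OR: probability that at least one event holds,
    assuming (for illustration) that frames are independent."""
    none = 1.0
    for p in probs:
        none *= (1.0 - p)
    return 1.0 - none

def eventually(frame_probs):
    """Soft 'F phi': phi holds in at least one frame."""
    return prob_or(frame_probs)

def always(frame_probs):
    """Soft 'G phi': phi holds in every frame."""
    out = 1.0
    for p in frame_probs:
        out *= p
    return out

def until(p_probs, q_probs):
    """Soft 'p U q': q holds at some frame t while p holds at all frames
    before t. Each term prefix * q_t is combined with noisy-OR -- a crude
    relaxation, since the events overlap."""
    terms, prefix = [], 1.0
    for p, q in zip(p_probs, q_probs):
        terms.append(prefix * q)
        prefix *= p
    return prob_or(terms)

# Usage: score a 4-frame clip against "reach U grasp".
reach = [0.9, 0.8, 0.2, 0.1]   # P(reaching) per frame (made-up values)
grasp = [0.0, 0.1, 0.9, 0.9]   # P(grasping) per frame (made-up values)
score = until(reach, grasp)     # high score => clip matches the spec
```

A training loss could then push the matched video-spec score toward 1 while pushing mismatched pairs toward 0 (the contrastive component the abstract mentions); the temporal and semantic losses would add further structure on top of these scores.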