LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of scoring temporal properties over video and audio (e.g., object appearances, emotion shifts). The authors propose LogSTOP, a scoring function that enables efficient, differentiable evaluation of Linear Temporal Logic (LTL) formulas over sequences of continuous frame-level scores. The method uses multimodal local detectors (YOLO, HuBERT, Grounding DINO, and SlowR50) to produce fine-grained per-frame scores, encodes temporal constraints as LTL formulas, and aggregates the scores with LogSTOP in a semantics-preserving way. Unlike large language models or conventional symbolic temporal-logic approaches, LogSTOP is both interpretable and computationally efficient. Experiments show: (i) at least a 16% improvement in temporal query matching accuracy; (ii) 19% and 16% gains in mean average precision and recall, respectively, for video retrieval; and (iii) markedly better accuracy on complex, semantics-driven cross-modal retrieval.

📝 Abstract
Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
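The idea of lifting per-frame scores in [0, 1] to scores for temporal properties can be illustrated with a toy quantitative semantics for LTL. The sketch below uses simple max/min aggregation over a score sequence; it is an illustrative assumption, not the paper's actual LogSTOP scoring function, and the detector scores in the example are made up.

```python
# Toy quantitative LTL semantics over per-frame score sequences.
# NOTE: illustrative max/min aggregation only -- NOT the LogSTOP function
# defined in the paper.

def eventually(scores):
    """Score for 'F p' (eventually p): p holds at some frame -> max."""
    return max(scores)

def always(scores):
    """Score for 'G p' (always p): p holds at every frame -> min."""
    return min(scores)

def until(phi, psi):
    """Score for 'phi U psi': psi eventually holds at some frame k,
    with phi holding at every frame before k."""
    best = 0.0
    for k in range(len(psi)):
        prefix = min(phi[:k]) if k > 0 else 1.0  # phi on frames before k
        best = max(best, min(psi[k], prefix))
    return best

# Example query from the abstract: "a car is detected until a
# pedestrian is detected", with hypothetical per-frame detector scores.
car = [0.9, 0.8, 0.7, 0.2]
ped = [0.1, 0.1, 0.6, 0.9]
print(until(car, ped))  # 0.7 (frame 3: min(ped=0.9, min of earlier car scores=0.7))
```

Such per-clip scores can then be thresholded for query matching or sorted for ranked retrieval, which is the downstream use the abstract describes.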
Problem

Research questions and friction points this paper is trying to address.

Assigning temporal scores to noisy local property predictions in sequences
Enabling query matching for temporal properties in videos and audio
Improving ranked retrieval accuracy for temporal object and action sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

LogSTOP computes temporal scores from local predictions
Uses Linear Temporal Logic for property representation
Outperforms large models in matching and retrieval tasks