🤖 AI Summary
This work addresses the limited generalization of multimodal large language models (MLLMs) in video temporal grounding tasks, which often stems from coarse-grained perception and reliance on dataset-specific shortcuts. To overcome this, the authors propose a lightweight slot adapter that disentangles visual tokens into abstract slots with object-level semantics. By leveraging object priors from self-supervised vision models, the adapter guides the MLLM toward input-driven, object-centric reasoning. Notably, this approach integrates object-centric representations into MLLMs without requiring end-to-end retraining of the entire multi-stage pipeline. Experiments demonstrate substantial improvements in cross-domain generalization on standard benchmarks while maintaining strong in-domain performance, all with minimal computational overhead.
📝 Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding their predictions in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.
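The core mechanism described above, decomposing a sequence of visual tokens into a small set of slots via slot attention and then reconstructing a token sequence of the original length, can be sketched as follows. This is a minimal illustration of generic slot attention (Locatello et al., 2020), not the paper's actual adapter: the learned projections, GRU slot update, and the objectness-prior guidance from the self-supervised vision model are all omitted, and the token/slot dimensions are made-up for the example.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(tokens, num_slots=4, iters=3, seed=0):
    """Simplified slot attention: attention weights are normalized
    across slots (axis=0), so slots compete to explain each token;
    each slot is then updated as a weighted mean of the tokens it won.
    Learned projections and the GRU update are omitted for brevity."""
    rng = np.random.default_rng(seed)
    n, d = tokens.shape
    slots = rng.normal(size=(num_slots, d))           # random slot init
    for _ in range(iters):
        attn = softmax(slots @ tokens.T / np.sqrt(d), axis=0)  # (S, N), compete over slots
        attn = attn / attn.sum(axis=1, keepdims=True)          # normalize per slot
        slots = attn @ tokens                                  # weighted-mean update
    return slots

def reconstruct(tokens, slots):
    """Broadcast the abstract slots back to the token positions,
    yielding a reconstructed sequence of the original length."""
    d = tokens.shape[1]
    attn = softmax(tokens @ slots.T / np.sqrt(d), axis=1)  # (N, S)
    return attn @ slots                                    # (N, d)

tokens = np.random.default_rng(1).normal(size=(16, 32))  # 16 visual tokens, dim 32
slots = slot_attention(tokens, num_slots=4)
recon = reconstruct(tokens, slots)
print(slots.shape, recon.shape)  # (4, 32) (16, 32)
```

The slot-wise softmax is what distinguishes this from ordinary cross-attention: because each token's attention mass is divided among the slots, the slots partition the input rather than all attending to the same salient tokens, which is what gives them their object-level semantics.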