EVIDENT: Routing MLLM Adaptation through Entity-Grounded Visual Evidence for Cross-Domain Video Temporal Grounding

📅 2026-05-25
🤖 AI Summary
This work addresses the vulnerability of multimodal large language models (MLLMs) to visual domain shift in video temporal grounding (VTG), which causes sharp performance degradation in out-of-domain settings. To mitigate this, the authors propose EVIDENT, a parameter-efficient fine-tuning framework that anchors temporal grounding in the pretrained model's inherent entity attention by routing adaptation through explicit visual entity evidence. It has three components: an Entity Bottleneck Adapter that compresses dense visual tokens into entity-level slots, an Entity-Binding Distillation loss that instills objectness priors so each slot binds to a coherent entity, and an Entity-to-eVidence gating mechanism that steers localization toward moments containing query-relevant entities. This design makes fine-tuning rely on entity-grounded evidence rather than dataset-specific shortcuts. On cross-domain VTG benchmarks, the method consistently improves out-of-domain robustness while preserving competitive in-domain performance, with modest parameter overhead.
📝 Abstract
Fine-tuning MLLMs for Video Temporal Grounding (VTG) often improves in-domain performance but degrades sharply under domain shift. In this work, we find that this failure is driven not only by unseen query concepts but primarily by visual domain shift, which prevents the model from coupling its learned temporal localization knowledge with its inherent entity-attention capability. To address this, we introduce EVIDENT, a parameter-efficient adaptation framework that anchors temporal grounding in the inherent entity-attention of pre-trained MLLMs by routing VTG adaptation through explicit visual entity evidence. EVIDENT consists of three components: (i) an Entity Bottleneck Adapter that transforms dense visual tokens into compact entity-level slots, (ii) an Entity-Binding Distillation loss that instills objectness priors into the semantically unstructured MLLM visual space, guiding each slot to bind to a coherent entity, and (iii) an Entity-to-eVidence gating mechanism that leverages the captured entities as evidence, steering the model to localize moments containing query-relevant entities. Together, these components enable VTG fine-tuning to rely on entity-grounded evidence rather than brittle dataset shortcuts. Experiments on cross-domain VTG benchmarks show that EVIDENT consistently improves out-of-domain robustness while preserving competitive in-domain performance with modest parameter overhead. These results suggest that entity-level grounding is an effective inductive bias for generalizable temporal localization.
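To make components (i) and (iii) more concrete, the following is a minimal NumPy sketch of the general idea, not the authors' implementation. The abstract does not specify the architecture, so everything here is an assumption for illustration: slot-attention-style pooling stands in for the Entity Bottleneck Adapter, a sigmoid over the best query-to-slot similarity stands in for the gating mechanism, and the names `entity_slots`, `evidence_gate`, `slot_queries`, and all shapes are hypothetical. The distillation loss (ii) is omitted because the abstract gives too little detail to sketch it responsibly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entity_slots(tokens, slot_queries):
    """Pool dense visual tokens (T, N, D) into K slots per frame (T, K, D).

    Assumption: slot-attention-style pooling. Tokens compete for slots
    (softmax over the slot axis), then each slot takes a weighted mean
    of the tokens assigned to it.
    """
    d = tokens.shape[-1]
    logits = np.einsum("kd,tnd->tkn", slot_queries, tokens) / np.sqrt(d)
    attn = softmax(logits, axis=1)                      # compete over slots
    attn = attn / (attn.sum(axis=2, keepdims=True) + 1e-8)  # mean over tokens
    return np.einsum("tkn,tnd->tkd", attn, tokens)

def evidence_gate(slots, query):
    """Per-frame gate in (0, 1) from the best-matching slot.

    Assumption: scaled dot-product similarity between each slot and a
    single query embedding, max over slots, squashed by a sigmoid.
    """
    d = slots.shape[-1]
    scores = np.einsum("tkd,d->tk", slots, query) / np.sqrt(d)
    return 1.0 / (1.0 + np.exp(-scores.max(axis=1)))

# Toy inputs: T frames, N visual tokens per frame, D channels, K slots.
rng = np.random.default_rng(0)
T, N, D, K = 8, 16, 32, 4
tokens = rng.normal(size=(T, N, D))
slot_queries = rng.normal(size=(K, D))   # would be learned parameters
query = rng.normal(size=D)               # stand-in for a text query embedding

slots = entity_slots(tokens, slot_queries)
gate = evidence_gate(slots, query)
gated_tokens = tokens * gate[:, None, None]  # down-weight frames lacking evidence
```

In this sketch only `slot_queries` would be trainable, which is in the spirit of the modest parameter overhead the abstract reports; how EVIDENT actually parameterizes its adapter and gate is not stated in the text above.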
Problem

Research questions and friction points this paper is trying to address.

Video Temporal Grounding
Domain Shift
Multimodal Large Language Models
Entity Attention
Cross-Domain Generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entity-Grounded Evidence
Video Temporal Grounding
Cross-Domain Adaptation
Multimodal Large Language Models
Parameter-Efficient Tuning