🤖 AI Summary
This work addresses the challenge of linguistic reference discontinuity and re-identification difficulty in fixed-view videos caused by prolonged occlusion or object departure. To maintain referential coherence during target absence, the authors propose constructing an offline anchor library from the static background, where text-aligned anchor maps serve as persistent semantic memory. An anchor-driven re-entry prior combined with displacement-aware cues enables a lightweight ReID-Gating mechanism for efficient target recapture, without requiring initial-frame visibility or explicit modeling of appearance dynamics. Experiments demonstrate a 10.3% improvement in recapture rate and a 24.2% reduction in latency over the strongest baseline, while ablation studies confirm the contribution of each component.
📝 Abstract
Long-term language-guided referring in fixed-view videos is challenging: the referent may be occluded or leave the scene for long intervals and later re-enter, while framewise referring pipelines drift as re-identification (ReID) becomes unreliable. We present AR2-4FV, which leverages background stability for long-term referring. An offline Anchor Bank is distilled from static background structures; at inference, the text query is aligned with this bank to produce an Anchor Map that serves as persistent semantic memory while the referent is absent. An anchor-based re-entry prior accelerates re-capture upon return, and a lightweight ReID-Gating mechanism maintains identity continuity using displacement cues in the anchor frame. The system predicts per-frame bounding boxes without assuming the target is visible in the first frame or explicitly modeling appearance variations. AR2-4FV improves Re-Capture Rate (RCR) by 10.3% and reduces Re-Capture Latency (RCL) by 24.2% over the best baseline, and ablation studies further confirm the benefits of the Anchor Map, re-entry prior, and ReID-Gating.
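The gating idea described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes the Anchor Map is a 2D spatial likelihood over the fixed view, models the displacement cue as a Gaussian around the last observed position, and uses a simple threshold for the ReID gate; the function names, score combination, and threshold are all invented for illustration.

```python
import numpy as np

def reentry_prior(anchor_map, last_pos, sigma=3.0):
    """Combine a text-aligned anchor map with a displacement-aware
    Gaussian around the last observed position (hypothetical form;
    the paper's exact formulation may differ)."""
    h, w = anchor_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    disp = np.exp(-((ys - last_pos[0]) ** 2 + (xs - last_pos[1]) ** 2)
                  / (2 * sigma ** 2))
    prior = anchor_map * disp
    return prior / (prior.sum() + 1e-8)  # normalize to a distribution

def reid_gate(candidates, reid_scores, prior, tau=0.5):
    """Lightweight gating sketch: accept the candidate whose ReID
    similarity, weighted by the re-entry prior at its location,
    exceeds a threshold; otherwise report the target as absent."""
    best, best_score = None, tau
    for (y, x), s in zip(candidates, reid_scores):
        # prior.size rescales the density so a uniform prior weighs 1.0
        score = s * prior[y, x] * prior.size
        if score > best_score:
            best, best_score = (y, x), score
    return best
```

With an anchor map peaked at a re-entry region (say, a doorway), a moderately similar detection near that region is accepted over a higher-similarity detection far from it, which is the intuition behind combining the re-entry prior with displacement cues.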