🤖 AI Summary
Real-world objects frequently undergo significant appearance changes during state transitions (e.g., an apple being sliced, a butterfly emerging from its chrysalis), causing conventional tracking methods to fail. To address this, we introduce "Track Any State" (TAS), a video object segmentation task focused on state transitions, and present VOST-TAS, the first benchmark explicitly designed for this challenge. To enable zero-shot, temporally coherent tracking and understanding across drastic shape changes, we propose TubeletGraph: a model that treats tubelets (spatiotemporal object segments) as its fundamental units, integrating semantic priors with spatiotemporal reasoning to jointly model state evolution, recovery of lost targets, and state-graph construction. On VOST-TAS, TubeletGraph substantially outperforms existing methods, combining for the first time precise temporal localization, fine-grained state description, and semantic reasoning across transformations. This work establishes a new paradigm for dynamic object understanding.
📝 Abstract
Real-world objects frequently undergo state transformations. From an apple being cut into pieces to a butterfly emerging from its cocoon, tracking objects through these changes is important for understanding real-world objects and dynamics. However, existing methods often lose track of the target object after a transformation, because its appearance changes significantly. To address this limitation, we introduce the task of Track Any State: tracking objects through transformations while detecting and describing their state changes, accompanied by a new benchmark dataset, VOST-TAS. To tackle this problem, we present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states evolve over time. TubeletGraph first identifies potentially overlooked tracks and determines whether they should be integrated based on semantic and proximity priors. It then reasons about the added tracks and generates a state graph describing each observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations while demonstrating a deeper understanding of object state changes, with promising capabilities in temporal grounding and semantic reasoning for complex transformations. Code, additional results, and the benchmark dataset are available at https://tubelet-graph.github.io.
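To make the two-stage idea concrete, here is a minimal, self-contained Python sketch of the recovery-and-graph step described above: candidate tubelets are scored against the pre-transformation target using a semantic prior (embedding similarity) and a proximity prior (distance to where the target was last seen), and each accepted tubelet becomes an edge in a state graph. All names, scoring functions, weights, and thresholds here are illustrative assumptions, not the paper's actual implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Tubelet:
    # Hypothetical representation of a spatiotemporal object segment:
    # a frame span, a mean mask centroid, and a semantic embedding.
    name: str
    start: int
    end: int
    centroid: tuple  # (x, y) in normalized image coordinates
    embedding: list  # semantic feature vector (illustrative)

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def integration_score(target, candidate, w_sem=0.5, w_prox=0.5):
    # Semantic prior: similarity to the pre-transformation target.
    sem = cosine(target.embedding, candidate.embedding)
    # Proximity prior: candidates appearing near the target's last
    # known location are more likely to be its new state.
    prox = 1.0 / (1.0 + math.dist(target.centroid, candidate.centroid))
    return w_sem * sem + w_prox * prox

def recover_tracks(target, candidates, threshold=0.6):
    # Keep overlooked tubelets whose combined score clears the threshold.
    return [c for c in candidates if integration_score(target, c) >= threshold]

def build_state_graph(target, added):
    # One edge per integrated tubelet: (old state, new state, frame at
    # which the recovered track begins), in temporal order.
    return [(target.name, c.name, c.start)
            for c in sorted(added, key=lambda c: c.start)]

# Toy example: an apple is sliced around frame 41; a nearby knife
# should not be absorbed into the apple's track.
apple = Tubelet("whole apple", 0, 40, (0.5, 0.5), [1.0, 0.1, 0.0])
slices = Tubelet("apple slices", 41, 120, (0.52, 0.48), [0.9, 0.2, 0.1])
knife = Tubelet("knife", 0, 120, (0.8, 0.2), [0.0, 1.0, 0.3])

added = recover_tracks(apple, [slices, knife])
graph = build_state_graph(apple, added)
print(graph)  # → [('whole apple', 'apple slices', 41)]
```

The knife is rejected because its embedding is dissimilar to the apple's even though it is spatially plausible, illustrating why the method combines both priors rather than relying on proximity alone.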