AI Summary
This work addresses the limited availability of high-quality, diverse data that hinders comprehensive evaluation of generalization in existing point tracking models under complex scenarios. To this end, we introduce SynthVerse, a large-scale, high-fidelity synthetic dataset that, for the first time, encompasses novel domains including animated styles, embodied manipulation, scene navigation, and articulated objects, enabling precise trajectory annotations under challenging conditions such as complex motion, occlusion, and viewpoint variation. Leveraging this dataset, we establish the first cross-domain point tracking benchmark. Experiments demonstrate that SynthVerse substantially improves model generalization and exposes critical limitations of current trackers in handling diverse dynamic environments.
Abstract
Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.