🤖 AI Summary
Existing video recognition models rely on fixed, coarse-grained taxonomies that struggle to adapt cost-effectively to evolving demands for fine-grained categories. This work introduces and addresses, for the first time, the zero-shot category splitting problem for video classifiers: by uncovering the latent compositional structure within a trained classifier, it automatically refines coarse categories into meaningful subcategories without requiring any additional labeled data. The approach further incorporates low-shot fine-tuning to enhance performance on the newly split classes. Evaluated on a newly established video category splitting benchmark, the method significantly outperforms vision-language baselines, achieving substantial gains in accuracy on novel subcategories while preserving classification performance on the original parent categories.
📝 Abstract
Video recognition models are typically trained on fixed taxonomies that are often too coarse, collapsing distinctions in object, manner, or outcome under a single label. As tasks and definitions evolve, such models cannot capture emerging distinctions, and collecting new annotations and retraining to accommodate such changes is costly. To address these challenges, we introduce category splitting, a new task in which an existing classifier is edited to refine a coarse category into finer subcategories while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.