Let's Split Up: Zero-Shot Classifier Edits for Fine-Grained Video Understanding
AI Summary
Introduces the category splitting task for video classification: an existing classifier's coarse categories are split into fine-grained subcategories without additional data, improving the precision of video understanding.
Main Contributions
- Introduces the category splitting task for fine-grained video understanding.
- Proposes a zero-shot splitting method that exploits the latent compositional structure of video classifiers.
- Builds new video benchmarks for category splitting.
Methodology
Category splitting is performed via zero-shot edits that exploit the latent compositional structure of video classifiers, requiring no additional data; low-shot fine-tuning initialized from the zero-shot edit further improves performance.
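The paper does not give implementation details here, but the core idea of editing a classifier to split a category can be sketched as follows. This is a hypothetical illustration, not the authors' method: it assumes the classifier ends in a linear head, and that each subcategory gets a copy of the parent class's weight row nudged along a per-subcategory direction vector (e.g., derived from text embeddings). The function name `split_category` and all parameters are made up for this sketch.

```python
def split_category(W, b, class_idx, subclass_dirs, alpha=1.0):
    """Split one row of a linear classifier head into several subcategory rows.

    Hypothetical sketch (plain Python lists for clarity):
    W            -- list of weight rows, one per class
    b            -- list of per-class biases
    class_idx    -- index of the coarse class to split
    subclass_dirs-- one direction vector per new subcategory
    alpha        -- step size along each direction
    """
    parent_w, parent_b = W[class_idx], b[class_idx]
    # Keep every other class untouched, preserving accuracy elsewhere.
    new_W = [row for i, row in enumerate(W) if i != class_idx]
    new_b = [bi for i, bi in enumerate(b) if i != class_idx]
    # Each subcategory starts from the parent row, offset by its direction,
    # so the new decision boundaries stay close to the parent's.
    for d in subclass_dirs:
        new_W.append([w + alpha * dj for w, dj in zip(parent_w, d)])
        new_b.append(parent_b)
    return new_W, new_b

# Toy example: split class 1 of a 3-way head into 2 subclasses.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]
dirs = [[0.0, 0.0, 0.1], [0.1, 0.0, 0.0]]
W2, b2 = split_category(W, b, class_idx=1, subclass_dirs=dirs)
print(len(W2))  # 4 classes: 2 untouched + 2 new subcategories
```

In practice the direction vectors could come from the difference between text embeddings of the subcategory names, which is one natural reading of "zero-shot" here; the low-shot variant would then fine-tune the new rows on a few labeled examples.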
Original Abstract
Video recognition models are typically trained on fixed taxonomies which are often too coarse, collapsing distinctions in object, manner or outcome under a single label. As tasks and definitions evolve, such models cannot accommodate emerging distinctions, and collecting new annotations and retraining to reflect such changes is costly. To address these challenges, we introduce category splitting, a new task where an existing classifier is edited to refine a coarse category into finer subcategories, while preserving accuracy elsewhere. We propose a zero-shot editing method that leverages the latent compositional structure of video classifiers to expose fine-grained distinctions without additional data. We further show that low-shot fine-tuning, while simple, is highly effective and benefits from our zero-shot initialization. Experiments on our new video benchmarks for category splitting demonstrate that our method substantially outperforms vision-language baselines, improving accuracy on the newly split categories without sacrificing performance on the rest. Project page: https://kaitingliu.github.io/Category-Splitting/.