TrajTok: Learning Trajectory Tokens enables better Video Understanding
AI Summary
This paper proposes TrajTok, a video tokenizer that dynamically segments videos into trajectories through joint training with downstream models, improving both video understanding performance and efficiency.
Key Contributions
- Proposes TrajTok, an end-to-end video tokenizer module trained jointly with downstream tasks.
- TrajTok extracts spatio-temporal trajectories via implicit clustering, without any external segmentation or tracking pipeline.
- TrajTok performs strongly on video classification, retrieval, and long-video reasoning.
Methodology
TrajTok uses a unified segmenter that performs implicit clustering over space and time to directly produce object trajectories, and is trained jointly with the downstream video model.
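To make the core idea concrete, here is a minimal sketch of implicit space-time clustering using soft k-means as a stand-in for TrajTok's learned segmenter. This is not the authors' architecture: the function name, the choice of soft k-means, and all parameters (`num_tokens`, `iters`) are illustrative assumptions. The point it demonstrates is the one from the abstract: the number of output tokens is fixed by the number of clusters, not by video duration.

```python
import numpy as np

def trajectory_tokenize(video_feats, num_tokens=8, iters=5, seed=0):
    """Hypothetical sketch: soft k-means over all space-time pixel
    features, pooling each cluster into one 'trajectory token'.
    Output size depends on num_tokens, not on video length T."""
    T, H, W, D = video_feats.shape
    x = video_feats.reshape(-1, D)          # (T*H*W, D) space-time pixels
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), num_tokens, replace=False)]
    for _ in range(iters):
        # soft assignment of every pixel to every trajectory slot
        logits = -((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        w = np.exp(logits - logits.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)   # (N, num_tokens)
        # re-estimate each slot as the weighted mean of its pixels
        centers = (w.T @ x) / (w.sum(axis=0)[:, None] + 1e-8)
    return centers                          # (num_tokens, D) tokens

# A 4-frame and a 32-frame clip yield the same token count.
tokens_short = trajectory_tokenize(np.random.rand(4, 8, 8, 16))
tokens_long = trajectory_tokenize(np.random.rand(32, 8, 8, 16))
```

In the actual method the clustering is a learned module trained end-to-end with the downstream objective, so token granularity adapts to semantic complexity rather than being a fixed hyperparameter as in this sketch.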
Original Abstract
Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.