Multimodal Learning Relevance: 9/10

Cluster-Wise Spatio-Temporal Masking for Efficient Video-Language Pretraining

Weijun Zhuang, Yuqing Huang, Weikang Meng, Xin Li, Ming Liu, Xiaopeng Hong, Yaowei Wang, Wangmeng Zuo
arXiv: 2603.22953v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

ClusterSTM proposes a cluster-wise spatio-temporal masking strategy that improves both the efficiency and the performance of video-language pretraining.

Key Contributions

  • Proposes a cluster-wise spatio-temporal masking strategy that mitigates visual information loss and temporal information leakage
  • Introduces a video-text relevance reconstruction objective that strengthens multimodal semantic alignment
  • Achieves state-of-the-art performance across multiple video-language tasks

Methodology

Visual tokens are partitioned via intra-frame clustering; within each cluster, only the token with the highest temporal density is retained, yielding a cluster-wise mask. Pretraining additionally performs video-text relevance reconstruction.
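The masking step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the cluster count, the plain k-means clustering, and the "temporal density" proxy (mean cosine similarity with the same spatial position in adjacent frames) are all assumptions for illustration.

```python
import numpy as np

def cluster_wise_mask(tokens, num_clusters=4, num_iters=10, seed=0):
    """Sketch of cluster-wise spatio-temporal masking.

    tokens: (T, N, D) array -- T frames, N visual tokens per frame, D dims.
    Returns a boolean keep-mask of shape (T, N) with at most `num_clusters`
    retained tokens per frame (one per non-empty intra-frame cluster).
    """
    T, N, _ = tokens.shape
    rng = np.random.default_rng(seed)
    keep = np.zeros((T, N), dtype=bool)

    # Temporal-density proxy (assumption): mean cosine similarity with the
    # token at the same position in neighboring frames.
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    density = np.zeros((T, N))
    for t in range(T):
        sims = [np.sum(unit[t] * unit[n], axis=-1)
                for n in (t - 1, t + 1) if 0 <= n < T]
        density[t] = np.mean(sims, axis=0)

    for t in range(T):
        # Intra-frame clustering: plain k-means on this frame's tokens.
        centers = tokens[t][rng.choice(N, size=num_clusters, replace=False)]
        for _ in range(num_iters):
            d2 = ((tokens[t][:, None, :] - centers[None]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)
            for k in range(num_clusters):
                members = tokens[t][assign == k]
                if len(members):
                    centers[k] = members.mean(axis=0)
        # Cluster-wise masking: keep the highest-density token per cluster.
        for k in range(num_clusters):
            idx = np.flatnonzero(assign == k)
            if idx.size:
                keep[t, idx[density[t, idx].argmax()]] = True
    return keep
```

Tokens where the mask is False would be dropped before the visual encoder, which is the source of the efficiency gain; only the retained (high temporal density) tokens are processed.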

Original Abstract

Large-scale video-language pretraining enables strong generalization across multimodal tasks but often incurs prohibitive computational costs. Although recent advances in masked visual modeling help mitigate this issue, they still suffer from two fundamental limitations: severe visual information loss under high masking ratios and temporal information leakage caused by inter-frame correlations. To address these challenges, we propose ClusterSTM, a Cluster-Wise Spatio-Temporal Masking strategy for efficient video-language pretraining. ClusterSTM first performs intra-frame clustering to partition visual tokens into multiple semantically independent clusters, then conducts cluster-wise masking by retaining the token with the highest temporal density within each cluster. Our masking strategy ensures that the retained tokens capture holistic video content while exhibiting strong temporal correlation. Additionally, we introduce a video-text relevance reconstruction objective that aligns high-level multimodal semantics beyond conventional visual reconstruction. Extensive experiments across multiple benchmarks demonstrate that ClusterSTM achieves superior performance on video-text retrieval, video question answering, and video captioning tasks, establishing a new state of the art among efficient video-language models.

Tags

Video-Language Pretraining  Masked Visual Modeling  Multimodal Learning

arXiv Category

cs.CV