Multimodal Learning · Relevance: 8/10

CAST: Modeling Visual State Transitions for Consistent Video Retrieval

arXiv: 2603.08648v1 发布: 2026-03-09 更新: 2026-03-09

AI Summary

By predicting visual state transitions, CAST improves the consistency and temporal coherence of video retrieval.

Key Contributions

  • Formalizes the Consistent Video Retrieval (CVR) task
  • Proposes CAST, a model for capturing visual state transitions
  • Builds a diagnostic benchmark for CVR

Methodology

By predicting state-conditioned residual updates, CAST introduces an explicit inductive bias for latent state evolution in the vision-language embedding space.
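The residual-update idea can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the paper does not specify the predictor architecture, so this toy uses mean-pooled history as the "state" and a single linear layer to predict the residual $Δ$; `CastAdapter` and all names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere, as is typical
    for cosine-similarity retrieval in vision-language spaces."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

class CastAdapter:
    """Toy state-transition adapter (illustrative, not the paper's model):
    predicts a residual update Δ from the visual history and adds it to
    the most recent clip embedding to form the next-clip query."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Untrained random weights; in practice this would be learned
        # on top of a frozen vision-language backbone.
        self.W = rng.normal(scale=0.02, size=(dim, dim))

    def __call__(self, history):
        # history: (T, dim) frozen clip embeddings of the story so far
        state = history.mean(axis=0)          # crude latent-state summary
        delta = state @ self.W                # state-conditioned residual Δ
        return l2_normalize(history[-1] + delta)  # updated retrieval query

# Usage: rank candidate clips by cosine similarity to the evolved query.
rng = np.random.default_rng(1)
dim = 16
adapter = CastAdapter(dim)
history = l2_normalize(rng.normal(size=(4, dim)))     # 4 past clips
candidates = l2_normalize(rng.normal(size=(8, dim)))  # 8 candidate clips
query = adapter(history)
scores = candidates @ query                           # cosine similarities
best = int(np.argmax(scores))                         # retrieved clip index
```

Because the adapter only adds a residual to embeddings from a frozen backbone, the same mechanism plugs into any embedding space of matching dimension, which is consistent with the paper's plug-and-play framing.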

Original Abstract

As video content creation shifts toward long-form narratives, composing short clips into coherent storylines becomes increasingly important. However, prevailing retrieval formulations remain context-agnostic at inference time, prioritizing local semantic alignment while neglecting state and identity consistency. To address this structural limitation, we formalize the task of Consistent Video Retrieval (CVR) and introduce a diagnostic benchmark spanning YouCook2, COIN, and CrossTask. We propose CAST (Context-Aware State Transition), a lightweight, plug-and-play adapter compatible with diverse frozen vision-language embedding spaces. By predicting a state-conditioned residual update ($Δ$) from visual history, CAST introduces an explicit inductive bias for latent state evolution. Extensive experiments show that CAST improves performance on YouCook2 and CrossTask, remains competitive on COIN, and consistently outperforms zero-shot baselines across diverse foundation backbones. Furthermore, CAST provides a useful reranking signal for black-box video generation candidates (e.g., from Veo), promoting more temporally coherent continuations.

Tags

Video Retrieval · Vision-Language Models · State Transitions · Temporal Consistency

arXiv Categories

cs.CV