Multimodal Learning Relevance: 9/10

MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction

Yizhi Li, Xiaohan Chen, Miao Jiang, Wentao Tang, Gaoang Wang
arXiv: 2602.23228v1 Published: 2026-02-26 Updated: 2026-02-26

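AI Summary

MovieTeller combines tool augmentation with progressive abstraction to generate coherent movie synopses with ID-consistent characters.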

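Key Contributions

  • Proposes a tool-augmented movie synopsis generation framework that requires no fine-tuning
  • Uses an external face recognition tool to establish factual groundings
  • Adopts a progressive abstraction pipeline to mitigate context length limitations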

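Methodology

A face recognition tool determines character identities, which are injected into the prompt to guide the VLM; the resulting descriptions are then summarized through a multi-stage process.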
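
The grounding step can be illustrated with a minimal sketch. It assumes the face recognition tool returns a character name and a pixel bounding box per detected face; the `Grounding` type, the `build_grounded_prompt` function, and the prompt wording are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Grounding:
    """One recognized face (hypothetical structure, not from the paper)."""
    name: str
    bbox: Tuple[int, int, int, int]  # assumed format: (x1, y1, x2, y2) in pixels


def build_grounded_prompt(groundings: List[Grounding]) -> str:
    """Format face recognition results as text to place in a VLM prompt."""
    if not groundings:
        # Without groundings, tell the model not to guess identities.
        return (
            "No known characters were recognized in this frame.\n"
            "Describe the scene without naming any character."
        )
    lines = [f"- {g.name} at bounding box {list(g.bbox)}" for g in groundings]
    return (
        "Recognized characters:\n"
        + "\n".join(lines)
        + "\nDescribe the scene. Refer to characters only by the names listed above."
    )


if __name__ == "__main__":
    print(build_grounded_prompt([Grounding("Alice", (10, 20, 110, 220))]))
```

The resulting string would be sent to the VLM together with the frame, so that character names in the description come from the tool output rather than from the model's own guess.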

Original Abstract

With the explosive growth of digital entertainment, automated video summarization has become indispensable for applications such as content indexing, personalized recommendation, and efficient media archiving. Automatic synopsis generation for long-form videos, such as movies and TV series, presents a significant challenge for existing Vision-Language Models (VLMs). While proficient at single-image captioning, these general-purpose models often exhibit critical failures in long-duration contexts, primarily a lack of ID-consistent character identification and a fractured narrative coherence. To overcome these limitations, we propose MovieTeller, a novel framework for generating movie synopses via tool-augmented progressive abstraction. Our core contribution is a training-free, tool-augmented, fact-grounded generation process. Instead of requiring costly model fine-tuning, our framework directly leverages off-the-shelf models in a plug-and-play manner. We first invoke a specialized face recognition model as an external "tool" to establish Factual Groundings--precise character identities and their corresponding bounding boxes. These groundings are then injected into the prompt to steer the VLM's reasoning, ensuring the generated scene descriptions are anchored to verifiable facts. Furthermore, our progressive abstraction pipeline decomposes the summarization of a full-length movie into a multi-stage process, effectively mitigating the context length limitations of current VLMs. Experiments demonstrate that our approach yields significant improvements in factual accuracy, character consistency, and overall narrative coherence compared to end-to-end baselines.
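
The progressive abstraction idea from the abstract can be sketched as a repeated group-and-summarize loop. This is an illustration under stated assumptions, not the paper's pipeline: the abstract does not specify the number of stages or how scenes are grouped, so `group_size` and the `summarize` callback (which would wrap a model call in practice) are hypothetical.

```python
from typing import Callable, List


def progressive_abstraction(
    scene_descriptions: List[str],
    summarize: Callable[[List[str]], str],
    group_size: int = 4,
) -> str:
    """Reduce many scene descriptions to one synopsis in several stages.

    Each stage summarizes small groups of texts, so no single call has to
    read the whole movie at once. `summarize` and `group_size` are
    assumptions made for this sketch.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each stage shrinks")
    if not scene_descriptions:
        return ""
    level = list(scene_descriptions)
    # Repeat until a single text remains; a lone input is returned unchanged.
    while len(level) > 1:
        level = [
            summarize(level[i:i + group_size])
            for i in range(0, len(level), group_size)
        ]
    return level[0]


if __name__ == "__main__":
    # Stand-in summarizer that only joins its inputs, to show the staging.
    def fake(chunk: List[str]) -> str:
        return "(" + "+".join(chunk) + ")"

    print(progressive_abstraction(["a", "b", "c", "d", "e"], fake, group_size=2))
```

With a real summarizer, `group_size` would be chosen so that each group of texts fits within the model's context window.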

Tags

Movie Synopsis, Multimodal Learning, Tool Augmentation, Video Understanding

arXiv Categories

cs.CV cs.AI