Multimodal Learning Relevance: 9/10

Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

Nikolai Ilinykh, Hyewon Jang, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
arXiv: 2603.25537v1 Published: 2026-03-26 Updated: 2026-03-26

AI Summary

This paper compares humans and vision-language models on the coherence of visually grounded storytelling.

Key Contributions

  • Proposes a set of metrics for measuring narrative coherence
  • Compares the coherence of human-written and VLM-generated stories
  • Reveals how VLMs differ from humans in organising narratives

Methodology

Using metrics such as coreference and discourse relation types, the authors compute narrative coherence scores for human-written and VLM-generated stories and compare the two.
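One plausible way to combine several per-story metrics into a single coherence score is to z-normalise each metric across the corpus and average. This is a hypothetical sketch for illustration only, not the paper's released method; the metric names and aggregation are assumptions.

```python
# Hypothetical sketch: aggregate per-story coherence metrics
# (e.g. coreference, topic continuity) into one score by
# z-normalising each metric across stories and averaging.
from statistics import mean, stdev

def coherence_scores(stories):
    """stories: list of dicts mapping metric name -> raw value,
    e.g. {"coreference": 0.8, "topic_continuity": 0.6}."""
    metrics = sorted(stories[0])
    # per-metric mean and std across the corpus, for z-normalisation
    stats = {}
    for m in metrics:
        vals = [s[m] for s in stories]
        stats[m] = (mean(vals), stdev(vals) or 1.0)
    scores = []
    for s in stories:
        z = [(s[m] - stats[m][0]) / stats[m][1] for m in metrics]
        scores.append(mean(z))
    return scores
```

With such per-story scores in hand, the human and VLM distributions can be compared directly (e.g. with a rank-sum test), which matches the paper's observation that differences become clearer when the metrics are considered jointly.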

Original Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.

Tags

Narrative Coherence · Vision-Language Models · Multimodal · Evaluation Metrics

arXiv Categories

cs.CL