Multimodal Learning Relevance: 9/10

V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi
arXiv: 2603.16581v1 · Published: 2026-03-17 · Updated: 2026-03-17

AI Summary

The V-DyKnow benchmark evaluates how VLMs handle time-sensitive knowledge, revealing limitations in how models update facts and maintain consistency across modalities.

Main Contributions

  • Proposes the V-DyKnow benchmark for evaluating time-sensitive knowledge in VLMs
  • Analyzes the reliability of VLMs across modalities and under input perturbations
  • Studies the efficacy of knowledge editing and multimodal RAG methods for knowledge updates

Methodology

The authors construct a visual dynamic-knowledge benchmark and assess each model's knowledge-update capability and reliability by analyzing its behavior across modalities and over time, complemented by data-level and mechanistic analyses.
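To make the evaluation setup concrete, the sketch below shows what such a cross-modal, time-sensitive probe might look like. This is only an illustrative assumption, not the paper's actual harness: the schema (`TimeSensitiveFact`), the `query_vlm` wrapper, and the metric names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TimeSensitiveFact:
    """One dynamic fact with its dated answer history (illustrative schema)."""
    question_text: str    # textual stimulus, e.g. a question about an entity
    question_image: str   # path to a visual stimulus depicting the same entity
    answers_by_date: dict # e.g. {"2022-01": "Alice", "2024-06": "Bob"}

def latest_answer(fact: TimeSensitiveFact) -> str:
    """Ground truth at evaluation time = the most recently dated answer.
    ISO date keys sort lexicographically, so max() picks the newest."""
    return fact.answers_by_date[max(fact.answers_by_date)]

def evaluate(facts, query_vlm):
    """Score per-modality correctness and cross-modal consistency.

    `query_vlm(text=None, image=None)` is a hypothetical wrapper around
    the model under test; it returns a short answer string.
    """
    correct = {"text": 0, "image": 0}
    consistent = 0
    for fact in facts:
        gold = latest_answer(fact)
        ans_text = query_vlm(text=fact.question_text)
        ans_image = query_vlm(image=fact.question_image)
        correct["text"] += ans_text == gold
        correct["image"] += ans_image == gold
        consistent += ans_text == ans_image  # same answer in both modalities?
    n = len(facts)
    return {
        "text_acc": correct["text"] / n,
        "image_acc": correct["image"] / n,
        "cross_modal_consistency": consistent / n,
    }
```

Separating correctness from consistency mirrors the paper's notion of reliability: a model can be wrong in the same way across modalities (consistent but outdated) or recognize the entity yet answer differently depending on the stimulus.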

Original Abstract

Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.
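As a schematic illustration of the knowledge-update setting in point b) above, a multimodal RAG baseline might prepend a retrieved, dated fact to the prompt before the model answers. Again a minimal sketch under stated assumptions: `retrieve_latest` and `query_vlm` are hypothetical stand-ins, not the paper's pipeline.

```python
def rag_query(question_text, question_image, retrieve_latest, query_vlm):
    """Retrieval-augmented variant of the same cross-modal probe.

    `retrieve_latest(question)` is assumed to return a dated snippet from an
    up-to-date source, e.g. "[2026-02] The current holder of the role is Y."
    """
    snippet = retrieve_latest(question_text)
    prompt = (
        f"Context (retrieved; may supersede your training data):\n{snippet}\n\n"
        f"Question: {question_text}\nAnswer concisely."
    )
    # Per the abstract's finding, even with retrieved updates the answer to
    # the visual form of the same question can remain outdated or diverge.
    return {
        "text_answer": query_vlm(text=prompt),
        "image_answer": query_vlm(text=prompt, image=question_image),
    }
```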

Tags

VLM · Time-sensitive knowledge · Multimodal learning · Dynamic benchmark

arXiv Category

cs.AI