I Came, I Saw, I Explained: Benchmarking Multimodal LLMs on Figurative Meaning in Memes
AI Summary
This paper evaluates the ability of multimodal large language models to understand figurative meaning in memes, finding that the models are biased and their explanations are not always faithful.
Main Contributions
- Evaluated the performance of MLLMs on understanding figurative meaning in memes
- Revealed a model bias toward attributing figurative meaning
- Analyzed the faithfulness of model-generated explanations
Methodology
Across three datasets, eight MLLMs were evaluated on their ability to detect and explain six types of figurative meaning, complemented by a human evaluation of the generated explanations.
Original Abstract
Internet memes represent a popular form of multimodal online communication and often use figurative elements to convey layered meaning through the combination of text and images. However, it remains largely unclear how multimodal large language models (MLLMs) combine and interpret visual and textual information to identify figurative meaning in memes. To address this gap, we evaluate eight state-of-the-art generative MLLMs across three datasets on their ability to detect and explain six types of figurative meaning. In addition, we conduct a human evaluation of the explanations generated by these MLLMs, assessing whether the provided reasoning supports the predicted label and whether it remains faithful to the original meme content. Our findings indicate that all models exhibit a strong bias to associate a meme with figurative meaning, even when no such meaning is present. Qualitative analysis further shows that correct predictions are not always accompanied by faithful explanations.