Multimodal Learning Relevance: 9/10

VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
arXiv: 2602.04587v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

The VILLAIN system verifies image-text claims with vision-language models through multi-agent collaboration, and ranked first in the AVerImaTeC shared task.

Key Contributions

  • Proposes a prompt-based multi-agent collaboration framework
  • Enriches evidence using a knowledge store and additional web collection
  • Ranked first in the AVerImaTeC shared task

Methodology

Multiple agents collaborate in stages: textual and visual evidence retrieval, evidence analysis, question-answer generation, and finally verdict prediction.
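The staged pipeline above can be sketched as follows. This is a minimal illustration only: the agent functions are plain Python stand-ins for prompted vision-language-model agents, and every name (`retrieve_evidence`, `analyze_modality`, `predict_verdict`, etc.) is hypothetical, not taken from the VILLAIN codebase.

```python
def retrieve_evidence(claim):
    # Stage 1: retrieve textual and visual evidence from the knowledge
    # store (stubbed here as a fixed lookup).
    return {"text": ["retrieved article"], "image": ["retrieved photo"]}

def analyze_modality(claim, evidence, modality):
    # Stage 2a: a modality-specific agent writes an analysis report.
    return f"{modality} report covering {len(evidence[modality])} item(s)"

def cross_modal_analysis(claim, reports):
    # Stage 2b: a cross-modal agent reconciles inconsistencies
    # between the modality-specific reports.
    return " | ".join(reports)

def generate_qa(analysis):
    # Stage 3: produce question-answer pairs grounded in the reports.
    return [("Is the image consistent with the text?", analysis)]

def predict_verdict(claim, qa_pairs):
    # Stage 4: the Verdict Prediction agent decides from the claim
    # and the generated question-answer pairs.
    return "Supported" if qa_pairs else "Not Enough Evidence"

def verify(claim):
    # Orchestrate the four stages end to end.
    evidence = retrieve_evidence(claim)
    reports = [analyze_modality(claim, evidence, m) for m in ("text", "image")]
    analysis = cross_modal_analysis(claim, reports)
    qa_pairs = generate_qa(analysis)
    return predict_verdict(claim, qa_pairs)

print(verify("An image-text claim to check"))
```

In the real system each function would be a prompted VLM agent call; the fixed control flow between stages is the point of the sketch.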

Original Abstract

This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.

Tags

Multimodal Fact-Checking Agents Vision-Language Models

arXiv Categories

cs.CL cs.AI cs.CY