Multimodal Learning Relevance: 10/10

Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory

Shunki Uebayashi, Kento Masui, Kyohei Atarashi, Han Bao, Hisashi Kashima, Naoto Inoue, Mayu Otani, Koh Takeuchi
arXiv: 2603.02663v1 Published: 2026-03-03 Updated: 2026-03-03

AI Summary

Proposes the M3IRT framework for evaluating the cross-modal reasoning ability of MLLMs and for refining multimodal benchmarks.

Key Contributions

  • Proposed M3IRT, a multimodal and multidimensional item response theory framework
  • Used M3IRT to evaluate the cross-modal reasoning ability of MLLMs and the cross-modal difficulty of questions
  • Used M3IRT to refine multimodal benchmarks, improving evaluation efficiency

Methodology

Extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components, enabling assessment of the cross-modal ability of MLLMs.
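The decomposition above can be sketched as a simple additive logistic model. This is an illustrative assumption, not the paper's exact parameterization: the component names (`image`, `text`, `cross`) and the additive 1PL-style form are hypothetical stand-ins for M3IRT's actual formulation.

```python
import math

def p_correct(theta, b):
    """Probability that a model answers an item correctly under a
    sketch of a multidimensional IRT model (assumed additive 1PL form;
    the paper's true M3IRT parameterization may differ).

    theta: dict of model-ability components, keys 'image', 'text', 'cross'
    b:     dict of item-difficulty components, same keys
    """
    # Sum ability-minus-difficulty over the three modality components,
    # then squash through a logistic link.
    logit = sum(theta[k] - b[k] for k in ("image", "text", "cross"))
    return 1.0 / (1.0 + math.exp(-logit))

# A shortcut question would have a large negative b['cross'] (trivially
# easy cross-modally), so item selection could rank items by b['cross'].
```

Under this sketch, a model whose abilities exactly match an item's difficulties answers it correctly with probability 0.5, matching the standard IRT convention.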

Original Abstract

Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, the correct answer can often be found from the image alone or from the text alone. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multi-modal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.

Tags

Multimodal MLLM Evaluation Item Response Theory Benchmark

arXiv Categories

cs.CL cs.CV