Evaluating Cross-Modal Reasoning Ability and Problem Characteristics with Multimodal Item Response Theory
AI Summary
We propose the M3IRT framework to evaluate the cross-modal reasoning ability of MLLMs and to refine multimodal benchmarks.
Main Contributions
- Propose a multimodal, multidimensional item response theory framework (M3IRT)
- Use M3IRT to estimate the cross-modal reasoning ability of MLLMs and the difficulty of benchmark questions
- Use M3IRT to refine multimodal benchmarks, improving evaluation efficiency
Methodology
Extends classical IRT by decomposing both model ability and question difficulty into image-only, text-only, and cross-modal components, allowing the cross-modal ability of MLLMs to be estimated directly.
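The decomposition above can be sketched in code. This is a minimal illustration assuming a compensatory multidimensional 2PL-style response model; the paper's exact parameterization of M3IRT may differ, and all variable names and numeric values here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(theta, difficulty, discrimination):
    """Probability that a model answers a question correctly.

    theta, difficulty, discrimination: dicts keyed by the three
    assumed components 'img', 'txt', and 'cross' (an additive
    2PL-style form; an assumption, not the paper's exact model).
    """
    logit = sum(
        discrimination[k] * (theta[k] - difficulty[k])
        for k in ("img", "txt", "cross")
    )
    return sigmoid(logit)

# A "shortcut" question loads almost entirely on one modality:
# its outcome barely depends on cross-modal ability.
shortcut = {"img": 0.0, "txt": 1.5, "cross": 0.05}
genuine = {"img": 0.8, "txt": 0.8, "cross": 1.5}

theta_low_cross = {"img": 1.0, "txt": 1.0, "cross": -1.0}
theta_high_cross = {"img": 1.0, "txt": 1.0, "cross": 1.0}
difficulty = {"img": 0.0, "txt": 0.0, "cross": 0.0}

# On the genuinely cross-modal item, cross-modal ability shifts
# the success probability a lot; on the shortcut item, barely.
gap_genuine = (p_correct(theta_high_cross, difficulty, genuine)
               - p_correct(theta_low_cross, difficulty, genuine))
gap_shortcut = (p_correct(theta_high_cross, difficulty, shortcut)
                - p_correct(theta_low_cross, difficulty, shortcut))
print(gap_genuine > gap_shortcut)  # True
```

Under this sketch, items whose cross-modal discrimination is near zero are exactly the shortcut questions M3IRT would deprioritize when selecting a compact benchmark subset.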
Original Abstract
Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability for cross-modal integration. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, the correct answer can often be found using only the image or only the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multimodal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.