Multimodal Learning Relevance: 9/10

OddGridBench: Exposing the Lack of Fine-Grained Visual Discrepancy Sensitivity in Multimodal Large Language Models

Tengjin Weng, Wenhao Jiang, Jingyi Wang, Ming Li, Lin Ma, Zhong Ming
arXiv: 2603.09326v1 Published: 2026-03-10 Updated: 2026-03-10

AI Summary

This paper introduces OddGridBench, a benchmark for evaluating MLLMs' ability to detect fine-grained visual discrepancies, and proposes OddGrid-GRPO to optimize that ability.

Key Contributions

  • Introduces the OddGridBench benchmark
  • Reveals the shortcomings of existing MLLMs in fine-grained visual discrepancy detection
  • Proposes the OddGrid-GRPO framework to strengthen models' visual discrimination ability

Methodology

Constructs a dataset of grid images, each containing one discrepant element, and optimizes the model with the reinforcement learning framework OddGrid-GRPO through curriculum learning and a distance-aware reward.
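The two ingredients of OddGrid-GRPO named above can be sketched as follows. The function names, the linear difficulty schedule, and the Chebyshev-distance decay are illustrative assumptions, not the paper's exact formulation:

```python
# Hedged sketch of OddGrid-GRPO's training signals: a curriculum over
# sample difficulty and a distance-aware reward. All specifics below
# (linear schedule, Chebyshev decay) are assumptions for illustration.

def curriculum_grid_size(step, total_steps, min_n=3, max_n=9):
    """Grow the grid size (a proxy for difficulty) linearly over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return min_n + round(frac * (max_n - min_n))

def distance_aware_reward(pred_cell, true_cell, grid_size):
    """Reward in [0, 1] that decays with the Chebyshev distance between
    the predicted cell and the true odd cell; exact hits score 1."""
    dist = max(abs(pred_cell[0] - true_cell[0]),
               abs(pred_cell[1] - true_cell[1]))
    return 1.0 if dist == 0 else max(0.0, 1.0 - dist / (grid_size - 1))
```

A near-miss prediction thus still receives partial credit, which gives the policy a denser learning signal than a binary hit/miss reward.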

Original Abstract

Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, where a single element differs from all others by one or multiple visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5, and proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances the model's fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
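The abstract's description of the benchmark (grid images where a single element differs by one attribute such as color, size, or rotation) suggests a simple generation recipe. The sketch below is a hypothetical reconstruction; the attribute values and grid specification are assumptions, not the benchmark's actual parameters:

```python
import random

# Hypothetical sketch of OddGridBench-style sample generation: every cell
# shares the same base attributes except one randomly chosen "odd" cell,
# which differs in exactly one attribute. Attribute values are illustrative.

def make_odd_grid(n=4, seed=0):
    rng = random.Random(seed)
    base = {"color": "blue", "size": 20, "rotation": 0}
    attr = rng.choice(["color", "size", "rotation"])
    odd_value = {"color": "red", "size": 28, "rotation": 30}[attr]
    odd = (rng.randrange(n), rng.randrange(n))
    grid = {}
    for r in range(n):
        for c in range(n):
            cell = dict(base)
            if (r, c) == odd:
                cell[attr] = odd_value  # the single discrepant element
            grid[(r, c)] = cell
    return grid, odd, attr
```

Because generation is fully parameterized (grid size, perturbed attribute, perturbation magnitude), difficulty is controllable, which is what makes such a benchmark suitable for the curriculum-based training the paper proposes.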

Tags

MLLM · Visual Discrepancy · Benchmark · Reinforcement Learning · Visual Perception

arXiv Category

cs.CV