GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing
AI Summary
GRADE is a benchmark for discipline-informed reasoning in image editing; it reveals substantial shortcomings of existing models in this setting.
Key Contributions
- Introduces the GRADE benchmark, comprising 520 curated samples across 10 academic disciplines
- Proposes a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability
- Evaluates 20 state-of-the-art models, revealing significant performance gaps on knowledge-intensive editing tasks
Methodology
Construct a multi-disciplinary image-editing dataset, design multi-dimensional evaluation metrics, then evaluate and analyze existing models to expose their weaknesses in knowledge-grounded reasoning.
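The paper does not specify here how the three evaluation dimensions are combined into a benchmark score. A minimal sketch of one plausible aggregation, assuming each dimension is scored on a 0–1 scale and combined by an unweighted mean (both assumptions, not the paper's actual protocol):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EditEvaluation:
    """Hypothetical per-sample scores on GRADE's three dimensions (0-1 scale assumed)."""
    discipline_reasoning: float
    visual_consistency: float
    logical_readability: float

    def overall(self) -> float:
        # Unweighted mean is an assumption; the actual protocol may weight dimensions.
        return mean([self.discipline_reasoning,
                     self.visual_consistency,
                     self.logical_readability])

def benchmark_score(samples: list[EditEvaluation]) -> float:
    """Average the per-sample overall scores across all evaluated samples."""
    return mean(s.overall() for s in samples)

# Example with made-up scores for two samples:
evals = [EditEvaluation(0.6, 0.9, 0.8), EditEvaluation(0.3, 0.7, 0.5)]
print(round(benchmark_score(evals), 3))  # -> 0.633
```

Keeping the three dimension scores separate until the final aggregation mirrors the paper's finding that models fail unevenly: a model can score well on Visual Consistency while failing Discipline Reasoning.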
Original Abstract
Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, we introduce GRADE, the first benchmark to assess discipline-informed knowledge and reasoning in image editing. GRADE comprises 520 carefully curated samples across 10 academic domains, spanning from natural science to social science. To support rigorous evaluation, we propose a multi-dimensional evaluation protocol that jointly assesses Discipline Reasoning, Visual Consistency, and Logical Readability. Extensive experiments on 20 state-of-the-art open-source and closed-source models reveal substantial limitations in current models under implicit, knowledge-intensive editing settings, leading to large performance gaps. Beyond quantitative scores, we conduct rigorous analyses and ablations to expose model shortcomings and identify the constraints within disciplinary editing. Together, GRADE pinpoints key directions for the future development of unified multimodal models, advancing the research on discipline-informed image editing and reasoning. Our benchmark and evaluation code are publicly released.