GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?
AI Summary
GenArena introduces a pairwise-comparison framework for evaluating visual generation models, improving evaluation stability and alignment with human perception.
Key Contributions
- Identified the limitations of pointwise evaluation methods
- Proposed GenArena, an evaluation framework based on pairwise comparison
- Demonstrated that GenArena substantially improves evaluation accuracy and aligns more closely with human perception
Methodology
The paper adopts a pairwise comparison paradigm: rather than assigning each output an absolute score, the judge model is shown two generated results and asked which one better satisfies the prompt. This sidesteps the biases introduced by absolute scoring and yields a more accurate assessment of visual generation models, as sketched below.
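A minimal sketch of such a pairwise evaluation loop follows. The `query_vlm_judge` function is a hypothetical placeholder for whatever VLM judging API is used, and the swapped-order double pass and win-count aggregation are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of a pairwise evaluation loop in the spirit of GenArena.
from collections import defaultdict
from itertools import combinations

def query_vlm_judge(prompt: str, image_a: bytes, image_b: bytes) -> str:
    """Hypothetical VLM call: returns 'A', 'B', or 'tie'."""
    raise NotImplementedError("plug in a concrete VLM client here")

def pairwise_eval(prompts, outputs):
    """outputs: dict mapping model name -> list of images aligned with prompts."""
    wins = defaultdict(float)
    for i, prompt in enumerate(prompts):
        for model_a, model_b in combinations(outputs, 2):
            # Query the judge in both presentation orders to reduce
            # position bias (an assumed mitigation, common for LLM judges).
            v1 = query_vlm_judge(prompt, outputs[model_a][i], outputs[model_b][i])
            v2 = query_vlm_judge(prompt, outputs[model_b][i], outputs[model_a][i])
            for verdict, first, second in ((v1, model_a, model_b),
                                           (v2, model_b, model_a)):
                if verdict == "A":
                    wins[first] += 1.0
                elif verdict == "B":
                    wins[second] += 1.0
                else:  # tie: split the point
                    wins[first] += 0.5
                    wins[second] += 0.5
    # Rank models by total pairwise wins.
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
```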
Original Abstract
The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding: simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves a Spearman correlation of 0.86 with the authoritative LMArena leaderboard, drastically surpassing the 0.36 correlation of pointwise methods. Based on GenArena, we benchmark state-of-the-art visual generation models across diverse tasks, providing the community with a rigorous and automated evaluation standard for visual generation.
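To make the reported correlation concrete, the sketch below shows how a Spearman correlation between two model rankings can be computed with `scipy.stats.spearmanr`. The rank values are made-up placeholders for illustration, not the paper's data.

```python
# Compare a framework's model ranking against a reference leaderboard
# ranking (e.g., LMArena) via Spearman rank correlation.
from scipy.stats import spearmanr

lmarena_ranks   = [1, 2, 3, 4, 5]   # reference ordering of five models (placeholder)
framework_ranks = [1, 3, 2, 4, 5]   # ordering produced by the evaluator (placeholder)

rho, p_value = spearmanr(lmarena_ranks, framework_ranks)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")  # rho = 0.90 for this toy data
```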