Multimodal Learning Relevance: 9/10

VTEdit-Bench: A Comprehensive Benchmark for Multi-Reference Image Editing Models in Virtual Try-On

Xiaoye Liang, Zhiyuan Qu, Mingye Zou, Jiaxin Liu, Lai Jiang, Mai Xu, Yiheng Zhu
arXiv: 2603.11734v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

Proposes VTEdit-Bench, a benchmark for evaluating the performance of universal image editing models on virtual try-on tasks.

Key Contributions

  • Constructed the VTEdit-Bench benchmark dataset, covering a variety of complex virtual try-on scenarios.
  • Proposed VTEdit-QA, a reference-aware, VLM-based evaluator.
  • Systematically evaluated both universal image editing models and specialized virtual try-on models.

Methodology

Constructs a benchmark dataset for evaluation, proposes a VLM-based evaluation metric, and compares the performance of different models on the dataset.
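The evaluation protocol described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: the aspect names (model consistency, cloth consistency, image quality) come from the abstract, but the prompt wording, the 1–5 scale, and the `vlm_score` callback are hypothetical placeholders for whatever VLM judge is used.

```python
# Hypothetical sketch of a reference-aware, VLM-based try-on evaluator
# in the spirit of VTEdit-QA: score each edited image on three aspects
# and aggregate. The questions and the 1-5 scale are assumptions.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

ASPECTS = {
    "model_consistency": "Does the person's identity and pose match the source image?",
    "cloth_consistency": "Does the worn garment match the reference clothing image?",
    "image_quality": "Is the edited image realistic and free of artifacts?",
}

@dataclass
class VTONScore:
    per_aspect: dict   # aspect name -> score on a 1-5 scale
    overall: float     # mean of the aspect scores

def evaluate_tryon(edited, person_ref, cloth_ref,
                   vlm_score: Callable[[str, tuple], float]) -> VTONScore:
    """Query a VLM judge once per aspect and average the scores.

    `vlm_score(question, images)` stands in for any VLM that returns a
    numeric rating given a question and the relevant images.
    """
    images = (edited, person_ref, cloth_ref)
    per_aspect = {name: vlm_score(q, images) for name, q in ASPECTS.items()}
    return VTONScore(per_aspect=per_aspect, overall=mean(per_aspect.values()))

# Usage with a stub judge that rates every aspect 4/5:
stub = lambda question, images: 4.0
score = evaluate_tryon("edit.png", "person.png", "cloth.png", stub)
print(score.overall)  # 4.0
```

Averaging the three aspect scores is one simple aggregation choice; a real evaluator might weight the aspects differently or report them separately, as the paper's per-aspect results suggest.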

Original Abstract

As virtual try-on (VTON) continues to advance, a growing number of real-world scenarios have emerged, pushing beyond the ability of the existing specialized VTON models. Meanwhile, universal multi-reference image editing models have progressed rapidly and exhibit strong generalization in visual editing, suggesting a promising route toward more flexible VTON systems. However, despite their strong capabilities, the strengths and limitations of universal editors for VTON remain insufficiently explored due to the lack of systematic evaluation benchmarks. To address this gap, we introduce VTEdit-Bench, a comprehensive benchmark designed to evaluate universal multi-reference image editing models across various realistic VTON scenarios. VTEdit-Bench contains 24,220 test image pairs spanning five representative VTON tasks with progressively increasing complexity, enabling systematic analysis of robustness and generalization. We further propose VTEdit-QA, a reference-aware VLM-based evaluator that assesses VTON performance from three key aspects: model consistency, cloth consistency, and overall image quality. Through this framework, we systematically evaluate eight universal editing models and compare them with seven specialized VTON models. Results show that top universal editors are competitive on conventional tasks and generalize more stably to harder scenarios, but remain challenged by complex reference configurations, particularly multi-cloth conditioning.

Tags

Virtual Try-On, Image Editing, Benchmark Dataset, Multimodal, Evaluation Metrics

arXiv Categories

cs.CV