Multimodal Learning Relevance: 9/10

HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao, Jianbing Shen
arXiv: 2603.10814v1 Published: 2026-03-11 Updated: 2026-03-11

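AI Summary

Proposes HanMoVLM for professional evaluation of Chinese artistic paintings, improving how well VLMs understand and assess works in the artistic domain.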

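Main Contributions

  • Builds the HanMo-Bench dataset, which contains authentic auction-grade works and AI-generated works
  • Proposes the HanMoVLM model, which uses expert-validated Chain-of-Thought reasoning
  • Designs a reward function that refines HanMoVLM's reasoning process to improve accuracy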

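Methodology

Build a dataset, train a VLM to perform CoT reasoning, and design a reward function that refines the reasoning process, so that the model evaluates Chinese paintings the way an expert would.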

Original Abstract

While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.
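
The verifier role described in the abstract (picking the best of several generated candidates) is commonly implemented as best-of-N selection. The sketch below is a minimal illustration of that general technique, not the paper's implementation: `select_best`, `score_fn`, and `toy_score` are hypothetical names, and `toy_score` is a hard-coded stand-in for whatever score a verifier such as HanMoVLM would return.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def select_best(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
) -> Tuple[T, float]:
    """Return the highest-scoring candidate and its score (best-of-N).

    score_fn is an assumed interface: any callable that maps one
    candidate to a float, where higher means better. Ties go to the
    candidate that appears first.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Score every candidate once, remembering its position.
    scored = [(score_fn(c), i) for i, c in enumerate(candidates)]
    # Highest score wins; on equal scores the lowest index wins.
    best_score, best_idx = max(scored, key=lambda p: (p[0], -p[1]))
    return candidates[best_idx], best_score


def toy_score(image_id: str) -> float:
    # Hypothetical stand-in for a verifier score; values are made up
    # purely so the example runs.
    return {"a": 0.2, "b": 0.9, "c": 0.5}[image_id]


best, score = select_best(["a", "b", "c"], toy_score)
print(best, score)
```

In practice the candidates would be images sampled from a generative model and `score_fn` would wrap a call to the verifier. Increasing the number of candidates spends more inference compute in exchange for a better chance that one of them scores highly, which is the trade-off test-time scaling relies on.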

Tags

VLM Multimodal Learning Chain-of-Thought Chinese Painting Evaluation

arXiv Categories

cs.CV