Multimodal Learning Relevance: 9/10

R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin
arXiv: 2603.10578v1 Published: 2026-03-11 Updated: 2026-03-11

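AI Summary

For CG image quality assessment, the paper proposes R4-CGQA, a retrieval-augmented VLM framework that improves VLMs' ability to assess CG image quality.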

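Main Contributions

  • Constructed a dataset of CG images with quality descriptions
  • Proposed R4-CGQA, a retrieval-augmented two-stream framework
  • Verified that the method effectively improves VLM performance on CG quality assessment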

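Methodology

Build a dataset, adopt retrieval-augmented generation, and construct a two-stream retrieval framework that uses descriptions of visually similar images to improve VLM understanding.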
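
The core idea of using descriptions of visually similar images can be illustrated with a minimal sketch. It is not the paper's two-stream framework, whose details are not given here: it shows only a single similarity-based retrieval step followed by prompt assembly. The names `Entry`, `retrieve`, and `build_prompt` are hypothetical, the embeddings are assumed to come from some image encoder that is not shown, and cosine similarity is an assumed choice of similarity measure.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """One item of a hypothetical reference set: an image embedding and its quality description."""
    embedding: List[float]
    description: str


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: List[float], entries: List[Entry], k: int = 2) -> List[Entry]:
    """Return the k entries whose embeddings are most similar to the query embedding."""
    ranked = sorted(entries, key=lambda e: cosine(query, e.embedding), reverse=True)
    return ranked[:k]


def build_prompt(question: str, neighbours: List[Entry]) -> str:
    """Prepend the retrieved descriptions to the question as context for a VLM (assumed prompt layout)."""
    lines = ["Descriptions of visually similar CG images:"]
    for i, entry in enumerate(neighbours, start=1):
        lines.append(f"{i}. {entry.description}")
    lines.append(f"Question about the query image: {question}")
    return "\n".join(lines)
```

In such a setup, the resulting prompt would be passed to the VLM together with the query image; how the retrieved context is actually weighted or fused is a design decision the summary does not specify.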

Original Abstract

Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; and second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.

Tags

CG image quality assessment, vision-language models, retrieval augmentation, multimodal learning

arXiv Categories

cs.CV cs.DB