Multimodal Learning Relevance: 9/10

R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin

arXiv: 2603.10578v1 Published: 2026-03-11 Updated: 2026-03-11


AI Summary

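For computer graphics (CG) image quality assessment, the paper proposes R4-CGQA, a retrieval-augmented VLM framework that improves the ability of VLMs to assess CG image quality.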

Main Contributions

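Constructed a dataset of CG images paired with quality descriptions
Proposed R4-CGQA, a retrieval-augmented two-stream framework
Showed that the method effectively improves the CG quality assessment performance of VLMs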

Methodology

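The authors build a dataset, adopt retrieval-augmented generation, and construct a two-stream retrieval framework that uses the descriptions of visually similar images to improve the VLM's understanding of a given CG image.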
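The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation: this summary does not say what the two streams are, so `stream_a` and `stream_b` are hypothetical placeholder embeddings, the fusion weight `alpha` is an assumption, and the gallery entries and descriptions are toy data invented for the example. The sketch only shows the general idea of ranking a gallery by fused similarity and prepending the retrieved descriptions to the prompt given to a VLM.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query, gallery, k=2, alpha=0.5):
    """Return the k gallery items with the highest fused similarity.

    The fused score is a weighted sum of the similarities in the two
    hypothetical streams (an assumption, not the paper's stated rule).
    """
    scored = []
    for item in gallery:
        score = (alpha * cosine(query["stream_a"], item["stream_a"])
                 + (1 - alpha) * cosine(query["stream_b"], item["stream_b"]))
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def build_prompt(question, neighbours):
    """Prepend the retrieved descriptions to the question as plain text."""
    lines = ["Descriptions of visually similar CG images:"]
    for i, item in enumerate(neighbours, start=1):
        lines.append(f"{i}. {item['description']}")
    lines.append(f"Question about the query image: {question}")
    return "\n".join(lines)


# Toy gallery: vectors and descriptions are made up for illustration.
gallery = [
    {"name": "A", "stream_a": [1.0, 0.0], "stream_b": [1.0, 0.0],
     "description": "Stylized scene, sharp textures, no visible aliasing."},
    {"name": "B", "stream_a": [0.0, 1.0], "stream_b": [0.0, 1.0],
     "description": "Realistic scene, blurry shadows, noticeable noise."},
    {"name": "C", "stream_a": [1.0, 0.1], "stream_b": [0.9, 0.1],
     "description": "Stylized scene, mild aliasing on object edges."},
]
query = {"stream_a": [1.0, 0.0], "stream_b": [1.0, 0.0]}

neighbours = retrieve(query, gallery, k=2)
prompt = build_prompt("Is the lighting in this image natural?", neighbours)
print(prompt)
```

In a real system the vectors would come from image encoders and the prompt would be sent to a VLM together with the query image; both steps are omitted here because the summary gives no details about them.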

Original Abstract

Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: first, existing CG datasets lack systematic descriptions of rendering quality; second, existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM's understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.

arXiv Categories

cs.CV cs.DB
