Multimodal Learning Relevance: 9/10

Understanding Degradation with Vision Language Model

Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li
arXiv: 2602.04565v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Proposes DU-VLM, a model for understanding image degradation and applying that understanding to image restoration, by framing the problem as a hierarchical structured prediction task and using multimodal chain-of-thought reasoning.

Key Contributions

  • Redefines image degradation understanding as a hierarchical structured prediction task
  • Proposes DU-VLM, built on the autoregressive next-token prediction paradigm
  • Constructs DU-110k, a large-scale dataset of clean-degraded image pairs with grounded physical annotations

Methodology

Trains DU-VLM, a multimodal chain-of-thought model, with supervised fine-tuning and reinforcement learning to predict degradation types, parameter keys, and continuous physical values, then applies it to image restoration.

Original Abstract

Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.
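The abstract's claim that prediction error "is bounded by the value-space quantization grid" follows from a standard property of uniform quantization: if a continuous parameter is discretized onto a token grid, the round-trip error can never exceed half the grid spacing. A minimal sketch (function names, ranges, and bin count are illustrative, not from the paper):

```python
import numpy as np

def quantize(v, lo=0.0, hi=1.0, n_bins=256):
    """Map a continuous physical value in [lo, hi] to a discrete token index."""
    idx = np.round((v - lo) / (hi - lo) * (n_bins - 1))
    return int(np.clip(idx, 0, n_bins - 1))

def dequantize(idx, lo=0.0, hi=1.0, n_bins=256):
    """Recover the value at the grid point for a token index."""
    return lo + idx / (n_bins - 1) * (hi - lo)

# Worst-case round-trip error is half the grid spacing,
# since rounding maps every value to its nearest grid point.
grid = 1.0 / (256 - 1)
vals = np.random.default_rng(0).uniform(0.0, 1.0, 10_000)
errs = np.abs([dequantize(quantize(v)) - v for v in vals])
assert errs.max() <= grid / 2 + 1e-12
```

A finer grid (larger `n_bins`) tightens the bound at the cost of a larger value vocabulary, which is the trade-off implied by framing continuous estimation as next-token prediction.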

Tags

Image Degradation Understanding · Vision-Language Model · Multimodal Learning · Image Restoration

arXiv Category

cs.CV