Understanding Degradation with Vision Language Model
AI Summary
Proposes the DU-VLM model for understanding image degradations and applying that understanding to image restoration, via a hierarchical structured prediction task and multimodal chain-of-thought reasoning.
Key Contributions
- Redefines image degradation understanding as a hierarchical structured prediction task
- Proposes DU-VLM, a model built on the autoregressive next-token prediction paradigm
- Constructs DU-110k, a large-scale dataset of clean-degraded image pairs with physical annotations
Methodology
DU-VLM, a multimodal chain-of-thought model, is trained with supervised fine-tuning and reinforcement learning to predict degradation types, parameter keys, and continuous physical values, which are then used to drive image restoration.
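To make the value-space quantization idea concrete, here is a minimal sketch (not the authors' code; the parameter range and bin count are illustrative assumptions) of how a continuous physical value, such as a blur strength, can be mapped onto a fixed grid of discrete tokens so that next-token prediction can emit it, with reconstruction error bounded by half the grid step:

```python
def quantize(value, lo, hi, n_bins):
    """Map a continuous value in [lo, hi] to a discrete bin index (a token id)."""
    value = min(max(value, lo), hi)          # clamp to the supported range
    step = (hi - lo) / n_bins                # width of one grid cell
    return min(int((value - lo) / step), n_bins - 1)

def dequantize(idx, lo, hi, n_bins):
    """Recover a continuous value as the centre of the chosen bin."""
    step = (hi - lo) / n_bins
    return lo + (idx + 0.5) * step

# Hypothetical example: a Gaussian-blur sigma in [0, 5] on a 256-bin grid.
sigma = 1.37
idx = quantize(sigma, 0.0, 5.0, 256)
recovered = dequantize(idx, 0.0, 5.0, 256)
# Round-trip error is bounded by half the grid step, i.e. (5/256)/2 ≈ 0.0098.
assert abs(recovered - sigma) <= (5.0 / 256) / 2
```

This illustrates why the paper's error bound is tied to the quantization grid: finer grids (more bins, hence more value tokens) shrink the worst-case error linearly.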
Original Abstract
Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.