Multimodal Learning Relevance: 9/10

RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward

Qiucheng Wu, Jing Shi, Simon Jenni, Kushal Kafle, Tianyu Wang, Shiyu Chang, Handong Zhao
arXiv: 2602.17558v1 Published: 2026-02-19 Updated: 2026-02-19

AI Summary

RetouchIQ proposes an MLLM-based image retouching framework built on a generalist reward model, improving the semantic consistency and perceptual quality of image edits.

Key Contributions

  • Proposes the RetouchIQ framework for instruction-driven, executable image editing.
  • Proposes a generalist reward model: an MLLM fine-tuned with RL to evaluate editing results.
  • Curates an image-editing dataset of 190k instruction-reasoning pairs and establishes a new benchmark.

Methodology

An MLLM serves as the agent, using reinforcement learning to plan tool use within image-editing software, with training guided by the generalist reward model.
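The training loop described above can be illustrated with a minimal, self-contained sketch. Everything here is an assumption for illustration only: the tool names, the stub `generalist_reward` judge, and the tiny softmax policy stand in for the actual MLLM agent and RL fine-tuned reward model, and the update is plain REINFORCE rather than the paper's (unspecified) RL algorithm.

```python
import math
import random

random.seed(0)

# Hypothetical "tool palette": each action stands in for an executable
# adjustment the agent could invoke in editing software.
ACTIONS = ["exposure+0.5", "exposure-0.5", "contrast+10", "saturation-10"]

# Toy policy: softmax over per-action logits, a stand-in for the MLLM
# agent's tool-use planning head.
logits = [0.0] * len(ACTIONS)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sample_action(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generalist_reward(action_idx):
    # Stand-in for the RL fine-tuned MLLM judge: it scores an edit with a
    # scalar. Here the (imagined) instruction is best satisfied by raising
    # exposure, so that action scores highest.
    return 1.0 if ACTIONS[action_idx] == "exposure+0.5" else 0.1

# REINFORCE with a running-average baseline: nudge the sampled action's
# logit in proportion to its advantage under the scalar reward.
baseline, lr = 0.0, 0.5
for step in range(200):
    probs = softmax(logits)
    a = sample_action(probs)
    r = generalist_reward(a)
    baseline = 0.9 * baseline + 0.1 * r
    adv = r - baseline
    for i in range(len(logits)):
        grad = (1.0 - probs[i]) if i == a else -probs[i]  # d log pi / d logit_i
        logits[i] += lr * adv * grad

best = ACTIONS[max(range(len(ACTIONS)), key=lambda i: softmax(logits)[i])]
print(best)
```

After a few hundred updates the policy concentrates on the highest-reward tool call, which is the mechanism the paper scales up: the reward model's scalar, instruction-conditioned feedback replaces a fixed-reference similarity metric as the training signal.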

Original Abstract

Recent advances in multimodal large language models (MLLMs) have shown great potential for extending vision-language reasoning to professional tool-based image editing, enabling intuitive and creative editing. A promising direction is to use reinforcement learning (RL) to enable MLLMs to reason about and execute optimal tool-use plans within professional image-editing software. However, training remains challenging due to the lack of reliable, verifiable reward signals that can reflect the inherently subjective nature of creative editing. In this work, we introduce RetouchIQ, a framework that performs instruction-based executable image editing through MLLM agents guided by a generalist reward model. RetouchIQ interprets user-specified editing intentions and generates corresponding, executable image adjustments, bridging high-level aesthetic goals with precise parameter control. To move beyond conventional, rule-based rewards that compute similarity against a fixed reference image using handcrafted metrics, we propose a generalist reward model, an RL fine-tuned MLLM that evaluates edited results through a set of generated metrics on a case-by-case basis. Then, the reward model provides scalar feedback through multimodal reasoning, enabling reinforcement learning with high-quality, instruction-consistent gradients. We curate an extended dataset with 190k instruction-reasoning pairs and establish a new benchmark for instruction-based image editing. Experiments show that RetouchIQ substantially improves both semantic consistency and perceptual quality over previous MLLM-based and diffusion-based editing systems. Our findings demonstrate the potential of generalist reward-driven MLLM agents as flexible, explainable, and executable assistants for professional image editing.

Tags

MLLM, Image Editing, Reinforcement Learning, Reward Modeling

arXiv Categories

cs.CV