Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
AI Summary
Unify-Agent improves the quality of world-knowledge-driven image generation through an agent framework.
Main Contributions
- Proposes Unify-Agent, a unified multimodal agent for world-knowledge-driven image generation.
- Constructs a high-quality multimodal data pipeline and curates 143K agent trajectories.
- Introduces the FactIP benchmark for evaluating model performance on long-tail knowledge concepts.
Methodology
Reframes image generation as an agentic pipeline consisting of prompt understanding, evidence searching, content recaptioning, and final synthesis, trained on tailored data.
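The four-stage pipeline can be sketched as a chain of stages that passes a prompt through understanding, retrieval, grounded rewriting, and synthesis. This is a minimal illustrative sketch, not the paper's actual implementation: all function names, the toy entity heuristic, and the stubbed knowledge base and generator are assumptions made for clarity.

```python
# Hypothetical sketch of the agentic pipeline described above.
# Every stage here is a stub; the real system would use a unified
# multimodal model and actual multimodal search.

def understand_prompt(prompt: str) -> dict:
    """Stage 1: parse the prompt and flag concepts that may need external knowledge."""
    # Toy heuristic (an assumption): treat capitalized words as candidate entities.
    entities = [w for w in prompt.split() if w[:1].isupper()]
    return {"prompt": prompt, "entities": entities}

def search_evidence(entities: list) -> list:
    """Stage 2: retrieve multimodal evidence (stubbed with a canned knowledge base)."""
    knowledge_base = {
        "Hanfu": "traditional Han Chinese garment with a crossed collar and wide sleeves",
    }
    return [knowledge_base[e] for e in entities if e in knowledge_base]

def recaption(prompt: str, evidence: list) -> str:
    """Stage 3: rewrite the prompt so it is grounded in the retrieved evidence."""
    if evidence:
        return prompt + " (" + "; ".join(evidence) + ")"
    return prompt

def synthesize(grounded_prompt: str) -> str:
    """Stage 4: invoke the image generator (stubbed: return a placeholder string)."""
    return f"<image generated from: {grounded_prompt}>"

def unify_agent_pipeline(prompt: str) -> str:
    """Chain the four stages end to end."""
    state = understand_prompt(prompt)
    evidence = search_evidence(state["entities"])
    grounded = recaption(prompt, evidence)
    return synthesize(grounded)

print(unify_agent_pipeline("A woman wearing Hanfu in a garden"))
```

The point of the sketch is the control flow: a long-tail concept ("Hanfu") is detected, grounded with retrieved evidence, and only then handed to the generator, instead of relying on frozen parametric knowledge.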
Original Abstract
Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly require external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real-world generation tasks, while approaching the world-knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.