AI Agents Relevance: 9/10

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, Dasen Dai, Bolin Jiang, Manyuan Zhang, Shi-Xue Zhang, Zhengkai Jiang, Lucas Wang, Zhao Zhong, Yu Cheng, Nanyun Peng
arXiv: 2603.29620v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

Unify-Agent uses an agent framework to improve the quality of world-knowledge-grounded image generation.

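Key Contributions

  • Proposes Unify-Agent, a unified multimodal agent for world-knowledge-grounded image generation.
  • Builds a high-quality multimodal data pipeline comprising 143K agent trajectories.
  • Introduces the FactIP benchmark for evaluating models on long-tail knowledge concepts.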

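Methodology

Models image generation as an agentic pipeline consisting of prompt understanding, evidence search, recaptioning, and final synthesis, and trains the model on custom-curated data.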
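
The four stages named above can be pictured as a chain of functions that each add to a shared trajectory record. The following is only a minimal sketch under stated assumptions, not the authors' implementation: every name here (`Trajectory`, `understand`, `search`, `recaption`, `synthesize`, `run_pipeline`) is hypothetical, the knowledge source is a toy in-memory dictionary standing in for multimodal search, and the "image" is a placeholder string standing in for a generator call.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical record of one pass through the four stages."""
    prompt: str
    concepts: List[str] = field(default_factory=list)
    evidence: List[str] = field(default_factory=list)
    grounded_caption: str = ""
    image: str = ""


def understand(prompt: str) -> List[str]:
    # Stand-in for prompt understanding: treat capitalized words
    # as the concepts that may need external grounding.
    return [w.strip(".,") for w in prompt.split() if w[:1].isupper()]


def search(concepts: List[str], knowledge: Dict[str, str]) -> List[str]:
    # Stand-in for evidence search: look up each concept in a toy
    # dictionary (an assumption; a real agent would query external sources).
    return [knowledge[c] for c in concepts if c in knowledge]


def recaption(prompt: str, evidence: List[str]) -> str:
    # Stand-in for grounded recaptioning: append retrieved facts to the prompt.
    if not evidence:
        return prompt
    return prompt + " [" + "; ".join(evidence) + "]"


def synthesize(caption: str) -> str:
    # Stand-in for final synthesis: return a placeholder instead of pixels.
    return f"<image for: {caption}>"


def run_pipeline(prompt: str, knowledge: Dict[str, str]) -> Trajectory:
    # Run the stages in order, recording every intermediate result.
    traj = Trajectory(prompt=prompt)
    traj.concepts = understand(prompt)
    traj.evidence = search(traj.concepts, knowledge)
    traj.grounded_caption = recaption(prompt, traj.evidence)
    traj.image = synthesize(traj.grounded_caption)
    return traj


if __name__ == "__main__":
    toy_knowledge = {"Foo": "Foo is a red lighthouse"}  # made-up entry
    print(run_pipeline("a painting of Foo at dusk", toy_knowledge))
```

Keeping each intermediate result on the record is a deliberate choice in this sketch: a stored sequence of stage outputs is one plausible shape for the agent trajectories that the summary says are used for training.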

Original Abstract

Unified multimodal models provide a natural and promising architecture for understanding diverse and complex real-world knowledge while generating high-quality images. However, they still rely primarily on frozen parametric knowledge, which makes them struggle with real-world image generation involving long-tail and knowledge-intensive concepts. Inspired by the broad success of agents on real-world tasks, we explore agentic modeling to address this limitation. Specifically, we present Unify-Agent, a unified multimodal agent for world-grounded image synthesis, which reframes image generation as an agentic pipeline consisting of prompt understanding, multimodal evidence searching, grounded recaptioning, and final synthesis. To train our model, we construct a tailored multimodal data pipeline and curate 143K high-quality agent trajectories for world-grounded image synthesis, enabling effective supervision over the full agentic generation process. We further introduce FactIP, a benchmark covering 12 categories of culturally significant and long-tail factual concepts that explicitly requires external knowledge grounding. Extensive experiments show that our proposed Unify-Agent substantially improves over its base unified model across diverse benchmarks and real world generation tasks, while approaching the world knowledge capabilities of the strongest closed-source models. As an early exploration of agent-based modeling for world-grounded image synthesis, our work highlights the value of tightly coupling reasoning, searching, and generation for reliable open-world agentic image synthesis.

Tags

AI Agents Multimodal Learning Image Generation World Knowledge

arXiv Categories

cs.CV cs.MM