Multimodal Learning Relevance: 9/10

Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen
arXiv: 2603.18627v1 Published: 2026-03-19 Updated: 2026-03-19

AI Summary

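Proposes AFS-Search, a closed-loop framework that uses VLM guidance to improve the quality of spatially constrained text-to-image generation.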

Main Contributions

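  • Introduces the AFS-Search framework to address semantic ambiguity and error accumulation in T2I generation
  • Uses a VLM as a semantic critic to dynamically steer the generation process
  • Proposes a parallel rollout search strategy to optimize the generation path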

Methodology

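Built on FLUX.1-dev, the method uses VLM guidance to recast T2I generation as a sequential decision-making problem and optimizes it in a closed loop.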
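To make the idea of rollout search over a sampling trajectory concrete, here is a minimal toy sketch in Python. It is not the paper's algorithm: the 2D state, the `velocity` field, the `critic` reward (a stand-in for a VLM-based score), and every function name are hypothetical and chosen only for illustration. At each Euler step it proposes a few candidate steering vectors, completes each with an unsteered lookahead rollout, and commits to the candidate whose simulated endpoint scores best.

```python
import random

def velocity(x, t):
    # Toy velocity field (hypothetical stand-in for a flow model): drift toward the origin.
    return [-xi for xi in x]

def euler_step(x, t, dt, steer=None):
    # One Euler step of the ODE, optionally adding a steering vector to the velocity.
    v = velocity(x, t)
    if steer is not None:
        v = [vi + si for vi, si in zip(v, steer)]
    return [xi + dt * vi for xi, vi in zip(x, v)]

def rollout(x, t, dt, steps_left):
    # Lookahead simulation: integrate the unsteered ODE for the remaining steps.
    for _ in range(steps_left):
        x = euler_step(x, t, dt)
        t += dt
    return x

def critic(x, target):
    # Hypothetical reward (stand-in for a VLM score): negative squared distance to a target.
    return -sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def rollout_search(x0, target, num_steps=10, num_candidates=4, scale=1.0, seed=0):
    rng = random.Random(seed)
    dt = 1.0 / num_steps
    x, t = list(x0), 0.0
    for step in range(num_steps):
        # Candidate None means "no steering", so each step can fall back to the open-loop move.
        candidates = [None] + [
            [rng.gauss(0.0, scale) for _ in x] for _ in range(num_candidates)
        ]

        def lookahead_score(steer):
            nxt = euler_step(x, t, dt, steer)
            final = rollout(nxt, t + dt, dt, num_steps - step - 1)
            return critic(final, target)

        # Candidates are independent, so a real system could evaluate them in parallel.
        best = max(candidates, key=lookahead_score)
        x = euler_step(x, t, dt, best)
        t += dt
    return x

target = [1.0, 1.0]
open_loop = rollout([2.0, -1.0], 0.0, 0.1, 10)
searched = rollout_search([2.0, -1.0], target)
```

Because the unsteered candidate is always among the options and the lookahead is deterministic, the critic score of `searched` in this toy setup is never below that of `open_loop`. The candidates are scored in a plain loop here; batching them is what would make the search parallel in practice.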

Original Abstract

Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

Tags

Text-to-Image Generation, VLM, Closed-Loop Optimization, Rollout Search

arXiv Categories

cs.AI