Multimodal Learning Relevance: 8/10

GlyphBanana: Advancing Precise Text Rendering Through Agentic Workflows

Zexuan Yan, Jiarui Jin, Yue Ma, Shijian Wang, Jiahui Hu, Wenxiang Jiao, Yuan Lu, Linfeng Zhang

arXiv: 2603.12155v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

GlyphBanana improves text-rendering precision through an agentic workflow and glyph template injection, particularly for complex characters and formulas.

Key Contributions

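Proposes GlyphBanana, an agentic workflow for precise text rendering
Introduces a benchmark designed specifically for rendering complex characters and formulas
Proposes a training-free approach that can be applied to a variety of Text-to-Image (T2I) models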

Methodology

Uses an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and the attention maps, iteratively refining the generated image.
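
The check-and-inject loop described above can be pictured with a toy sketch. This is not the authors' implementation: the names `inject_template`, `glyph_error`, and `refine` are hypothetical, the "latent" is just a short list of floats, the injection is a plain linear blend, and the error function is a stand-in for whatever auxiliary checking tool a real workflow would call. Attention-map injection is not modeled at all. It only illustrates the control flow of repeating an injection step until a check passes or a round budget runs out.

```python
import random


def inject_template(latent, template, strength):
    """Blend a glyph template into a latent (toy linear interpolation)."""
    return [(1.0 - strength) * z + strength * t for z, t in zip(latent, template)]


def glyph_error(latent, template):
    """Stand-in for a checking tool: mean absolute deviation from the template."""
    return sum(abs(z - t) for z, t in zip(latent, template)) / len(template)


def refine(template, tolerance=0.05, max_rounds=10, step=0.3, seed=0):
    """Check the current result; if it fails, inject the template and retry."""
    rng = random.Random(seed)
    # Toy stand-in for an initial generation that ignores the template.
    latent = [rng.uniform(-1.0, 1.0) for _ in template]
    history = [glyph_error(latent, template)]
    for _ in range(max_rounds):
        if history[-1] <= tolerance:
            break  # the check passed, stop refining
        latent = inject_template(latent, template, step)
        history.append(glyph_error(latent, template))
    return latent, history


# Hypothetical 1-D "glyph mask" used only for illustration.
template = [0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
latent, history = refine(template)
print("rounds used:", len(history) - 1)
print("error per round:", [round(e, 3) for e in history])
```

In this toy, each injection shrinks the deviation by a factor of `1 - step`, so the error falls monotonically. A real system would have no such guarantee, which is one reason to cap the number of rounds.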

Original Abstract

Despite recent advances in generative models driving significant progress in text rendering, accurately generating complex text and mathematical formulas remains a formidable challenge. This difficulty primarily stems from the limited instruction-following capabilities of current models when encountering out-of-distribution prompts. To address this, we introduce GlyphBanana, alongside a corresponding benchmark specifically designed for rendering complex characters and formulas. GlyphBanana employs an agentic workflow that integrates auxiliary tools to inject glyph templates into both the latent space and attention maps, facilitating the iterative refinement of generated images. Notably, our training-free approach can be seamlessly applied to various Text-to-Image (T2I) models, achieving superior precision compared to existing baselines. Extensive experiments demonstrate the effectiveness of our proposed workflow. Associated code is publicly available at https://github.com/yuriYanZeXuan/GlyphBanana.

arXiv Categories

cs.CV cs.AI
