Multimodal Learning Relevance: 9/10

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering

Hanshen Zhu, Yuliang Liu, Xuecheng Wu, An-Lan Wang, Hao Feng, Dingkang Yang, Chao Feng, Can Huang, Jingqun Tang, Xiang Bai
arXiv: 2602.20903v1 Published: 2026-02-24 Updated: 2026-02-24

AI Summary

TextPecker improves the structural fidelity and semantic alignment of visual text rendering by quantifying structural anomalies.

Main Contributions

  • Proposes TextPecker, a plug-and-play structural-anomaly-aware reinforcement learning strategy.
  • Constructs a recognition dataset with character-level structural-anomaly annotations.
  • Develops a stroke-editing synthesis engine to expand coverage of structural errors (see the sketch after this list).
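A minimal sketch of what a stroke-editing synthesis step could look like, assuming it perturbs clean glyph bitmaps to produce labeled structural errors. The function name, perturbation types, and parameters below are illustrative assumptions; the paper's engine operates at the stroke level and is not detailed in this digest.

```python
import numpy as np
from scipy import ndimage


def synthesize_structural_error(glyph: np.ndarray, rng: np.random.Generator):
    """Perturb a binary glyph image and return (perturbed_image, anomaly_label).

    This is an illustrative sketch, not the paper's synthesis engine.
    """
    kind = rng.choice(["missing_stroke", "blur", "distortion"])
    out = glyph.astype(float).copy()

    if kind == "missing_stroke":
        # Erase a random rectangular patch to imitate a dropped stroke segment.
        h, w = out.shape
        y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
        out[y : y + h // 4, x : x + w // 4] = 0.0
    elif kind == "blur":
        # Gaussian blur to imitate smeared, low-fidelity strokes.
        out = ndimage.gaussian_filter(out, sigma=rng.uniform(1.0, 3.0))
    else:
        # Mild random shear to imitate distorted character structure.
        shear = rng.uniform(-0.2, 0.2)
        matrix = np.array([[1.0, shear], [0.0, 1.0]])
        out = ndimage.affine_transform(out, matrix, order=1)

    return out, kind
```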

Methodology

Builds a structural-anomaly-aware reinforcement learning strategy that trains the model to recognize and correct structural anomalies in rendered text, improving text fidelity; a reward-shaping sketch follows below.
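A minimal sketch of how such a reward could be plugged into an RL fine-tuning loop, assuming the core idea is to replace a noisy text-correctness reward with one that also scores character-level structural quality. The detector and OCR callables, the weighting, and the group-normalized advantage are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List
import numpy as np


def anomaly_aware_reward(
    image,
    target_text: str,
    ocr_read: Callable[[object], str],
    structural_score: Callable[[object, str], List[float]],
    w_struct: float = 0.5,
) -> float:
    """Combine semantic alignment (OCR exact match) with per-character structure scores."""
    predicted = ocr_read(image)
    semantic = float(predicted == target_text)        # crude exact-match alignment term
    per_char = structural_score(image, target_text)   # e.g. 1.0 = clean, 0.0 = anomalous character
    structural = float(np.mean(per_char)) if per_char else 0.0
    return (1.0 - w_struct) * semantic + w_struct * structural


def group_advantages(rewards: List[float]) -> List[float]:
    """Group-normalized advantages (GRPO-style) for a batch of sampled images."""
    r = np.asarray(rewards, dtype=float)
    return ((r - r.mean()) / (r.std() + 1e-6)).tolist()
```

Because the reward only consumes the generated image and target string, it can wrap any text-to-image generator, which matches the plug-and-play claim in the abstract.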

Original Abstract

Visual Text Rendering (VTR) remains a critical challenge in text-to-image generation, where even advanced models frequently produce text with structural anomalies such as distortion, blurriness, and misalignment. However, we find that leading MLLMs and specialist OCR models largely fail to perceive these structural anomalies, creating a critical bottleneck for both VTR evaluation and RL-based optimization. As a result, even state-of-the-art generators (e.g., SeedDream4.0, Qwen-Image) still struggle to render structurally faithful text. To address this, we propose TextPecker, a plug-and-play structural anomaly perceptive RL strategy that mitigates noisy reward signals and works with any text-to-image generator. To enable this capability, we construct a recognition dataset with character-level structural-anomaly annotations and develop a stroke-editing synthesis engine to expand structural-error coverage. Experiments show that TextPecker consistently improves diverse text-to-image models; even on the well-optimized Qwen-Image, it yields significant average gains of 4% in structural fidelity and 8.7% in semantic alignment for Chinese text rendering, establishing a new state-of-the-art in high-fidelity VTR. Our work fills a gap in VTR optimization, providing a foundational step towards reliable and structurally faithful visual text generation.

Tags

Visual Text Rendering · Reinforcement Learning · Structural Anomaly Detection

arXiv Categories

cs.CV