LanteRn: Latent Visual Structured Reasoning
AI Summary
LanteRn introduces compact latent visual representations into LMMs, improving fine-grained visual understanding in multimodal reasoning.
Main Contributions
- Proposes the LanteRn framework, which enables LMMs to perform visual reasoning directly in latent space
- Trains the model via supervised fine-tuning and reinforcement learning to align visual features with task-level utility
- Achieves consistent improvements across multiple visual reasoning benchmarks
Methodology
LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. Training proceeds in two stages: supervised fine-tuning followed by reinforcement learning.
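The core interleaving idea can be sketched in a toy form: during decoding, the attention query ranges over a mixed sequence of language-token embeddings and continuous latent visual states, rather than over verbalized or pixel-space content. This is a minimal numpy illustration under assumed shapes, not the paper's actual architecture; all names (`attend`, `visual_thoughts`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

def attend(query, keys, values):
    """Scaled dot-product attention of one query over a mixed sequence."""
    scores = keys @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

# Hypothetical interleaved context: language-token embeddings plus
# continuous "visual thought" embeddings the model itself generated
# (latent states, no pixel-space rendering involved).
text_embs = rng.normal(size=(3, d))        # language tokens
visual_thoughts = rng.normal(size=(2, d))  # latent visual states

sequence = np.vstack([text_embs[:2], visual_thoughts, text_embs[2:]])

# The next-token query attends jointly over language and latent visual states.
query = rng.normal(size=d)
context = attend(query, sequence, sequence)
assert context.shape == (d,)
```

The point of the sketch is that the visual thoughts live in the same embedding space as the text tokens, so no external tool call or image generation step is needed at inference time.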
Original Abstract
While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning directly in pixel space. In this paper, we introduce LanteRn, a framework that enables LMMs to interleave language with compact latent visual representations, allowing visual reasoning to occur directly in latent space. LanteRn augments a vision-language transformer with the ability to generate and attend to continuous visual thought embeddings during inference. We train the model in two stages: supervised fine-tuning to ground visual features in latent states, followed by reinforcement learning to align latent reasoning with task-level utility. We evaluate LanteRn on three perception-centric benchmarks (VisCoT, V*, and Blink), observing consistent improvements in visual grounding and fine-grained reasoning. These results suggest that internal latent representations provide a promising direction for more efficient multimodal reasoning.