Vision-aligned Latent Reasoning for Multi-modal Large Language Model
AI Summary
VaLR dynamically generates vision-aligned latent tokens, improving an MLLM's ability to retain visual information during multi-step reasoning.
Main Contributions
- Proposes the Vision-aligned Latent Reasoning (VaLR) framework
- VaLR preserves visual knowledge by aligning the MLLM's intermediate embeddings with those of the vision encoder
- VaLR significantly outperforms existing methods on tasks requiring long-context understanding and precise visual perception
Methodology
Before each Chain-of-Thought reasoning step, VaLR dynamically generates vision-aligned latent tokens and trains them to align with the vision encoder's embeddings, thereby preserving visual knowledge throughout reasoning.
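The summary does not specify the exact alignment objective. As a minimal, hypothetical sketch, one common choice for aligning two sets of embeddings is a mean cosine-distance loss between each latent-token embedding and its target vision-encoder embedding; the function name and pairing scheme below are assumptions, not the paper's actual formulation:

```python
import math

def cosine_alignment_loss(latent_embeddings, visual_embeddings):
    """Mean cosine distance (1 - cosine similarity) between paired embeddings.

    Hypothetical sketch of an alignment objective: latent_embeddings are the
    MLLM's intermediate embeddings for the latent tokens, visual_embeddings
    are the corresponding vision-encoder targets. Assumes a one-to-one pairing.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    losses = [1.0 - cosine(a, b)
              for a, b in zip(latent_embeddings, visual_embeddings)]
    return sum(losses) / len(losses)
```

Under this sketch the loss is 0 when each latent embedding points in the same direction as its visual target and grows toward 2 as they oppose, so minimizing it during training pulls the latent tokens toward the vision encoder's representation space.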
Original Abstract
Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.