Multimodal Learning Relevance: 9/10

History-Guided Iterative Visual Reasoning with Self-Correction

Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang
arXiv: 2602.04413v1 Published: 2026-02-04 Updated: 2026-02-04

AI Summary

Proposes the H-GIVR framework, which guides iterative visual reasoning with historical information and dynamically corrects errors, improving the reasoning accuracy of multimodal large language models.

Key Contributions

  • Proposes H-GIVR, a history-guided iterative visual reasoning framework
  • Uses historical reasoning information to dynamically correct visual understanding errors
  • Validates the effectiveness of H-GIVR across multiple datasets and models

Methodology

The H-GIVR framework has the MLLM observe the image multiple times and use previously generated answers as references for subsequent steps, enabling dynamic error correction.
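The iterative loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `query_mllm` is a hypothetical callable standing in for a real multimodal model query (here simulated by a scripted stub), and the convergence and fallback rules are assumptions consistent with the abstract's "repeated verification" description.

```python
def h_givr(query_mllm, image, question, max_iters=5):
    """History-guided iterative reasoning sketch: re-query the model,
    passing prior answers as references, and stop once the answer
    stabilizes across two consecutive iterations."""
    history = []
    for _ in range(max_iters):
        # The model sees the image again plus its own answer history.
        answer = query_mllm(image, question, history)
        history.append(answer)
        if len(history) >= 2 and history[-1] == history[-2]:
            # Answer repeated: treat as converged.
            return answer, len(history)
    # No convergence within budget: fall back to a majority vote.
    return max(set(history), key=history.count), len(history)


def make_stub_mllm(scripted_answers):
    """Stub MLLM that replays scripted answers, simulating a model
    that corrects an initial visual mistake on later passes."""
    it = iter(scripted_answers)

    def query(image, question, history):
        return next(it, scripted_answers[-1])

    return query
```

For example, a stub scripted as `["3 cats", "4 cats", "4 cats"]` converges to "4 cats" after three responses, mirroring the low average response counts (e.g. 2.57 per question) reported in the abstract.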

Original Abstract

Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed "repeated sampling and voting" paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using Llama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.

Tags

Multimodal Learning · Visual Reasoning · Self-Consistency · Iterative Reasoning · Dynamic Error Correction

arXiv Categories

cs.CL cs.AI cs.MM