Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination
AI Summary
Enhances zero-shot commonsense reasoning by supplementing textual inputs with visual knowledge via machine imagination, effectively mitigating the biases inherent in textual knowledge.
Main Contributions
- Proposes the Imagine framework, which embeds an image generator into the reasoning pipeline.
- Constructs synthetic datasets that emulate visual question-answering scenarios to aid utilization of the visual context.
- Experiments show that Imagine substantially outperforms existing zero-shot methods and large language models.
Methodology
Supplements textual inputs with machine-generated images, leveraging the visual signals to enhance the PLM's reasoning ability, and trains the model on synthetic data.
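The pipeline described above can be sketched as a simple loop: imagine a visual context for the question, then score each answer option conditioned on both modalities. The sketch below is hypothetical, not the authors' code; `imagine` and `score` are toy stand-ins for the text-to-image generator and the vision-language PLM so the control flow is runnable.

```python
# Hypothetical sketch of an Imagine-style zero-shot inference loop.
# imagine() and score() are toy stand-ins for the real image generator
# and vision-language PLM assumed by the framework.

def imagine(question: str) -> str:
    """Stand-in for a text-to-image model: returns a pseudo 'image'."""
    return f"<image:{question}>"

def score(question: str, image: str, option: str) -> float:
    """Stand-in for a vision-language PLM scoring an answer option.
    Toy heuristic: word overlap between the question and the option."""
    q_words = set(question.lower().split())
    return float(len(q_words & set(option.lower().split())))

def answer(question: str, options: list[str]) -> str:
    """Zero-shot QA: imagine a visual context, then return the
    highest-scoring option given both text and image."""
    image = imagine(question)  # machine imagination step
    return max(options, key=lambda o: score(question, image, o))

print(answer("What do people use to cut paper?",
             ["scissors cut paper", "a hammer", "a pillow"]))
```

In the real framework, the stand-ins would be replaced by an actual generator and a multimodal scorer trained on the synthetic VQA-style data.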
Original Abstract
Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to understanding discrepancies between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models.