Multimodal Learning Relevance: 9/10

Mario: Multimodal Graph Reasoning with Large Language Models

Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
arXiv: 2603.05181v1 Published: 2026-03-05 Updated: 2026-03-05

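AI Summary

Mario is a unified framework that uses LLMs to reason over multimodal graphs, addressing two problems: weak cross-modal consistency and heterogeneous modality preference.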

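Key Contributions

  • A graph-conditioned VLM design that improves cross-modal consistency through contrastive learning guided by graph topology
  • A modality-adaptive graph instruction tuning mechanism that uses a learnable router to select the most informative modality configuration for each node and its neighborhood
  • Consistently outperforms state-of-the-art graph models on diverse multimodal graph benchmarks, in both supervised and zero-shot settings, for node classification and link prediction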
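
To make the first contribution concrete, here is a minimal sketch of a topology-guided cross-modal contrastive loss. It is not the authors' implementation: the choice of positives (a node's own image plus its neighbors' images), the InfoNCE form, the temperature value, and all function names are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]

def graph_contrastive_loss(text_emb, image_emb, neighbors, temperature=0.1):
    """Text-to-image InfoNCE with graph-defined positives (assumed design).

    text_emb, image_emb: one vector per node, same node order.
    neighbors: dict mapping a node index to an iterable of neighbor indices.
    """
    text = [normalize(t) for t in text_emb]
    image = [normalize(m) for m in image_emb]
    n = len(text)
    total = 0.0
    for i in range(n):
        logits = [dot(text[i], image[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidate images.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        # Assumption: node i's own image and its neighbors' images are positives.
        positives = {i} | set(neighbors.get(i, ()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
    return total / n
```

With an empty `neighbors` dict this reduces to the ordinary one-directional InfoNCE loss; adding edges pulls a node's text embedding toward the images of its neighbors as well.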

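Methodology

Using the graph-conditioned VLM and modality-adaptive graph instruction tuning, Mario organizes aligned multimodal features into graph-aware instruction views and guides the LLM to reason over them.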
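
The routing step can be pictured with a small sketch. The view names, the linear scoring layer, and the hard argmax selection below are all hypothetical; the summary only states that a learnable router picks a modality configuration, not how it is parameterized.

```python
import math

# Hypothetical instruction views; the paper's actual view set is not given here.
VIEWS = ("text_only", "image_only", "text_and_image")

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def route(node_feature, router_weights):
    """Score each view with a linear layer and pick the most probable one.

    node_feature: feature vector summarizing a node and its neighborhood.
    router_weights: one weight row per view (these would be learned in training).
    """
    scores = [sum(w * x for w, x in zip(row, node_feature)) for row in router_weights]
    probs = softmax(scores)
    best = max(range(len(VIEWS)), key=probs.__getitem__)
    return VIEWS[best], probs
```

In a trainable version the soft probabilities, rather than the hard choice, would typically carry the gradient; the argmax here only shows which view would be surfaced to the LLM for a given node.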

Original Abstract

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.

Tags

Multimodal Learning, Graph Neural Networks, Large Language Models, Reasoning

arXiv Category

cs.CV