Multimodal Learning Relevance: 9/10

VLM2Rec: Resolving Modality Collapse in Vision-Language Model Embedders for Multimodal Sequential Recommendation

Junyoung Kim, Woojoo Kim, Jaehyung Lim, Dongha Kim, Hwanjo Yu
arXiv: 2603.17450v1 Published: 2026-03-18 Updated: 2026-03-18

AI Summary

VLM2Rec proposes a VLM-based framework for sequential recommendation that addresses the modality-collapse problem arising in multimodal data.

Key Contributions

  • Identifies the modality-collapse problem that VLMs exhibit in multimodal sequential recommendation
  • Proposes Weak-modality Penalized Contrastive Learning to balance modality utilization
  • Proposes Cross-Modal Relational Topology Regularization to preserve cross-modal consistency

Methodology

Weak-modality Penalized Contrastive Learning rebalances gradients across modalities during optimization, while Cross-Modal Relational Topology Regularization preserves the geometric consistency between modalities.
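The gradient-rebalancing idea can be sketched as follows. This is a minimal illustration, not the paper's actual loss: the function names (`info_nce`, `weak_modality_penalized_loss`), the per-modality weighting scheme, and the `alpha` exponent are all assumptions; the paper's exact penalization rule may differ.

```python
# Hypothetical sketch of weak-modality penalized contrastive learning:
# up-weight the modality whose contrastive loss is currently higher
# (the "weak" modality), so one modality's gradients do not dominate.
import torch
import torch.nn.functional as F

def info_nce(q, k, tau=0.07):
    # Standard InfoNCE over in-batch negatives.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0))
    return F.cross_entropy(logits, labels)

def weak_modality_penalized_loss(user_emb, img_emb, txt_emb, alpha=1.0):
    # Per-modality contrastive losses against the user/sequence embedding.
    l_img = info_nce(user_emb, img_emb)
    l_txt = info_nce(user_emb, txt_emb)
    with torch.no_grad():
        # Relative weakness: a higher loss yields a larger weight
        # (weights sum to 2, so the total scale is roughly preserved).
        total = l_img + l_txt
        w_img = 2 * l_img / total
        w_txt = 2 * l_txt / total
    return (w_img ** alpha) * l_img + (w_txt ** alpha) * l_txt
```

Because the weights are computed under `no_grad`, they act as fixed per-step scaling factors on each modality's gradient rather than as extra optimization targets.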

Original Abstract

Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify their inherent modality collapse. In this state, optimization is dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.
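The second component in the abstract, preserving geometric consistency between modalities, can be illustrated with a relational-topology sketch. This is an assumption-laden toy version: the idea of matching intra-batch similarity graphs across modalities is one common way to encode "relational topology", but the paper's actual regularizer and the name `topology_regularizer` are hypothetical.

```python
# Hypothetical sketch of a cross-modal relational topology regularizer:
# penalize divergence between the intra-batch similarity structures
# (item-item relation graphs) of the image and text embedding spaces.
import torch
import torch.nn.functional as F

def topology_regularizer(img_emb, txt_emb):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    # Pairwise cosine-similarity matrices: each modality's view of
    # how items in the batch relate to one another.
    s_img = img @ img.t()
    s_txt = txt @ txt.t()
    # Geometric consistency: the two relation graphs should agree.
    return F.mse_loss(s_img, s_txt)
```

Unlike a direct image-text alignment loss, this term only constrains relative geometry (which items are close to which), so each modality can keep its own absolute embedding layout.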

Tags

Multimodal Learning · Sequential Recommendation · Vision-Language Models · Contrastive Learning

arXiv Categories

cs.IR cs.AI