Multimodal Learning (relevance: 9/10)

Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
arXiv: 2602.11733v1 Published: 2026-02-12 Updated: 2026-02-12

AI Summary

For the e-commerce setting, the paper proposes a method for adapting general-purpose vision-language models and builds a new evaluation framework.

Key Contributions

  • Proposes a strategy for adapting general-purpose VLMs to e-commerce scenarios
  • Builds a comprehensive evaluation suite for e-commerce product understanding
  • Shows that the approach improves e-commerce performance while preserving general multimodal capabilities

Methodology

Through a large-scale experimental study, the paper explores how to adapt general-purpose VLMs to the characteristics of e-commerce data in a targeted way, and evaluates the resulting performance.
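The paper does not spell out its adaptation recipe in this summary, but a common technique for adapting large pretrained models without full fine-tuning is low-rank adaptation (LoRA), where a frozen weight matrix is augmented with a small trainable low-rank update. The sketch below illustrates the idea with toy NumPy matrices; all names and dimensions are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Hypothetical LoRA-style sketch: adapt a frozen weight W with a
# low-rank update B @ A, so only (d_out + d_in) * rank new parameters
# are trained. Dimensions are toy values for illustration.
rng = np.random.default_rng(0)

d_out, d_in, rank = 8, 8, 2
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero-init

def adapted_forward(x, scale=1.0):
    """Forward pass through the adapted weight W + scale * (B @ A)."""
    return (W + scale * (B @ A)) @ x

x = rng.standard_normal(d_in)
# With B initialized to zero, the adapter is a no-op before training,
# so general capabilities of the base model are untouched at the start.
assert np.allclose(adapted_forward(x), W @ x)
```

Zero-initializing `B` is the standard LoRA choice: the adapted model starts out identical to the base model, which is one way to frame the paper's goal of improving domain performance without sacrificing general capabilities.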

Original Abstract

E-commerce product understanding demands, by its nature, strong multimodal comprehension from text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show, through a large-scale experimental study, how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Tags

Vision-Language Models E-commerce Multimodal Learning Model Adaptation

arXiv Categories

cs.CV cs.AI