Adapting Vision-Language Models for E-commerce Understanding at Scale
AI Summary
For e-commerce scenarios, the paper proposes a method for adapting general-purpose vision-language models and constructs a new evaluation framework.
Key Contributions
- Proposes a strategy for adapting general-purpose VLMs to e-commerce scenarios
- Constructs a comprehensive evaluation suite for e-commerce product understanding
- Verifies that the method improves e-commerce performance while preserving general multimodal capabilities
Methodology
Through a large-scale experimental study, the authors explore how to adapt general-purpose VLMs to the characteristics of e-commerce data in a targeted way, and evaluate the resulting performance.
Original Abstract
E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.