Multimodal Learning Relevance: 9/10

E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng
arXiv: 2602.08355v1 Published: 2026-02-09 Updated: 2026-02-09

AI Summary

Proposes E-VAds, a benchmark for e-commerce short video understanding, and designs E-VAds-R1, an RL-based reasoning model.

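Key Contributions

  • Proposes a multi-modal information density assessment framework that quantifies the complexity of e-commerce videos
  • Builds E-VAds, an e-commerce short video understanding benchmark with high-quality videos and open-ended Q&A pairs
  • Develops E-VAds-R1, an RL-based reasoning model that improves commercial intent reasoning performance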

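Methodology

Uses a multi-agent system to generate Q&A pairs, designs an RL model for reasoning, and optimizes it with a multi-grained reward mechanism (MG-GRPO).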

Original Abstract

E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect the reasoning of commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, namely Perception and Cognition and Reasoning, which consist of five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.
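
The abstract describes MG-GRPO only qualitatively: smooth guidance early on, plus a non-linear incentive for high precision. The sketch below is one possible reading of that idea, not the paper's actual reward. The function `multi_grained_reward`, its `threshold` and `bonus_power` parameters, and the assumption that an answer is scored in [0, 1] are all hypothetical. Only the group-relative normalization in `group_relative_advantages` follows the standard GRPO formulation (each reward minus the group mean, divided by the group standard deviation).

```python
from statistics import mean, pstdev


def multi_grained_reward(score: float, threshold: float = 0.8,
                         bonus_power: float = 2.0) -> float:
    """Hypothetical multi-grained reward (an assumption, not the paper's formula).

    Below `threshold` the reward equals the raw score, giving a smooth,
    dense signal. Above it, a convex bonus is added so that near-perfect
    answers are rewarded disproportionately.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must lie in [0, 1)")
    if score <= threshold:
        return score
    # Rescale the part above the threshold to [0, 1], then bend it.
    excess = (score - threshold) / (1.0 - threshold)
    return score + excess ** bonus_power


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: normalize rewards within one group
    of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Made-up judge scores for four sampled answers to one question.
    scores = [0.2, 0.5, 0.85, 1.0]
    rewards = [multi_grained_reward(s) for s in scores]
    for s, r, a in zip(scores, rewards, group_relative_advantages(rewards)):
        print(f"score={s:.2f} reward={r:.3f} advantage={a:+.3f}")
```

With this shape, moving from a score of 0.5 to 0.6 adds 0.1 to the reward, while moving from 0.9 to 1.0 adds considerably more, which is one way to realize a "non-linear incentive for expert-level precision" under the stated assumptions.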

Tags

e-commerce short video MLLM reasoning benchmark

arXiv Categories

cs.CV