MMSpec: Benchmarking Speculative Decoding for Vision-Language Models
AI Summary
The paper introduces MMSpec, a benchmark for evaluating the speedups of speculative decoding in vision-language models, and proposes a new method, ViSkip.
Key Contributions
- Constructed the MMSpec benchmark, containing 600 multimodal samples
- Identified the degradation of text-only LLM speculative decoding methods in multimodal scenarios
- Proposed ViSkip, a speculative decoding method that dynamically adapts to vision tokens
Methodology
Build a multimodal benchmark, evaluate multiple speculative decoding algorithms under a unified framework, analyze the results, and design an acceleration method that adapts speculation to vision tokens.
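To make the draft-then-verify idea behind speculative decoding concrete, here is a minimal sketch of a greedy acceptance loop. This is a generic illustration, not the paper's ViSkip algorithm or its benchmark code; `draft_propose` and `target_next` are hypothetical stub models standing in for a small draft model and the large VLM.

```python
def draft_propose(context, k):
    # Hypothetical cheap draft model: predicts next token as last + 1.
    out, ctx = [], list(context)
    for _ in range(k):
        nxt = ctx[-1] + 1
        out.append(nxt)
        ctx.append(nxt)
    return out

def target_next(context):
    # Hypothetical target model: same rule, except it emits 0 after token 5,
    # so it will eventually disagree with the draft.
    return 0 if context[-1] == 5 else context[-1] + 1

def speculative_step(context, k=4):
    """Verify k drafted tokens against the target model; keep the longest
    accepted prefix, then append one token from the target itself."""
    drafts = draft_propose(context, k)
    ctx = list(context)
    for t in drafts:
        if target_next(ctx) == t:   # greedy acceptance check
            ctx.append(t)
        else:
            break                   # first mismatch ends acceptance
    # The target's own prediction always yields one extra token per step.
    ctx.append(target_next(ctx))
    return ctx
```

With greedy verification like this, one target-model pass can commit several tokens at once, which is the source of the speedup; the paper's findings concern how this acceptance rate changes when the context contains vision tokens.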
Original Abstract
Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.