MMSpec: Benchmarking Speculative Decoding for Vision-Language Models
AI Summary
The paper introduces MMSpec, a benchmark for evaluating the speedups of speculative decoding in vision-language models, and proposes a new method, ViSkip.
Key Contributions
- Constructed the MMSpec benchmark, containing 600 multimodal samples
- Identified the degradation of text-only LLM speculative decoding methods in multimodal scenarios
- Proposed ViSkip, a speculative decoding method that dynamically adapts to vision tokens
Methodology
Build a multimodal benchmark, evaluate multiple speculative decoding algorithms under a unified framework, analyze the results, and design an acceleration method that adapts speculation to vision tokens.
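To make the draft-then-verify idea behind speculative decoding concrete, here is a minimal sketch of a greedy acceptance loop. This is a generic illustration, not the paper's ViSkip algorithm or its benchmark code; `draft_propose` and `target_next` are hypothetical stub models standing in for a small draft model and the large VLM.

```python
def draft_propose(context, k):
    # Hypothetical cheap draft model: predicts next token as last + 1.
    out, ctx = [], list(context)
    for _ in range(k):
        nxt = ctx[-1] + 1
        out.append(nxt)
        ctx.append(nxt)
    return out

def target_next(context):
    # Hypothetical target model: same rule, except it emits 0 after token 5,
    # so it will eventually disagree with the draft.
    return 0 if context[-1] == 5 else context[-1] + 1

def speculative_step(context, k=4):
    """Verify k drafted tokens against the target model; keep the longest
    accepted prefix, then append one token from the target itself."""
    drafts = draft_propose(context, k)
    ctx = list(context)
    for t in drafts:
        if target_next(ctx) == t:   # greedy acceptance check
            ctx.append(t)
        else:
            break                   # first mismatch ends acceptance
    # The target's own prediction always yields one extra token per step.
    ctx.append(target_next(ctx))
    return ctx
```

With greedy verification like this, one target-model pass can commit several tokens at once, which is the source of the speedup; the paper's findings concern how this acceptance rate changes when the context contains vision tokens.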
Original Abstract
Vision-language models (VLMs) achieve strong performance on multimodal tasks but suffer from high inference latency due to large model sizes and long multimodal contexts. Speculative decoding has recently emerged as an effective acceleration technique, yet its behavior in VLMs remains insufficiently understood. We introduce MMSpec, the first benchmark for evaluating speculative decoding in vision-language models. MMSpec contains 600 multimodal samples across six task categories and integrates ten representative speculative decoding algorithms under a unified evaluation framework. Our study reveals three key findings: (1) methods designed for text-only LLMs degrade in multimodal scenarios, (2) vision awareness becomes increasingly important at larger batch sizes, and (3) throughput speedup alone does not reliably reflect latency performance. Motivated by these findings, we propose ViSkip, a plug-and-play speculative decoding method that dynamically adapts speculation to vision tokens and achieves state-of-the-art performance.