HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
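AI Summary
The HeiSD framework uses hybrid speculative decoding (SD) to accelerate inference in embodied vision-language-action (VLA) models while maintaining the task success rate.
Key Contributions
- Analyzes the strengths and weaknesses of drafter-based and retrieval-based SD in VLA models
- Proposes the HeiSD framework, which includes a retrieval-based SD optimization method and a kinematic-based fused metric
- Validates HeiSD's speedup and task success rate in both simulation and real-world scenarios
Methodology
Proposes HeiSD, a hybrid speculative decoding framework that accelerates VLA model inference by optimizing retrieval-based SD and by using kinematic information to determine the hybrid boundary automatically.
Original Abstract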
Vision-Language-Action (VLA) models have become the mainstream solution for robot control, but they suffer from slow inference speeds. Speculative Decoding (SD) is a promising acceleration method that can be divided into two categories: drafter-based SD and retrieval-based SD. Existing methods fail to analyze the advantages and disadvantages of these two types of SD in VLA models, so each type has been applied or optimized only in isolation. In this paper, we analyze the trajectory patterns of robots controlled by the VLA model and derive a key insight: the two types of SD should be used in a hybrid manner. However, achieving hybrid SD in VLA models poses several challenges: (1) draft rejection and persistent errors in retrieval-based SD; (2) difficulty in determining the hybrid boundary. To address these, we propose the HeiSD framework. HeiSD includes a retrieval-based SD optimization method, which contains a verify-skip mechanism and a sequence-wise relaxed acceptance strategy. Moreover, we propose a kinematic-based fused metric in HeiSD to automatically determine the hybrid boundary. Experimental results demonstrate that HeiSD attains a speedup of up to 2.45x in simulation benchmarks and 2.06x to 2.41x in real-world scenarios, while sustaining a high task success rate.
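For readers new to SD, the draft-then-verify step that both drafter-based and retrieval-based SD share can be sketched as below. This is a generic greedy-decoding illustration, not HeiSD's implementation: the names `verify_draft` and `target_next` are hypothetical, the toy target model is invented for the example, and the verify-skip mechanism, relaxed acceptance strategy, and kinematic metric are not modeled because the abstract does not specify them.

```python
from typing import Callable, List, Sequence


def verify_draft(
    target_next: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    draft: Sequence[int],
) -> List[int]:
    """Return the tokens committed by one greedy draft-then-verify step.

    `target_next` is a hypothetical stand-in for the target model: given a
    token context it returns the model's greedy next token. A real system
    would score all draft positions in one batched forward pass; the loop
    here only mimics the acceptance rule.
    """
    committed: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: discard the rest of the draft and keep the
            # target model's own token instead.
            committed.append(expected)
            return committed
        committed.append(token)
        context.append(token)
    # Every draft token was accepted, so the verification pass also yields
    # one extra token from the target model.
    committed.append(target_next(context))
    return committed


def toy_target(context: Sequence[int]) -> int:
    """Invented toy model: the next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


# The draft could come from a small drafter model or from a retrieval
# database; the verification step is identical in both cases.
partly_right = verify_draft(toy_target, [1, 2], [3, 4, 9])
fully_right = verify_draft(toy_target, [1, 2], [3, 4, 5])
print(partly_right, fully_right)
```

In this sketch the output always matches what the target model would have produced on its own; the speedup comes from committing several tokens per verification step whenever the draft is mostly correct, which is why draft rejection matters for the retrieval-based case.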