StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation
AI Summary
Proposes StreamingVLA, which parallelizes the stages of a VLA pipeline to reduce latency and execution halting and improve efficiency.
Key Contributions
- Proposes action flow matching, eliminating the reliance on action chunking.
- Designs an adaptive observation mechanism that runs the execution and observation stages in parallel.
- Achieves substantial speedup and smoother execution without sacrificing performance.
Methodology
Adopts action flow matching to learn action trajectories, and designs an action saliency-aware adaptive observation mechanism, enabling asynchronous parallelism across the VLA stages.
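The flow-matching idea can be sketched as follows: instead of denoising a fixed-size action chunk, a velocity field is learned and integrated (here with Euler steps) to transport a noise sample into a single action. This is a minimal illustrative sketch, not the paper's implementation; `velocity_field` stands in for the learned network, and `ACTION_DIM` and `N_STEPS` are assumed values.

```python
import numpy as np

ACTION_DIM = 7   # e.g. a 7-DoF arm command (assumption)
N_STEPS = 10     # number of Euler integration steps (assumption)

def velocity_field(a, t):
    """Stand-in for the learned network: a linear field that
    transports the sample toward a fixed target action."""
    target = np.ones(ACTION_DIM)
    return target - a  # flow pointing from the current sample to the target

def generate_action(rng):
    """Integrate the flow from noise (t=0) to an action (t=1)."""
    a = rng.standard_normal(ACTION_DIM)  # start from Gaussian noise
    dt = 1.0 / N_STEPS
    for i in range(N_STEPS):
        t = i * dt
        a = a + dt * velocity_field(a, t)
    return a

rng = np.random.default_rng(0)
action = generate_action(rng)
print(action.shape)  # (7,)
```

Because each integration step emits an intermediate state of a single action rather than a whole chunk, generation of the next action can proceed while the current one is being executed.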
Original Abstract
Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, since the stages of a VLA pipeline (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across VLA stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions; this overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, thereby overlapping the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves substantial speedup and improves the fluency of execution: a 2.4$\times$ latency speedup and a 6.5$\times$ reduction in execution halting.
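The "streaming" overlap described in the abstract can be illustrated with a toy producer-consumer pipeline: observation, generation, and execution run in their own threads connected by queues, so each stage works on cycle `i+1` while the next stage handles cycle `i`. This is a hedged sketch under assumed dummy timings and stage bodies, not the paper's system.

```python
import threading, queue, time

obs_q = queue.Queue(maxsize=1)  # observation -> generation
act_q = queue.Queue(maxsize=1)  # generation -> execution
N_CYCLES = 3                    # illustrative number of control cycles

def observer():
    for i in range(N_CYCLES):
        time.sleep(0.01)            # stand-in for camera + encoder latency
        obs_q.put(f"obs{i}")

def generator():
    for _ in range(N_CYCLES):
        obs = obs_q.get()
        time.sleep(0.01)            # stand-in for action generation
        act_q.put(obs.replace("obs", "act"))

def executor(log):
    for _ in range(N_CYCLES):
        act = act_q.get()
        time.sleep(0.01)            # stand-in for motor execution
        log.append(act)

log = []
threads = [threading.Thread(target=observer),
           threading.Thread(target=generator),
           threading.Thread(target=executor, args=(log,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(log)  # ['act0', 'act1', 'act2']
```

With the stages pipelined this way, total wall-clock time approaches the duration of the slowest stage per cycle rather than the sum of all three, which is the source of the latency and halting reductions the paper reports.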