ActionCodec: What Makes for Good Action Tokenizers
AI Summary
This paper studies design principles for action tokenizers in Vision-Language-Action (VLA) models and proposes ActionCodec.
Main Contributions
- Establishes action-tokenizer design principles from the perspective of VLA optimization
- Designs ActionCodec, a high-performance action tokenizer
- Validates the effectiveness of ActionCodec on multiple benchmarks
Methodology
Grounded in information theory, the paper proposes design principles including maximizing temporal token overlap, minimizing vocabulary redundancy, enhancing multimodal mutual information, and preserving token independence, and designs ActionCodec accordingly.
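To make the first principle concrete, the sketch below shows one plausible way to measure "temporal token overlap": how many tokens survive unchanged when the action window slides forward by one timestep. The tokenizer here is a stand-in (simple uniform binning over a 1-D action trajectory), not the paper's ActionCodec, and the metric definition is an illustrative assumption.

```python
def bin_tokenize(chunk, low=-1.0, high=1.0, vocab=256):
    """Stand-in tokenizer: map each continuous action value to a bin index."""
    tokens = []
    for x in chunk:
        x = min(max(x, low), high)                  # clip to the action range
        idx = int((x - low) / (high - low) * vocab)  # uniform binning
        tokens.append(min(idx, vocab - 1))
    return tokens

def temporal_token_overlap(tokens_t, tokens_t1, shift=1):
    """Fraction of tokens preserved when the action window slides by `shift` steps."""
    tail = tokens_t[shift:]          # tokens at time t that the next window revisits
    head = tokens_t1[:len(tail)]     # corresponding tokens at time t + shift
    matches = sum(a == b for a, b in zip(tail, head))
    return matches / len(tail)

# Toy 1-D trajectory; two windows offset by one timestep.
traj = [0.1, -0.3, 0.5, 0.9, -0.7, 0.2, 0.0, -0.1, 0.4]
tok_t = bin_tokenize(traj[0:8])
tok_t1 = bin_tokenize(traj[1:9])
print(temporal_token_overlap(tok_t, tok_t1))  # per-timestep binning → 1.0
```

Per-timestep binning maximizes this overlap trivially (shifted windows retokenize the same values identically); a chunk-level compressive tokenizer would generally score lower, which is the trade-off the paper's principles speak to.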
Original Abstract
Vision-Language-Action (VLA) models leveraging the native autoregressive paradigm of Vision-Language Models (VLMs) have demonstrated superior instruction-following and training efficiency. Central to this paradigm is action tokenization, yet its design has primarily focused on reconstruction fidelity, failing to address its direct impact on VLA optimization. Consequently, the fundamental question of what makes for good action tokenizers remains unanswered. In this paper, we bridge this gap by establishing design principles specifically from the perspective of VLA optimization. We identify a set of best practices based on information-theoretic insights, including maximized temporal token overlap, minimized vocabulary redundancy, enhanced multimodal mutual information, and token independence. Guided by these principles, we introduce ActionCodec, a high-performance action tokenizer that significantly enhances both training efficiency and VLA performance across diverse simulation and real-world benchmarks. Notably, on LIBERO, a SmolVLM2-2.2B fine-tuned with ActionCodec achieves a 95.5% success rate without any robotics pre-training. With advanced architectural enhancements, this reaches 97.4%, representing a new SOTA for VLA models without robotics pre-training. We believe our established design principles, alongside the released model, will provide a clear roadmap for the community to develop more effective action tokenizers.