Multimodal Learning Relevance: 9/10

Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment

Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
arXiv: 2602.12281v1 Published: 2026-02-12 Updated: 2026-02-12

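AI Summary

Scaling test-time verification can improve vision-language-action alignment more effectively than scaling policy learning; the paper introduces CoVer, a contrastive verifier.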

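Main Contributions

  • Proposes test-time verification to improve the performance of VLA models
  • Introduces CoVer, a contrastive verifier that scales well with additional compute and data
  • Introduces "boot-time compute" and a hierarchical verification inference pipeline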

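Methodology

A VLM generates a diverse set of rephrased instructions, the policy repeatedly samples action candidates for each instruction, and a contrastive verifier selects the best prompt and action chunk.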
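The selection step described above can be sketched as a best-of-N search over (instruction, action) pairs. This is a minimal illustration, not the paper's implementation: the function name `select_with_verifier`, the `sample_action` and `score` callables, and the flat tuple used for an action chunk are all hypothetical stand-ins for a VLA policy and a trained verifier. The rephrased instructions are assumed to have been generated ahead of time (the "boot-time" step), so only sampling and scoring happen in the loop.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical representation of an action chunk: a flat tuple of floats.
Action = Tuple[float, ...]


def select_with_verifier(
    instructions: Sequence[str],
    sample_action: Callable[[str], Action],
    score: Callable[[Action], float],
    samples_per_instruction: int = 4,
) -> Tuple[str, Action, float]:
    """Return the (instruction, action, score) triple with the highest score.

    instructions: rephrasings precomputed before deployment (assumed input).
    sample_action: stand-in for a stochastic VLA policy conditioned on text.
    score: stand-in for a verifier; the caller is assumed to close over the
        current observation and the original instruction.
    """
    best: Optional[Tuple[str, Action, float]] = None
    for instruction in instructions:
        # Jointly scale both axes: several rephrasings, several samples each.
        for _ in range(samples_per_instruction):
            action = sample_action(instruction)
            value = score(action)
            if best is None or value > best[2]:
                best = (instruction, action, value)
    if best is None:
        raise ValueError("no candidates: instructions or sample count is empty")
    return best


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Toy policy and verifier, for illustration only.
    toy_policy = lambda text: (rng.uniform(-1.0, 1.0),)
    toy_verifier = lambda action: -abs(action[0] - 0.5)
    prompt, action, value = select_with_verifier(
        ["pick up the cup", "grasp the mug"], toy_policy, toy_verifier
    )
    print(prompt, action, value)
```

The total number of verifier calls is the number of instructions times `samples_per_instruction`, which is the budget that the joint scaling in the abstract refers to. In this sketch ties are resolved in favor of the earliest candidate.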

Original Abstract

The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.

Tags

Vision-Language-Action Alignment, Verification, Robotics, Instruction Following

arXiv Categories

cs.RO cs.AI eess.SY