Multimodal Learning Relevance: 9/10

LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung
arXiv: 2603.28301v1 Published: 2026-03-30 Updated: 2026-03-30

AI Summary

The LIBERO-Para benchmark tests the robustness of VLA models under instruction paraphrasing, finds significant performance degradation, and proposes the PRIDE difficulty metric.

Key Contributions

  • Constructed the LIBERO-Para benchmark for evaluating the robustness of VLA models under instruction paraphrasing.
  • Found that VLA model performance drops significantly under paraphrasing, especially for object-level variation.
  • Proposed the PRIDE metric to quantify paraphrase difficulty.

Methodology

Construct an instruction-paraphrase dataset that independently varies action expressions and object references, evaluate a range of VLA models on it, and analyze the causes and extent of the performance degradation.
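The two paraphrase axes above can be illustrated with a minimal sketch. The synonym tables, the base instruction, and the `paraphrase` helper are all hypothetical examples, not the benchmark's actual generation pipeline:

```python
# Illustrative synonym tables for the two paraphrase axes:
# object-level substitutions swap object references,
# action-level substitutions swap action expressions.
OBJECT_SYNONYMS = {"mug": ["cup"], "bowl": ["dish"]}
ACTION_SYNONYMS = {"pick up": ["grab", "lift"]}

def paraphrase(instruction: str, table: dict[str, list[str]]) -> list[str]:
    """Return every single-substitution variant of the instruction."""
    variants = []
    for term, synonyms in table.items():
        if term in instruction:
            variants.extend(instruction.replace(term, s) for s in synonyms)
    return variants

base = "pick up the mug and place it on the plate"
object_level = paraphrase(base, OBJECT_SYNONYMS)  # varies the object reference
action_level = paraphrase(base, ACTION_SYNONYMS)  # varies the action expression
```

Varying each axis independently is what lets the benchmark attribute failures to object-level versus action-level lexical change.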

Original Abstract

Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: https://github.com/cau-hai-lab/LIBERO-Para
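PRIDE's exact formulation is given in the paper; as a rough illustration of how a difficulty score might combine semantic and syntactic factors, here is a hypothetical sketch using lexical overlap as a crude semantic proxy and token-count ratio as a crude syntactic proxy (the function names, proxies, and equal weights are all assumptions):

```python
def lexical_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets: a crude proxy for semantic closeness."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def length_ratio(a: str, b: str) -> float:
    """Ratio of token counts: a crude proxy for syntactic similarity."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb)

def difficulty(original: str, paraphrase: str,
               w_sem: float = 0.5, w_syn: float = 0.5) -> float:
    """Score in [0, 1]: higher when the paraphrase diverges more."""
    sem = 1.0 - lexical_overlap(original, paraphrase)
    syn = 1.0 - length_ratio(original, paraphrase)
    return w_sem * sem + w_syn * syn
```

A difficulty-aware score like this lets evaluation distinguish models that only succeed on near-identical paraphrases from those robust to harder rewordings, which is the gap binary success rate obscures.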

Tags

Vision-Language-Action · Robotics · Paraphrase Robustness · Benchmark · Evaluation Metrics

arXiv Categories

cs.LG