Multimodal Learning · Relevance: 9/10

How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective

Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu
arXiv: 2602.20687v1 · Published: 2026-02-24 · Updated: 2026-02-24

AI Summary

Proposes NativeEmbodied, a benchmark for evaluating the skills of VLM-driven embodied agents in a native low-level action space.

Key Contributions

  • Proposes the NativeEmbodied benchmark, comprising high-level tasks in complex scenes and low-level tasks targeting fundamental skills.
  • Analyzes the deficiencies of existing VLMs in embodied-agent skills.
  • Reveals how bottlenecks in fundamental skills limit performance on high-level tasks.

Methodology

Constructs a benchmark containing both high-level and low-level tasks, runs experiments with state-of-the-art VLMs, and analyzes their performance across different skills.

Original Abstract

Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.

Tags

Embodied Intelligence · VLM · Benchmark · Skill Evaluation

arXiv Category

cs.AI