Multimodal Learning · Relevance: 9/10

BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li
arXiv: 2602.08392v1 · Published: 2026-02-09 · Updated: 2026-02-09

AI Summary

Proposes BiManiBench, a benchmark that tests MLLMs' spatial reasoning, planning, and control capabilities in bimanual manipulation.

Key Contributions

  • Proposed BiManiBench, a benchmark for bimanual manipulation
  • Evaluated the performance of MLLMs on bimanual tasks
  • Revealed MLLMs' deficiencies in bimanual spatial reasoning and control

Methodology

Constructs a hierarchical benchmark spanning three tiers (spatial reasoning, action planning, and end-effector control) to evaluate how MLLMs perform on bimanual tasks.
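The tiered structure described above can be sketched as a minimal evaluation harness. This is an illustrative assumption about how such a benchmark might be organized, not the paper's actual API: the tier names follow the abstract, but `Task`, `check`, and the scoring scheme are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Tier(Enum):
    SPATIAL_REASONING = 1  # fundamental spatial reasoning
    ACTION_PLANNING = 2    # high-level action planning
    EE_CONTROL = 3         # low-level end-effector control

@dataclass
class Task:
    tier: Tier
    prompt: str
    check: Callable[[str], bool]  # judges whether a model answer is correct

def evaluate(model: Callable[[str], str], tasks: list[Task]) -> dict[Tier, float]:
    """Compute per-tier accuracy of a model over a hierarchical task set."""
    scores: dict[Tier, list[bool]] = {t: [] for t in Tier}
    for task in tasks:
        scores[task.tier].append(task.check(model(task.prompt)))
    # Tiers with no tasks score 0.0 rather than dividing by zero.
    return {t: sum(v) / len(v) if v else 0.0 for t, v in scores.items()}

# Toy usage: a stub "model" that always answers "left arm".
tasks = [
    Task(Tier.SPATIAL_REASONING,
         "Which arm can reach the pot handle?",
         lambda a: a == "left arm"),
    Task(Tier.ACTION_PLANNING,
         "Order the grasp steps.",
         lambda a: "grasp" in a),
]
result = evaluate(lambda prompt: "left arm", tasks)
```

Reporting accuracy per tier rather than a single aggregate score is what lets a framework like this separate perceptual failures (Tier 1) from planning or control failures (Tiers 2 and 3), as the abstract emphasizes.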

Original Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.

Tags

MLLM · Bimanual Manipulation · Benchmark · Robotics

arXiv Categories

cs.RO cs.AI cs.CV