Multimodal Learning · Relevance: 9/10

BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li
arXiv: 2602.08392v1 · Published: 2026-02-09 · Updated: 2026-02-09

AI Summary

Proposes BiManiBench, a benchmark that tests MLLMs' spatial reasoning, planning, and control capabilities in bimanual manipulation.

Key Contributions

  • Proposed BiManiBench, a benchmark for bimanual manipulation
  • Evaluated the performance of MLLMs on bimanual tasks
  • Revealed MLLMs' deficiencies in bimanual spatial reasoning and control

Methodology

Constructs a hierarchical benchmark spanning three tiers (spatial reasoning, action planning, and end-effector control) to evaluate how MLLMs perform on bimanual tasks.
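The tiered structure described above can be sketched as a minimal evaluation harness. This is an illustrative assumption about how such a benchmark might be organized, not the paper's actual API: the tier names follow the abstract, but `Task`, `check`, and the scoring scheme are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

class Tier(Enum):
    SPATIAL_REASONING = 1  # fundamental spatial reasoning
    ACTION_PLANNING = 2    # high-level action planning
    EE_CONTROL = 3         # low-level end-effector control

@dataclass
class Task:
    tier: Tier
    prompt: str
    check: Callable[[str], bool]  # judges whether a model answer is correct

def evaluate(model: Callable[[str], str], tasks: list[Task]) -> dict[Tier, float]:
    """Compute per-tier accuracy of a model over a hierarchical task set."""
    scores: dict[Tier, list[bool]] = {t: [] for t in Tier}
    for task in tasks:
        scores[task.tier].append(task.check(model(task.prompt)))
    # Tiers with no tasks score 0.0 rather than dividing by zero.
    return {t: sum(v) / len(v) if v else 0.0 for t, v in scores.items()}

# Toy usage: a stub "model" that always answers "left arm".
tasks = [
    Task(Tier.SPATIAL_REASONING,
         "Which arm can reach the pot handle?",
         lambda a: a == "left arm"),
    Task(Tier.ACTION_PLANNING,
         "Order the grasp steps.",
         lambda a: "grasp" in a),
]
result = evaluate(lambda prompt: "left arm", tasks)
```

Reporting accuracy per tier rather than a single aggregate score is what lets a framework like this separate perceptual failures (Tier 1) from planning or control failures (Tiers 2 and 3), as the abstract emphasizes.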

Original Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.

Tags

MLLM · Bimanual Manipulation · Benchmark · Robotics

arXiv Categories

cs.RO cs.AI cs.CV