Multimodal Learning Relevance: 9/10

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

Zhuchenyang Liu, Yao Zhang, Yu Xiao
arXiv: 2604.00913v1 Published: 2026-04-01 Updated: 2026-04-01

AI Summary

This paper systematically evaluates how well VLMs perform on the cross-depiction assembly-instruction alignment task and analyzes the factors that drive performance.

Key Contributions

  • Constructed the IKEA-Bench benchmark dataset
  • Evaluated the performance of a range of VLMs on the assembly-instruction alignment task
  • Analyzed key factors behind alignment performance, such as visual encoding

Methodology

Constructed a benchmark dataset spanning multiple task types, evaluated a variety of VLMs under different alignment strategies, and performed a multi-level mechanistic analysis to probe how visual and textual information each affect alignment.

Original Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/

Tags

Vision-Language Models  Cross-Depiction Alignment  Assembly Instructions  Benchmarking

arXiv Categories

cs.CV cs.CL