Multimodal Learning Relevance: 9/10

HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Jiayue Pu, Zhongxiang Sun, Zilu Zhang, Xiao Zhang, Jun Xu
arXiv: 2603.11975v1 Published: 2026-03-12 Updated: 2026-03-12

AI Summary

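Introduces HomeSafe-Bench, a benchmark for evaluating how well VLMs detect unsafe actions in household environments, and proposes HD-Guard, an efficient detection architecture.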

Main Contributions

  • Introduces the HomeSafe-Bench benchmark
  • Designs the Hierarchical Dual-Brain Guard (HD-Guard) architecture
  • Analyzes the bottlenecks of existing VLMs in safety detection

Methodology

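Builds a hybrid pipeline that combines physical simulation with video generation to create the HomeSafe-Bench dataset; designs the hierarchical HD-Guard architecture, which coordinates FastBrain and SlowBrain.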
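
As a rough illustration of the FastBrain/SlowBrain coordination described above, the sketch below screens every frame with a cheap check and hands only suspicious frames to a slower asynchronous reviewer. It is not the authors' implementation: the `Frame` type, the functions `fast_brain`, `slow_brain`, and `monitor`, the thresholds, and the numeric `risk_hint` values are all hypothetical stand-ins for real models and video input.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int
    risk_hint: float  # hypothetical stand-in for the frame's visual content


def fast_brain(frame: Frame) -> float:
    """Hypothetical lightweight screener: returns a cheap risk score in [0, 1]."""
    return frame.risk_hint


async def slow_brain(frame: Frame) -> bool:
    """Hypothetical heavy reasoner, simulated by a delay and a stricter threshold."""
    await asyncio.sleep(0.01)  # stands in for slow multimodal inference
    return frame.risk_hint >= 0.8


async def monitor(frames: List[Frame], escalate_at: float = 0.5) -> List[int]:
    """Screen every frame quickly; escalate suspicious ones without blocking the stream."""
    alerts: List[int] = []
    pending = []

    async def review(frame: Frame) -> None:
        if await slow_brain(frame):
            alerts.append(frame.index)

    for frame in frames:
        if fast_brain(frame) >= escalate_at:
            # Escalation runs as a background task, so screening continues.
            pending.append(asyncio.create_task(review(frame)))
        await asyncio.sleep(0)  # yield control so background reviews can progress

    await asyncio.gather(*pending)  # wait for outstanding reviews before reporting
    return sorted(alerts)


def run_demo() -> List[int]:
    scores = [0.1, 0.6, 0.9, 0.2, 0.85]
    frames = [Frame(i, s) for i, s in enumerate(scores)]
    return asyncio.run(monitor(frames))
```

In this toy run, frames 1, 2, and 4 pass the fast screen, and the slow reviewer confirms only frames 2 and 4. The point of the split is that the expensive check runs on a small subset of frames and never stalls the per-frame loop.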

Original Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and lack of common sense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline combining physical simulation with advanced video generation and features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, effectively balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.

Tags

Vision-Language Models, Embodied Agents, Unsafe Action Detection, Benchmark

arXiv Categories

cs.CV cs.AI cs.CR