Multimodal Learning Relevance: 9/10

Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection

Yao Gu, Xiaohao Xu, Yingna Wu
arXiv: 2603.15237v1 Published: 2026-03-16 Updated: 2026-03-16

AI Summary

Proposes a physics-informed, multi-turn-dialogue vision-language model that substantially improves physics-grounded anomaly detection performance.

Key Contributions

  • Proposes a physics-informed instruction-tuning framework
  • Introduces multi-turn dialogues to decompose causal reasoning
  • Substantially improves AUROC on physics-grounded anomaly detection

Methodology

Physical priors are encoded into the VLM through multi-turn dialogues that decompose causal reasoning into incremental steps, strengthening the VLM's internal representations of normal and abnormal dynamics.
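A minimal sketch of how such a multi-turn dialogue might be assembled, assuming the common chat-API message format (role/content dictionaries). The helper name, the three-turn structure, and all prompt wording here are illustrative assumptions, not the paper's actual templates:

```python
# Hypothetical sketch: delivering physical priors (object properties,
# motion paradigm, dynamic constraints) to a VLM as incremental turns,
# so that causal reasoning is decomposed step by step.

def build_physics_dialogue(object_desc, motion_paradigm, constraints):
    """Return a list of chat messages that stage causal reasoning."""
    return [
        {"role": "system",
         "content": "You are an anomaly detector grounded in physics."},
        # Turn 1: establish object properties before any judgment.
        {"role": "user",
         "content": f"The object in the video is: {object_desc}. "
                    "Describe its expected physical behavior."},
        # Turn 2: inject the normal motion paradigm as a prior.
        {"role": "user",
         "content": f"Its normal motion paradigm is: {motion_paradigm}. "
                    "Which kinematic quantities stay regular under it?"},
        # Turn 3: state dynamic constraints, then ask for the verdict.
        {"role": "user",
         "content": "Constraints that must hold: " + "; ".join(constraints)
                    + ". Does the observed motion violate any of them? "
                      "Answer 'normal' or 'anomalous' and give the cause."},
    ]

dialogue = build_physics_dialogue(
    object_desc="a ceiling fan",
    motion_paradigm="uniform rotation about a fixed axis",
    constraints=["constant angular velocity", "no wobble of the hub"],
)
print(len(dialogue))  # 4 messages: one system turn plus three user turns
```

Each turn builds on the previous one, so the model commits to the expected dynamics before it is asked to judge the observed motion.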

Original Abstract

Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.
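The video-level AUROC quoted above can be understood as a rank statistic: the probability that a randomly chosen anomalous video receives a higher anomaly score than a randomly chosen normal one. A small self-contained sketch (the scores and labels are made-up toy values, not Phys-AD data):

```python
# AUROC via the Mann-Whitney pairwise comparison: count the fraction of
# (anomalous, normal) pairs ranked correctly, with ties worth half.

def auroc(labels, scores):
    """labels: 1 = anomalous, 0 = normal; scores: higher = more anomalous."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.8, 0.3, 0.2]
print(auroc(labels, scores))  # 5 of 6 pairs ordered correctly -> ~0.833
```

A perfect detector scores 1.0 and a random one 0.5, so the jump from 66.9% to 96.7% reported in the abstract closes most of the remaining gap to perfect ranking.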

Tags

Vision-Language Models · Physics-grounded Anomaly Detection · Instruction Tuning · Multi-turn Dialogue · Physics-informed

arXiv Categories

cs.CV