Multi-turn Physics-informed Vision-language Model for Physics-grounded Anomaly Detection
AI Summary
Proposes a physics-informed, multi-turn dialogue vision-language model that substantially improves physics-grounded anomaly detection performance.
Key Contributions
- Proposes a physics-informed instruction tuning framework
- Introduces multi-turn dialogues to decompose causal reasoning
- Substantially improves AUROC on physics-grounded anomaly detection
Methodology
Physical priors are encoded into the VLM through multi-turn dialogues that decompose causal reasoning into incremental steps, strengthening the VLM's internal representations of normal and abnormal dynamics.
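To make the idea concrete, here is a minimal sketch of how physical priors (object properties, motion paradigm, dynamic constraints) could be delivered as incremental turns in an OpenAI-style chat message list. The function name, field names, and prior categories are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch: packaging physical priors as incremental dialogue
# turns before the final anomaly question. The message schema follows the
# common {"role", "content"} chat format; all strings are illustrative.

def build_physics_dialogue(object_props, motion_paradigm, constraints, question):
    """Decompose causal reasoning into steps: properties -> motion
    paradigm -> dynamic constraints -> anomaly query."""
    return [
        {"role": "system",
         "content": "You are a physics-grounded video anomaly detector."},
        # Turn 1: static object properties
        {"role": "user", "content": f"Object properties: {object_props}"},
        # Turn 2: the expected (normal) motion paradigm
        {"role": "user", "content": f"Normal motion paradigm: {motion_paradigm}"},
        # Turn 3: dynamic constraints that normal motion must satisfy
        {"role": "user",
         "content": "Dynamic constraints: " + "; ".join(constraints)},
        # Final turn: the detection question itself
        {"role": "user", "content": question},
    ]

dialogue = build_physics_dialogue(
    object_props="rigid fan with three blades",
    motion_paradigm="uniform clockwise rotation about a fixed axis",
    constraints=["constant angular velocity", "no axis wobble"],
    question="Does the observed motion violate any stated constraint?",
)
```

In this sketch each prior arrives in its own turn, so the model conditions on properties before motion and on motion before constraints, mirroring the paper's incremental decomposition of causal reasoning.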
Original Abstract
Vision-Language Models (VLMs) demonstrate strong general-purpose reasoning but remain limited in physics-grounded anomaly detection, where causal understanding of dynamics is essential. Existing VLMs, trained predominantly on appearance-centric correlations, fail to capture kinematic constraints, leading to poor performance on anomalies such as irregular rotations or violated mechanical motions. We introduce a physics-informed instruction tuning framework that explicitly encodes object properties, motion paradigms, and dynamic constraints into structured prompts. By delivering these physical priors through multi-turn dialogues, our method decomposes causal reasoning into incremental steps, enabling robust internal representations of normal and abnormal dynamics. Evaluated on the Phys-AD benchmark, our approach achieves 96.7% AUROC in video-level detection--substantially outperforming prior SOTA (66.9%)--and yields superior causal explanations (0.777 LLM score). This work highlights how structured physics priors can transform VLMs into reliable detectors of dynamic anomalies.