BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment
AI Summary
BioGait-VLM fuses visual, language, and biomechanical information to improve the generalization and interpretability of clinical gait analysis.
Main Contributions
- Proposes BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework
- Introduces a Temporal Evidence Distillation branch and a Biomechanical Tokenization branch
- Builds a unified gait dataset including a DCM cohort and establishes a strict evaluation protocol
Methodology
Using Temporal Evidence Distillation and Biomechanical Tokenization, the framework projects 3D skeleton sequences into semantic tokens, enabling reasoning about joint mechanics independent of visual cues.
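The paper does not spell out the tokenization details in this summary; a minimal sketch of what such a biomechanical tokenization branch might look like, assuming per-frame pose flattening, segment-wise temporal pooling over the gait cycle, and a learned linear projection into the language model's embedding space (the function name, shapes, and projection are all hypothetical):

```python
import numpy as np

def biomechanical_tokenize(skeleton_seq, proj, n_tokens=8):
    """Project a 3D skeleton sequence into a fixed set of semantic tokens.

    skeleton_seq: (T, J, 3) array of T frames, J joints, 3D coordinates.
    proj: (J*3, d_model) learned projection matrix (random here for illustration).
    Returns: (n_tokens, d_model) token embeddings.
    """
    T, J, _ = skeleton_seq.shape
    flat = skeleton_seq.reshape(T, J * 3)            # per-frame pose vector
    # Pool frames into n_tokens temporal segments (e.g. phases of a gait cycle)
    segments = np.array_split(flat, n_tokens, axis=0)
    pooled = np.stack([s.mean(axis=0) for s in segments])  # (n_tokens, J*3)
    return pooled @ proj                             # (n_tokens, d_model)

rng = np.random.default_rng(0)
T, J, d_model = 60, 17, 32                           # hypothetical sizes
seq = rng.standard_normal((T, J, 3))
W = rng.standard_normal((J * 3, d_model)) * 0.02
tokens = biomechanical_tokenize(seq, W)
print(tokens.shape)  # (8, 32)
```

These tokens could then be interleaved with video and text tokens in the language model's input, which is what would let the model ground its answers in joint mechanics rather than visual appearance.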
Original Abstract
Video-based Clinical Gait Analysis often suffers from poor generalization as models overfit environmental biases instead of capturing pathological motion. To address this, we propose BioGait-VLM, a tri-modal Vision-Language-Biomechanics framework for interpretable clinical gait assessment. Unlike standard video encoders, our architecture incorporates a Temporal Evidence Distillation branch to capture rhythmic dynamics and a Biomechanical Tokenization branch that projects 3D skeleton sequences into language-aligned semantic tokens. This enables the model to explicitly reason about joint mechanics independent of visual shortcuts. To ensure rigorous benchmarking, we augment the public GAVD dataset with a high-fidelity Degenerative Cervical Myelopathy (DCM) cohort to form a unified 8-class taxonomy, establishing a strict subject-disjoint protocol to prevent data leakage. Under this setting, BioGait-VLM achieves state-of-the-art recognition accuracy. Furthermore, a blinded expert study confirms that biomechanical tokens significantly improve clinical plausibility and evidence grounding, offering a path toward transparent, privacy-enhanced gait assessment.