Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
AI Summary
The paper investigates local vision-language models (VLMs) for activity recognition in newborn resuscitation videos, surpassing ViT-based results.
Main Contributions
- Explores the potential of local VLMs for activity recognition in newborn resuscitation
- Fine-tunes VLMs with LoRA, substantially improving the activity-recognition F1 score
- Compares VLMs against a TimeSformer baseline on newborn resuscitation activity recognition
Methodology
Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, the authors evaluate zero-shot VLM strategies and LoRA fine-tuned VLMs on activity recognition.
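The summary does not include code, so here is a minimal sketch of what a zero-shot VLM classification step could look like, assuming a HuggingFace LLaVA-style local VLM. The model ID, prompt wording, and activity label set are illustrative assumptions, not details taken from the paper.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # illustrative local VLM; the paper's model is not named here
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical closed label set; the paper's actual activity classes may differ.
ACTIVITIES = ["ventilation", "suction", "stimulation", "no visible activity"]

PROMPT = (
    "USER: <image>\n"
    "Which newborn resuscitation activity is shown: "
    + ", ".join(ACTIVITIES)
    + "? Answer with one label only.\nASSISTANT:"
)

def classify_frame(frame: Image.Image) -> str:
    """Classify one video frame by prompting the VLM with a closed label set."""
    inputs = processor(images=frame, text=PROMPT, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=10)
    answer = processor.decode(output[0], skip_special_tokens=True)
    answer = answer.split("ASSISTANT:")[-1].strip().lower()
    # Map the free-form generation back onto the label set; anything else counts
    # as a miss, consistent with the hallucination issue the paper reports for
    # small zero-shot VLMs.
    return next((label for label in ACTIVITIES if label in answer), "unparsed")
```

In a full pipeline, frames would be sampled along the video timeline and per-frame predictions aggregated into activity intervals; the aggregation scheme is not specified in this summary.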
Original Abstract
Accurate documentation of newborn resuscitation is essential for quality improvement and adherence to clinical guidelines, yet remains underutilized in practice. Previous work using 3D-CNNs and Vision Transformers (ViT) has shown promising results in detecting key activities from newborn resuscitation videos, but also highlighted the challenges in recognizing such fine-grained activities. This work investigates the potential of generative AI (GenAI) methods to improve activity recognition from such videos. Specifically, we explore the use of local vision-language models (VLMs), combined with large language models (LLMs), and compare them to a supervised TimeSformer baseline. Using a simulated dataset comprising 13.26 hours of newborn resuscitation videos, we evaluate several zero-shot VLM-based strategies and fine-tuned VLMs with classification heads, including Low-Rank Adaptation (LoRA). Our results suggest that small (local) VLMs struggle with hallucinations, but when fine-tuned with LoRA, they reach an F1 score of 0.91, surpassing the TimeSformer baseline of 0.70.
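For the fine-tuned variant, the following is a rough sketch of LoRA fine-tuning with a classification head using the peft library. The backbone (a CLIP vision encoder here, for brevity), LoRA rank, target modules, and label count are assumptions for illustration; the paper fine-tunes a (larger) VLM and does not report its exact configuration in this summary.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel
from peft import LoraConfig, get_peft_model

# Illustrative vision backbone; stand-in for the paper's fine-tuned VLM.
backbone = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# LoRA adapters on the attention projections; rank and targets are assumptions.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
backbone = get_peft_model(backbone, lora_cfg)  # freezes base weights, adds adapters

class ActivityClassifier(nn.Module):
    """LoRA-adapted backbone with a linear classification head over activities."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(feats)

# 4 classes matches the illustrative label set above; 768 is this CLIP's hidden size.
model = ActivityClassifier(backbone, hidden_dim=768, num_classes=4)

# Only the LoRA adapters and the head carry gradients; the backbone stays frozen.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(pixel_values), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the abstract, it is this LoRA-plus-classification-head setup that lifts the F1 score to 0.91, against the TimeSformer baseline's 0.70.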