Multimodal Learning Relevance: 9/10

SurgTEMP: Temporal-Aware Surgical Video Question Answering with Text-guided Visual Memory for Laparoscopic Cholecystectomy

Shi Li, Vinkle Srivastav, Nicolas Chanel, Saurav Sharma, Nabani Banik, Lorenzo Arboit, Kun Yuan, Pietro Mascagni, Nicolas Padoy
arXiv: 2603.29962v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

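SurgTEMP improves surgical video question answering, in both temporal-semantic understanding and multi-task evaluation, through hierarchical visual memory and Surgical Competency Progression (SCP) training.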

Key Contributions

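  • Proposes the SurgTEMP framework, which combines query-guided token selection with Surgical Competency Progression (SCP) training.
  • Builds the CholeVidQA-32K dataset, comprising 32K open-ended QA pairs and 3,855 video segments from laparoscopic cholecystectomy.
  • Reports substantial performance improvements on surgical video question answering over state-of-the-art open-source multimodal and video LLMs, both fine-tuned and zero-shot.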

Methodology

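SurgTEMP uses query-guided token selection to build a hierarchical visual memory (spatial and temporal memory banks) and applies a Surgical Competency Progression (SCP) training scheme to improve model performance.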
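
As a rough illustration of what query-guided token selection could look like, the sketch below scores visual tokens against a text query embedding and keeps the top-scoring ones. It is not the authors' implementation: the cosine-similarity scoring, the mean-pooled frame vectors, the top-k rule, and every name (select_tokens, build_memory, k_spatial, k_temporal) are assumptions made for this example, since the summary does not describe the actual mechanism.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tokens(query_vec, tokens, k):
    """Indices of the k tokens most similar to the query, returned in original order."""
    ranked = sorted(
        range(len(tokens)),
        key=lambda i: cosine(query_vec, tokens[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


def build_memory(query_vec, frames, k_spatial, k_temporal):
    """Hypothetical two-level memory: `frames` is a list of frames, each a list of token vectors."""
    # Spatial bank (assumed): within each frame, keep the k_spatial best-matching tokens.
    spatial = {
        t: select_tokens(query_vec, frame, k_spatial)
        for t, frame in enumerate(frames)
    }
    # Temporal bank (assumed): mean-pool each frame, then keep the k_temporal best frames.
    frame_vecs = [
        [sum(column) / len(frame) for column in zip(*frame)]
        for frame in frames
    ]
    temporal = select_tokens(query_vec, frame_vecs, k_temporal)
    return {"spatial": spatial, "temporal": temporal}


# Toy example: three frames, two 2-D tokens each, and a query along the first axis.
query = [1.0, 0.0]
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.1]],
]
memory = build_memory(query, frames, k_spatial=1, k_temporal=2)
```

Returning the selected indices in their original order keeps the retained frames in temporal sequence, which is one simple way to preserve the temporal coherence that the abstract emphasizes.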

Original Abstract

Surgical procedures are inherently complex and risky, requiring extensive expertise and constant focus to well navigate evolving intraoperative scenes. Computer-assisted systems such as surgical visual question answering (VQA) offer promises for education and intraoperative support. Current surgical VQA research largely focuses on static frame analysis, overlooking rich temporal semantics. Surgical video question answering is further challenged by low visual contrast, its highly knowledge-driven nature, diverse analytical needs spanning scattered temporal windows, and the hierarchy from basic perception to high-level intraoperative assessment. To address these challenges, we propose SurgTEMP, a multimodal LLM framework featuring (i) a query-guided token selection module that builds hierarchical visual memory (spatial and temporal memory banks) and (ii) a Surgical Competency Progression (SCP) training scheme. Together, these components enable effective modeling of variable-length surgical videos while preserving procedure-relevant cues and temporal coherence, and better support diverse downstream assessment tasks. To support model development, we introduce CholeVidQA-32K, a surgical video question answering dataset comprising 32K open-ended QA pairs and 3,855 video segments (approximately 128 h total) from laparoscopic cholecystectomy. The dataset is organized into a three-level hierarchy -- Perception, Assessment, and Reasoning -- spanning 11 tasks from instrument/action/anatomy perception to Critical View of Safety (CVS), intraoperative difficulty, skill proficiency, and adverse event assessment. In comprehensive evaluations against state-of-the-art open-source multimodal and video LLMs (fine-tuned and zero-shot), SurgTEMP achieves substantial performance improvements, advancing the state of video-based surgical VQA.

Tags

Surgical video, Visual question answering, Multimodal learning, Temporal semantics

arXiv Categories

cs.CV