Multimodal Learning Relevance: 8/10

Scaling Video Pretraining for Surgical Foundation Models

Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
arXiv: 2603.29966v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

SurgRec proposes a scalable and reproducible pretraining framework for surgical videos that improves surgical video understanding.

Key Contributions

  • Curated a large-scale surgical video dataset
  • Proposed a unified pretraining pipeline
  • Established a reproducible evaluation benchmark

Methodology

Two variants, SurgRec-MAE and SurgRec-JEPA, are pretrained on large-scale multi-source surgical video data with balanced sampling, and validated on downstream tasks.
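Balanced sampling across heterogeneous sources is typically done by reweighting each source away from its natural frequency. A minimal sketch, assuming temperature-style reweighting; the source names, video counts, and exponent below are illustrative, not values from the paper:

```python
import random

# Hypothetical per-source video counts (illustrative only, not the
# paper's actual corpus statistics).
corpus = {
    "endoscopy": 5000,
    "laparoscopy": 3000,
    "cataract": 1500,
    "robotic": 1035,
}

def balanced_weights(counts, alpha=0.5):
    """Temperature-style balancing: sample source s with probability
    proportional to n_s ** alpha. alpha=1 reproduces the natural
    frequency; alpha=0 samples sources uniformly."""
    scaled = {s: n ** alpha for s, n in counts.items()}
    total = sum(scaled.values())
    return {s: w / total for s, w in scaled.items()}

def sample_source(counts, alpha=0.5, rng=random):
    """Draw one source according to the balanced weights."""
    weights = balanced_weights(counts, alpha)
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]
```

Intermediate values of `alpha` trade off between letting the largest source dominate (natural sampling) and over-exposing small sources (uniform sampling).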

Original Abstract

Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
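The MAE variant in the abstract rests on masked reconstruction: most patch tokens are hidden and the encoder sees only the visible remainder. A minimal sketch of the random masking step; the mask ratio is an assumption typical of video MAE setups, not a value reported by the paper:

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.9, rng=random):
    """MAE-style random masking over a sequence of video patch tokens.
    Returns (visible_idx, masked_idx); the encoder processes only the
    visible indices, and the decoder reconstructs the masked ones.
    mask_ratio=0.9 is a common choice for video, assumed here."""
    num_masked = int(num_patches * mask_ratio)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    return sorted(idx[num_masked:]), sorted(idx[:num_masked])
```

A JEPA-style variant would keep the same masking but predict latent features of the masked region rather than raw pixels.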

Tags

Surgical Video Understanding Pretraining Self-Supervised Learning

arXiv Categories

cs.CV