Multimodal Learning Relevance: 8/10

Scaling Video Pretraining for Surgical Foundation Models

Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
arXiv: 2603.29966v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

SurgRec proposes a scalable and reproducible pretraining framework for surgical videos that improves surgical video understanding.

Key Contributions

  • Curated a large-scale surgical video dataset
  • Proposed a unified pretraining pipeline
  • Established a reproducible evaluation benchmark

Methodology

Two variants, SurgRec-MAE and SurgRec-JEPA, are pretrained on large-scale multi-source surgical video data with balanced sampling, and validated on downstream tasks.
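Balanced sampling across heterogeneous sources is typically done by reweighting each source away from its natural frequency. A minimal sketch, assuming temperature-style reweighting; the source names, video counts, and exponent below are illustrative, not values from the paper:

```python
import random

# Hypothetical per-source video counts (illustrative only, not the
# paper's actual corpus statistics).
corpus = {
    "endoscopy": 5000,
    "laparoscopy": 3000,
    "cataract": 1500,
    "robotic": 1035,
}

def balanced_weights(counts, alpha=0.5):
    """Temperature-style balancing: sample source s with probability
    proportional to n_s ** alpha. alpha=1 reproduces the natural
    frequency; alpha=0 samples sources uniformly."""
    scaled = {s: n ** alpha for s, n in counts.items()}
    total = sum(scaled.values())
    return {s: w / total for s, w in scaled.items()}

def sample_source(counts, alpha=0.5, rng=random):
    """Draw one source according to the balanced weights."""
    weights = balanced_weights(counts, alpha)
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]
```

Intermediate values of `alpha` trade off between letting the largest source dominate (natural sampling) and over-exposing small sources (uniform sampling).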

Original Abstract

Surgical video understanding is essential for computer-assisted interventions, yet existing surgical foundation models remain constrained by limited data scale, procedural diversity, and inconsistent evaluation, often lacking a reproducible training pipeline. We propose SurgRec, a scalable and reproducible pretraining recipe for surgical video understanding, instantiated with two variants: SurgRec-MAE and SurgRec-JEPA. We curate a large multi-source corpus of 10,535 videos and 214.5M frames spanning endoscopy, laparoscopy, cataract, and robotic surgery. Building on this corpus, we develop a unified pretraining pipeline with balanced sampling and standardize a reproducible benchmark across 16 downstream datasets and four clinical domains with consistent data splits. Across extensive comparisons against SSL baselines and vision-language models, SurgRec consistently achieves superior performance across downstream datasets. In contrast, VLMs prove unreliable for fine-grained temporal recognition, exhibiting both performance gaps and sensitivity to prompt phrasing. Our work provides a reproducible, scalable foundation for the community to build more general surgical video models. All code, models, and data will be publicly released.
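The MAE variant in the abstract rests on masked reconstruction: most patch tokens are hidden and the encoder sees only the visible remainder. A minimal sketch of the random masking step; the mask ratio is an assumption typical of video MAE setups, not a value reported by the paper:

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.9, rng=random):
    """MAE-style random masking over a sequence of video patch tokens.
    Returns (visible_idx, masked_idx); the encoder processes only the
    visible indices, and the decoder reconstructs the masked ones.
    mask_ratio=0.9 is a common choice for video, assumed here."""
    num_masked = int(num_patches * mask_ratio)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    return sorted(idx[num_masked:]), sorted(idx[:num_masked])
```

A JEPA-style variant would keep the same masking but predict latent features of the masked region rather than raw pixels.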

Tags

Surgical Video Understanding Pretraining Self-Supervised Learning

arXiv Categories

cs.CV