annbatch unlocks terabyte-scale training of biological data in anndata
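AI Summary
annbatch speeds up machine-learning training on large biological datasets by removing the data-loading bottleneck, which improves overall training efficiency.
Key Contributions
- Data loading optimized for the anndata format
- Faster machine-learning training on biological data, with loading throughput up to an order of magnitude higher
- Compatibility with the scverse ecosystem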
Methodology
The authors develop annbatch, an anndata-native mini-batch loader that enables out-of-core training directly on disk-backed datasets.
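To make the idea of out-of-core mini-batch loading concrete, here is a minimal sketch using only the Python standard library. It is not annbatch's API or implementation: the flat binary file layout, the names `write_rows`, `iter_minibatches` and `demo`, and the chunk-then-shuffle strategy are all assumptions chosen for illustration. The point it shows is that the loader reads a few contiguous chunks from disk at a time, so memory use is bounded by the chunk and batch sizes rather than by the dataset size.

```python
import os
import random
import struct
import tempfile

# Hypothetical on-disk layout: each row ("cell") is 4 little-endian float32 values.
ROW_FMT = "<4f"
ROW_SIZE = struct.calcsize(ROW_FMT)


def write_rows(path, rows):
    """Write rows to a flat binary file (a stand-in for a disk-backed dataset)."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(ROW_FMT, *row))


def iter_minibatches(path, n_rows, chunk_size, batch_size, seed=0):
    """Yield shuffled mini-batches without loading the whole file.

    Chunks are visited in random order and rows are shuffled inside an
    in-memory buffer that never holds more than
    chunk_size + batch_size - 1 rows.
    """
    rng = random.Random(seed)
    chunk_starts = list(range(0, n_rows, chunk_size))
    rng.shuffle(chunk_starts)
    buffer = []
    with open(path, "rb") as fh:
        for start in chunk_starts:
            stop = min(start + chunk_size, n_rows)
            fh.seek(start * ROW_SIZE)  # jump straight to the chunk on disk
            raw = fh.read((stop - start) * ROW_SIZE)
            buffer.extend(struct.iter_unpack(ROW_FMT, raw))
            rng.shuffle(buffer)
            while len(buffer) >= batch_size:
                yield buffer[:batch_size]
                del buffer[:batch_size]
    if buffer:  # final partial batch
        yield buffer


def demo():
    """Build a tiny file of 10 rows and read it back in batches of 3."""
    n_rows = 10
    rows = [(float(i), 0.0, 0.0, 0.0) for i in range(n_rows)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "cells.bin")
        write_rows(path, rows)
        return list(iter_minibatches(path, n_rows, chunk_size=4, batch_size=3))
```

Reading contiguous chunks keeps disk access mostly sequential, while shuffling chunk order and buffer contents approximates random sampling. A real loader would additionally need to handle sparse matrices, metadata and parallel prefetching, which this sketch leaves out.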
Original Abstract
The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. GitHub: https://github.com/scverse/annbatch