LLM Reasoning relevance: 6/10

Sparse autoencoders reveal organized biological knowledge but minimal regulatory logic in single-cell foundation models: a comparative atlas of Geneformer and scGPT

Ihor Kendiukhov
arXiv: 2603.02952v1 Published: 2026-03-03 Updated: 2026-03-03

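AI Summary

Uses sparse autoencoders to analyze single-cell foundation models, revealing organized internal biological knowledge but little regulatory logic.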

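Key Contributions

  • Builds feature atlases for Geneformer and scGPT, revealing massive superposition
  • Confirms that the models internally encode rich biological knowledge, such as pathway membership and protein interactions
  • Finds that the models encode only limited causal regulatory logic, with the model representations themselves acting as the bottleneck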

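Methodology

Trains TopK sparse autoencoders to decompose the residual stream activations of Geneformer and scGPT, characterizes the resulting features with annotations such as GO, and evaluates regulatory logic against CRISPRi perturbation data.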
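
To make the TopK decomposition step concrete, here is a minimal NumPy sketch of a TopK sparse autoencoder forward pass. It is an illustration under stated assumptions, not the paper's implementation: the function name `topk_sae_forward`, the dictionary size `n_features = 2048`, the sparsity `k = 32`, the tied random weights, and the random input are all hypothetical. Only the activation width `d_model = 512` is taken from the abstract (scGPT).

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations per row, then decode.

    x: (batch, d_model) residual stream activations.
    Returns (codes, recon) with shapes (batch, n_features) and (batch, d_model).
    """
    # Encoder pre-activations, centred on the decoder bias (a common convention).
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row (unordered).
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Everything outside the top k is zeroed; kept values pass through a ReLU.
    codes = np.zeros_like(pre)
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Linear decoder reconstructs the activation from the sparse code.
    recon = codes @ W_dec + b_dec
    return codes, recon

# Hypothetical sizes: only d_model = 512 comes from the abstract.
rng = np.random.default_rng(0)
d_model, n_features, k = 512, 2048, 32
x = rng.normal(size=(4, d_model))            # stand-in for real activations
W_enc = rng.normal(scale=0.02, size=(d_model, n_features))
W_dec = W_enc.T.copy()                       # tied initialisation, an assumption
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_row = (codes != 0).sum(axis=1)    # at most k active features per row
```

In a trained SAE the weights would be fit to minimise reconstruction error, and each column of the sparse code would be a candidate feature to annotate against resources such as GO.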

Original Abstract

Background: Single-cell foundation models such as Geneformer and scGPT encode rich biological information, but whether this includes causal regulatory logic rather than statistical co-expression remains unclear. Sparse autoencoders (SAEs) can resolve superposition in neural networks by decomposing dense activations into interpretable features, yet they have not been systematically applied to biological foundation models. Results: We trained TopK SAEs on residual stream activations from all layers of Geneformer V2-316M (18 layers, d=1152) and scGPT whole-human (12 layers, d=512), producing atlases of 82525 and 24527 features, respectively. Both atlases confirm massive superposition, with 99.8 percent of features invisible to SVD. Systematic characterization reveals rich biological organization: 29 to 59 percent of features annotate to Gene Ontology, KEGG, Reactome, STRING, or TRRUST, with U-shaped layer profiles reflecting hierarchical abstraction. Features organize into co-activation modules (141 in Geneformer, 76 in scGPT), exhibit causal specificity (median 2.36x), and form cross-layer information highways (63 to 99.8 percent). When tested against genome-scale CRISPRi perturbation data, only 3 of 48 transcription factors (6.2 percent) show regulatory-target-specific feature responses. A multi-tissue control yields marginal improvement (10.4 percent, 5 of 48 TFs), establishing model representations as the bottleneck. Conclusions: These models have internalized organized biological knowledge, including pathway membership, protein interactions, functional modules, and hierarchical abstraction, yet they encode minimal causal regulatory logic. We release both feature atlases as interactive web platforms enabling exploration of more than 107000 features across 30 layers of two leading single-cell foundation models.
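
As a quick consistency check on the figures quoted in the abstract, the short sketch below recomputes the totals and hit rates from the stated counts. The variable names are illustrative; every number is taken directly from the abstract.

```python
# Counts stated in the abstract.
geneformer_features = 82525
scgpt_features = 24527
geneformer_layers = 18
scgpt_layers = 12
tfs_tested = 48

# Totals: the "more than 107000 features across 30 layers" claim.
total_features = geneformer_features + scgpt_features
total_layers = geneformer_layers + scgpt_layers

# Share of transcription factors with regulatory-target-specific responses.
single_tissue_rate = 100 * 3 / tfs_tested   # reported as 6.2 percent
multi_tissue_rate = 100 * 5 / tfs_tested    # reported as 10.4 percent

print(total_features, total_layers)
print(single_tissue_rate, round(multi_tissue_rate, 1))
```

The exact single-tissue value is 6.25 percent, which the abstract reports as 6.2 percent.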

Tags

single-cell, foundation models, sparse autoencoders, gene regulation

arXiv Categories

q-bio.GN cs.LG q-bio.CB