LLM Reasoning relevance: 7/10

Inference-Time Toxicity Mitigation in Protein Language Models

Manuel Fernández Burda, Santiago Aranguri, Iván Arcuschin Moreno, Enzo Ferrante
arXiv: 2603.04045v1 Published: 2026-03-04 Updated: 2026-03-04

AI Summary

The paper proposes LDA, an inference-time method that requires no retraining, to reduce the risk of protein language models generating toxic proteins.

Key Contributions

  • Proposes LDA, a method that reduces toxic protein generation by PLMs
  • Shows that LDA preserves biological plausibility while reducing toxicity
  • Compares the strengths and weaknesses of LDA against other methods

Methodology

Uses the Logit Diff Amplification (LDA) technique, which modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model.
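The LDA mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the sign convention (steering away from the toxicity-finetuned model) and the amplification coefficient `alpha` are assumptions, and the four-token vocabulary is a toy stand-in for a PLM's amino-acid vocabulary.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def lda_logits(base_logits, toxic_logits, alpha):
    """Logit Diff Amplification sketch: push the next-token distribution
    away from the toxicity-finetuned model by amplifying the
    base-minus-toxic logit gap. Direction and alpha are assumptions."""
    return base_logits + alpha * (base_logits - toxic_logits)

# Toy 4-token vocabulary; the finetuned model favors token 2.
base = np.array([2.0, 1.0, 0.5, 0.1])
toxic = np.array([0.5, 1.0, 2.5, 0.1])

steered = lda_logits(base, toxic, alpha=1.0)
p_base, p_steered = softmax(base), softmax(steered)

# The token preferred by the toxicity-finetuned model loses probability mass.
assert p_steered[2] < p_base[2]
```

No retraining is involved: both models are only queried for logits at each decoding step, and `alpha` acts as the "safety knob", with `alpha = 0` recovering the baseline model's distribution.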

Original Abstract

Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation-based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.

Tags

Protein language models · Toxicity reduction · Biosafety · Inference-time methods

arXiv Categories

cs.LG cs.AI