AI Agents relevance: 5/10

Kill it with FIRE: On Leveraging Latent Space Directions for Runtime Backdoor Mitigation in Deep Neural Networks

Enrico Ahlers, Daniel Passon, Yannic Noller, Lars Grunske
arXiv: 2602.10780v1 Published: 2026-02-11 Updated: 2026-02-11

AI Summary

Proposes FIRE, a method that defends deep neural networks against backdoor attacks at runtime by manipulating the model's internal representations.

Key Contributions

  • Proposes FIRE, a novel runtime backdoor defense method.
  • Neutralizes backdoor triggers using latent-space directions.
  • Outperforms existing runtime mitigation methods on image benchmarks.

Methodology

By viewing the backdoor as directions in the latent space, FIRE applies these directions in reverse to neutralize the trigger and thereby correct the model's inference mechanism.
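The core idea can be sketched in a few lines: estimate the latent-space shift that the trigger induces, then move a sample's features back along that direction at inference time. The sketch below is illustrative only, not the paper's actual algorithm; the direction-estimation procedure (mean feature shift between poisoned and clean samples) and the `strength` parameter are simplifying assumptions.

```python
import numpy as np

def estimate_backdoor_direction(clean_feats, poisoned_feats):
    """Estimate the trigger-induced latent direction as the mean
    feature shift between poisoned and clean samples.
    (Assumed estimator for illustration; FIRE's procedure may differ.)"""
    direction = poisoned_feats.mean(axis=0) - clean_feats.mean(axis=0)
    return direction / np.linalg.norm(direction)

def repair_features(feats, direction, strength=1.0):
    """Move features along the reversed backdoor direction to
    neutralize the trigger's effect in the latent space."""
    # Project each feature vector onto the backdoor direction,
    # then subtract that component scaled by `strength`.
    coeffs = feats @ direction
    return feats - strength * np.outer(coeffs, direction)
```

In a deployed model, the repair step would be applied to intermediate activations between layers (e.g. via a forward hook) rather than to a standalone feature matrix as shown here.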

Original Abstract

Machine learning models are increasingly present in our everyday lives; as a result, they become targets of adversarial attackers seeking to manipulate the systems we interact with. A well-known vulnerability is a backdoor introduced into a neural network by poisoned training data or a malicious training process. Backdoors can be used to induce unwanted behavior by including a certain trigger in the input. Existing mitigations filter training data, modify the model, or perform expensive input modifications on samples. If a vulnerable model has already been deployed, however, those strategies are either ineffective or inefficient. To address this gap, we propose our inference-time backdoor mitigation approach called FIRE (Feature-space Inference-time REpair). We hypothesize that a trigger induces structured and repeatable changes in the model's internal representation. We view the trigger as directions in the latent spaces between layers that can be applied in reverse to correct the inference mechanism. Therefore, we turn the backdoored model against itself by manipulating its latent representations and moving a poisoned sample's features along the backdoor directions to neutralize the trigger. Our evaluation shows that FIRE has low computational overhead and outperforms current runtime mitigations on image benchmarks across various attacks, datasets, and network architectures.

Tags

Backdoor Attacks  Neural Network Security  Runtime Defense  Latent Space

arXiv Categories

cs.LG cs.AI cs.CR cs.CV