Agent Tuning & Optimization Relevance: 7/10

Tutor-Student Reinforcement Learning: A Dynamic Curriculum for Robust Deepfake Detection

Zhanhe Lei, Zhongyuan Wang, Jikang Cheng, Baojin Huang, Yuhong Yang, Zhen Han, Chao Liang, Dengpan Ye
arXiv: 2603.24139v1 Published: 2026-03-25 Updated: 2026-03-25

AI Summary

Proposes a reinforcement-learning-based dynamic curriculum learning method that improves the robustness and generalization of deepfake detection.

Key Contributions

  • Proposes the Tutor-Student Reinforcement Learning (TSRL) framework
  • Uses reinforcement learning to dynamically optimize the training curriculum
  • Introduces historical learning dynamics, such as EMA loss and forgetting counts, into the state representation

Methodology

Models the training process as a Markov Decision Process (MDP): a Tutor agent trained with PPO dynamically adjusts per-sample loss weights, and the reward mechanism encourages the Tutor to prioritize high-value samples.
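The loop described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the Student is stood in by a logistic-regression classifier on synthetic data, and the PPO Tutor is replaced by a simple heuristic policy (`tutor_weight`) over the same state features the paper names (per-sample EMA loss and forgetting counts), with actions clipped to [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Student": logistic regression on synthetic features, standing in
# for the deepfake detector (an assumption for illustration only).
X = rng.normal(size=(200, 8))
w_true = rng.normal(size=8)
y = (X @ w_true > 0).astype(float)

w = np.zeros(8)                         # Student parameters
ema_loss = np.zeros(len(X))             # per-sample EMA loss (state feature)
forget = np.zeros(len(X))               # forgetting counts (state feature)
prev_correct = np.zeros(len(X), dtype=bool)

def predict(w, X):
    return 1.0 / (1.0 + np.exp(-(X @ w)))

def tutor_weight(ema, forget):
    # Heuristic stand-in for the PPO Tutor's policy: emphasize samples
    # with higher recent loss but few forgetting events
    # ("hard-but-learnable"), clipped to the action range [0, 1].
    return np.clip(ema / (1.0 + forget), 0.0, 1.0)

for step in range(300):
    p = predict(w, X)
    loss = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    ema_loss = 0.9 * ema_loss + 0.1 * loss

    correct = (p > 0.5) == (y > 0.5)
    forget += (prev_correct & ~correct)        # was right, now wrong
    reward = np.mean(~prev_correct & correct)  # incorrect -> correct transitions
    prev_correct = correct

    a = tutor_weight(ema_loss, forget)         # Tutor action per sample
    grad = X.T @ (a * (p - y)) / len(X)        # re-weighted gradient step
    w -= 0.5 * grad

accuracy = np.mean((predict(w, X) > 0.5) == (y > 0.5))
```

In the paper the `reward` signal would drive PPO updates to the Tutor's policy; here it is only computed to show where that feedback enters the loop.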

Original Abstract

Standard supervised training for deepfake detection treats all samples with uniform importance, which can be suboptimal for learning robust and generalizable features. In this work, we propose a novel Tutor-Student Reinforcement Learning (TSRL) framework to dynamically optimize the training curriculum. Our method models the training process as a Markov Decision Process where a "Tutor" agent learns to guide a "Student" (the deepfake detector). The Tutor, implemented as a Proximal Policy Optimization (PPO) agent, observes a rich state representation for each training sample, encapsulating not only its visual features but also its historical learning dynamics, such as EMA loss and forgetting counts. Based on this state, the Tutor takes an action by assigning a continuous weight (0-1) to the sample's loss, thereby dynamically re-weighting the training batch. The Tutor is rewarded based on the Student's immediate performance change, specifically rewarding transitions from incorrect to correct predictions. This strategy encourages the Tutor to learn a curriculum that prioritizes high-value samples, such as hard-but-learnable examples, leading to a more efficient and effective training process. We demonstrate that this adaptive curriculum improves the Student's generalization capabilities against unseen manipulation techniques compared to traditional training methods. Code is available at https://github.com/wannac1/TSRL.
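The abstract's reward signal, rewarding transitions from incorrect to correct predictions, can be made concrete with a small sketch. The function name and the exact normalization (mean over the batch) are assumptions for illustration; the paper only states the transition-based reward principle.

```python
import numpy as np

def tutor_reward(prev_pred, new_pred, labels):
    """Simplified reading of the TSRL reward: the fraction of samples
    whose prediction flipped from incorrect to correct after the
    Student's update step. (Hypothetical helper, not the paper's code.)"""
    prev_ok = prev_pred == labels
    new_ok = new_pred == labels
    return float(np.mean(~prev_ok & new_ok))

# Example: sample 0 flips from wrong to right; the others do not improve.
labels    = np.array([1, 0, 1, 1])
prev_pred = np.array([0, 0, 0, 1])
new_pred  = np.array([1, 0, 0, 1])
print(tutor_reward(prev_pred, new_pred, labels))  # 0.25
```

Note that under this reading, samples that stay correct contribute nothing, so the Tutor is pushed toward samples it can still flip, which is what drives the "hard-but-learnable" prioritization.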

Tags

Deepfake Detection, Reinforcement Learning, Curriculum Learning, PPO

arXiv Categories

cs.CV cs.LG