ChopGrad: Pixel-Wise Losses for Latent Video Diffusion via Truncated Backpropagation
AI Summary
ChopGrad reduces the memory footprint of training video diffusion models via truncated backpropagation.
Key Contributions
- Proposes ChopGrad, a truncated backpropagation scheme
- Provides a theoretical analysis demonstrating its effectiveness
- Validates its advantages on multiple video generation tasks
Methodology
Gradient computation is restricted to local frame windows, which preserves global consistency while enabling efficient fine-tuning with pixel-wise losses.
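As a rough illustration (not the authors' code), the windowing pattern can be sketched in plain Python: when the loss on frame t is computed, only the activations of the most recent `window` frames are kept for backpropagation, so peak activation memory is bounded by the window size rather than the video length. The function name and structure below are hypothetical.

```python
def chopgrad_schedule(num_frames, window):
    """For each frame t, return the frame indices whose activations must be
    retained for backpropagation of a loss on frame t (hypothetical sketch).

    Earlier frames are treated as constants (their gradients are truncated),
    so the retained set never exceeds `window` frames -- constant memory in
    the number of video frames.
    """
    schedule = []
    for t in range(num_frames):
        start = max(0, t - window + 1)  # oldest frame still inside the window
        schedule.append(list(range(start, t + 1)))
    return schedule
```

With `window=4` and a 100-frame video, at most 4 frames' activations are ever retained, versus all 100 under full backpropagation.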
Original Abstract
Recent video diffusion models achieve high-quality generation through recurrent frame processing, where each frame's generation depends on previous frames. However, this recurrent mechanism means that training such models in the pixel domain incurs prohibitive memory costs, as activations accumulate across the entire video sequence. This fundamental limitation makes fine-tuning these models with pixel-wise losses computationally intractable for long or high-resolution videos. This paper introduces ChopGrad, a truncated backpropagation scheme for video decoding that limits gradient computation to local frame windows while maintaining global consistency. We provide a theoretical analysis of this approximation and show that it enables efficient fine-tuning with frame-wise losses. ChopGrad reduces training memory from linear in the number of video frames (full backpropagation) to constant, and compares favorably to existing state-of-the-art video diffusion models across a suite of conditional video generation tasks with pixel-wise losses, including video super-resolution, video inpainting, video enhancement of neural-rendered scenes, and controlled driving video generation.
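To make the truncation concrete, the following toy example (my own illustration, not from the paper) computes the gradient of a scalar linear recurrence s_t = a·s_{t-1} + x_t by hand, with and without truncating the backward pass to a window of recent steps. The truncated gradient only sums contributions from the last `window` steps, mirroring how a scheme like ChopGrad limits gradient flow through recurrent frame decoding at the cost of an approximation.

```python
def rollout(a, xs):
    # Recurrent state update: s_t = a * s_{t-1} + x_t, with s_0 = 0.
    s, states = 0.0, [0.0]
    for x in xs:
        s = a * s + x
        states.append(s)
    return states

def grad_wrt_a(a, xs, window=None):
    # Gradient of the final state s_T with respect to the shared
    # parameter `a`, by unrolling the chain rule:
    #   ds_T/da = sum over steps t of  s_{t-1} * a^(T-t)
    # Truncated backpropagation drops terms older than `window` steps,
    # so the backward cost is bounded by the window size, not by T.
    states = rollout(a, xs)
    T = len(xs)
    start = 0 if window is None else max(0, T - window)
    return sum(states[t] * a ** (T - 1 - t) for t in range(start, T))
```

For `a=0.5` and `xs=[1, 1, 1]`, the full gradient is 2.0 while truncation to a one-step window yields 1.5: the window controls how much of the recurrent history the gradient sees.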