Multimodal Learning Relevance: 8/10

Low-Resource Guidance for Controllable Latent Audio Diffusion

Zachary Novack, Zack Zukowski, CJ Carr, Julian Parker, Zach Evans, Josiah Taylor, Taylor Berg-Kirkpatrick, Julian McAuley, Jordi Pons

arXiv: 2603.04366v1 Published: 2026-03-04 Updated: 2026-03-04

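AI Summary

Proposes a low-resource, controllable approach to latent audio diffusion that achieves fine-grained audio control through selective TFG and Latent-Control Heads (LatCHs).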

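Main Contributions

Proposes selective TFG and LatCHs for low-cost control
Operates in latent space, avoiding the expensive decoder step
Demonstrates effective control over intensity, pitch, and beats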

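Methodology

Controls a latent audio diffusion model directly in latent space through selective TFG and Latent-Control Heads (LatCHs), which avoids backpropagating through the decoder and lowers computational cost.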
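
As a rough illustration of why guiding in latent space is cheap, the sketch below applies one loss-gradient guidance update directly to a latent, with a toy linear head standing in for a LatCH. Everything in it is an assumption made for illustration: the function names (`head_predict`, `guidance_loss`, `guidance_grad`, `guided_update`), the linear head, the mean-squared-error loss, the latent shape, and the step size are hypothetical and are not taken from the paper, whose heads are small trained networks.

```python
import numpy as np


def head_predict(w, z):
    """Hypothetical linear control head: latent (C, T) -> one control value per frame (T,)."""
    return w @ z


def guidance_loss(w, z, target):
    """Mean squared error between the head's prediction and the target control curve."""
    return float(np.mean((head_predict(w, z) - target) ** 2))


def guidance_grad(w, z, target):
    """Analytic gradient of the loss with respect to the latent z.

    The head reads the latent directly, so no decoder is evaluated
    or differentiated to obtain this gradient.
    """
    residual = head_predict(w, z) - target                   # shape (T,)
    return (2.0 / target.shape[0]) * np.outer(w, residual)   # shape (C, T)


def guided_update(w, z, target, step_size):
    """One gradient step that nudges the latent toward the target control."""
    return z - step_size * guidance_grad(w, z, target)


# Toy setup: 8 latent channels, 16 latent frames (assumed sizes).
rng = np.random.default_rng(0)
C, T = 8, 16
w = rng.normal(size=C)               # stand-in for trained head weights
z = rng.normal(size=(C, T))          # stand-in for a noisy latent
target = np.linspace(0.0, 1.0, T)    # e.g. a rising intensity curve

loss_before = guidance_loss(w, z, target)
z_guided = guided_update(w, z, target, step_size=0.1)
loss_after = guidance_loss(w, z_guided, target)
```

In an actual diffusion sampler an update of this kind would be applied between denoising steps. The point of the toy is only that the gradient has the same shape as the latent and is obtained without touching a decoder.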

Original Abstract

Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost-per-step due to decoder backpropagation, we introduce a guidance-based approach through selective TFG and Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step, and requiring minimal training resources (7M parameters and approximately 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and a combination of those) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.

arXiv Categories

cs.SD cs.AI cs.LG
