AI Agents relevance: 9/10

CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
arXiv: 2603.29664v1 Published: 2026-03-31 Updated: 2026-03-31

AI Summary

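CutClaw is a multi-agent framework that uses multimodal language models to automatically edit long videos, producing cuts that are synchronized to music and visually appealing.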

Main Contributions

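  • Proposes CutClaw, a multi-agent video editing framework
  • Adopts a hierarchical multimodal decomposition method
  • Optimizes the video cut through agent collaboration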

Methodology

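A hierarchical multimodal decomposition extracts audio and visual information, a Playwriter Agent builds the narrative flow, and Editor and Reviewer Agents collaborate to refine the cut.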
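The three stages above can be pictured with a toy sketch. This is not the CutClaw implementation: every name here (`Clip`, `plan_segments`, `propose`, `review`, `edit`), the numeric `score` field, and the acceptance rule are assumptions made for illustration, and simple heuristics stand in for the MLLM agents. It only shows the control flow the summary describes: slots anchored to musical shifts, an editor that proposes a clip per slot, and a reviewer that can send the proposal back.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: float  # seconds into the raw footage
    end: float
    score: float  # hypothetical aesthetic/semantic score in [0, 1]


def plan_segments(music_shifts, total_duration):
    """Playwriter stand-in: one output slot between consecutive musical shifts."""
    bounds = [0.0] + sorted(music_shifts) + [total_duration]
    return [(a, b) for a, b in zip(bounds, bounds[1:]) if b > a]


def propose(clips, slot_len, rejected):
    """Editor stand-in: index of the best-scoring clip that can fill the slot."""
    candidates = [
        i for i, c in enumerate(clips)
        if c.end - c.start >= slot_len and i not in rejected
    ]
    return max(candidates, key=lambda i: clips[i].score) if candidates else None


def review(clip, index, used, min_score):
    """Reviewer stand-in: reject reused footage and low-scoring clips."""
    return index not in used and clip.score >= min_score


def edit(clips, music_shifts, total_duration, min_score=0.5, max_rounds=3):
    """Fill each slot via a bounded propose/review loop (assumed, not from the paper)."""
    timeline, used = [], set()
    for slot_start, slot_end in plan_segments(music_shifts, total_duration):
        rejected, chosen = set(), None
        for _ in range(max_rounds):
            idx = propose(clips, slot_end - slot_start, rejected)
            if idx is None:
                break  # nothing left to propose for this slot
            if review(clips[idx], idx, used, min_score):
                chosen = idx
                used.add(idx)
                break
            rejected.add(idx)  # reviewer feedback: try the next candidate
        timeline.append((slot_start, slot_end, chosen))
    return timeline
```

With clips `[Clip(0, 10, 0.9), Clip(20, 24, 0.8), Clip(30, 33, 0.3)]`, a single musical shift at 4.0 s, and an 8.0 s target, the editor proposes the first clip for both slots; the reviewer rejects the repeat in the second slot, so the editor falls back to the second clip.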

Original Abstract

Editing the video content with audio alignment forms a digital human-made art in current social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework designed to edit hours-long raw footage into meaningful short videos that leverages the capabilities of multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos with synchronized music, followed by instructions, and a visually appealing appearance. In detail, our approach begins by employing a hierarchical multimodal decomposition that captures both fine-grained details and global structures across visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the whole storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct a short edited video, Editor and Reviewer Agents collaboratively optimize the final cut via selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. We conduct detailed experiments to demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.

Tags

Multi-agent, video editing, multimodal learning, music synchronization

arXiv Categories

cs.CV