Multimodal Learning Relevance: 8/10

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang
arXiv: 2603.19228v1 Published: 2026-03-19 Updated: 2026-03-19

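AI Summary

SAMA decouples semantic anchoring from motion alignment to improve instruction-guided video editing, yielding more precise semantic modifications and more faithful motion preservation.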

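Key Contributions

  • Proposes Semantic Anchoring, which enables instruction-aware structural planning
  • Proposes Motion Alignment, which strengthens motion modeling through video-restoration pre-training
  • Optimizes the model in two stages: factorized pre-training followed by supervised fine-tuning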

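Methodology

SAMA factorizes video editing into semantic anchoring and motion modeling. Both are learned in a factorized pre-training stage that needs no paired video-instruction editing data, after which the model is fine-tuned with supervision on paired editing data.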

Original Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
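
The abstract names three motion-centric restoration pretext tasks but does not specify them. The sketch below is one plausible NumPy reading of each corruption, not the authors' implementation. The function names, the cube and patch sizes, the masking ratio, and the interpretation of "tube shuffle" as permuting the temporal order inside each spatial tube are all assumptions made for illustration. A restoration model would be trained to recover the clean clip from the corrupted one.

```python
import numpy as np

# Hypothetical helpers: names, sizes, and ratios are illustrative assumptions.
# Videos are arrays of shape (T, H, W, C).

def cube_mask(video, cube=(4, 16, 16), ratio=0.5, rng=None):
    """Cube inpainting input: zero out random spatio-temporal cubes."""
    if rng is None:
        rng = np.random.default_rng()
    t, h, w, _ = video.shape
    ct, ch, cw = cube
    assert t % ct == 0 and h % ch == 0 and w % cw == 0
    # Decide per cube whether it is masked, then expand to pixel resolution.
    grid = rng.random((t // ct, h // ch, w // cw)) < ratio
    mask = grid.repeat(ct, axis=0).repeat(ch, axis=1).repeat(cw, axis=2)
    corrupted = video.copy()
    corrupted[mask] = 0
    return corrupted, mask

def speed_perturb(video, factor):
    """Speed perturbation: resample frames at a different playback rate."""
    t = video.shape[0]
    # Indices past the end of the clip are clamped to the last frame.
    idx = np.clip(np.floor(np.arange(t) * factor), 0, t - 1).astype(int)
    return video[idx], idx

def tube_shuffle(video, patch=16, rng=None):
    """Tube shuffle (assumed reading): permute time within each spatial tube."""
    if rng is None:
        rng = np.random.default_rng()
    t, h, w, _ = video.shape
    shuffled = video.copy()
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            perm = rng.permutation(t)
            shuffled[:, y:y + patch, x:x + patch] = video[perm, y:y + patch, x:x + patch]
    return shuffled
```

All three corruptions need only raw video, which is consistent with the abstract's claim that the pre-training stage uses no paired video-instruction editing data.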

Tags

Video Editing, Multimodal Learning, Instruction-Guided, Semantic Anchoring, Motion Alignment

arXiv Categories

cs.CV