LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation
AI Summary
LSA enhances temporal consistency in traffic video generation by aligning semantic features, without requiring additional control signals.
Main Contributions
- Proposes the LSA framework for enhancing temporal consistency in video generation
- Uses a semantic feature consistency loss to fine-tune a pre-trained model
- Validates the effectiveness of the method on the nuScenes and KITTI datasets
Methodology
LSA computes a semantic feature consistency loss by comparing the semantic features of dynamic objects in ground-truth and generated videos, and fine-tunes the model by combining this loss with the diffusion loss.
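A minimal PyTorch-style sketch of what this combined objective could look like. The helper names, the box format, the MSE distance, the 224x224 resize, and the `lambda_sem` weight are all assumptions for illustration; the paper only specifies that features from an off-the-shelf extractor are compared around dynamic objects and combined with the standard diffusion loss.

```python
import torch
import torch.nn.functional as F

def semantic_consistency_loss(gt_clip, gen_clip, boxes, feature_extractor):
    """Semantic feature consistency loss localized around dynamic objects.

    gt_clip, gen_clip: tensors of shape (T, C, H, W)
    boxes: list of (t, x1, y1, x2, y2) integer boxes for dynamic objects
    feature_extractor: frozen off-the-shelf feature extraction model
    """
    losses = []
    for t, x1, y1, x2, y2 in boxes:
        # Crop the same region around the object from both clips.
        gt_patch = gt_clip[t:t + 1, :, y1:y2, x1:x2]
        gen_patch = gen_clip[t:t + 1, :, y1:y2, x1:x2]
        # Resize to the extractor's expected input size (224x224 is an assumption).
        gt_patch = F.interpolate(gt_patch, size=(224, 224), mode="bilinear")
        gen_patch = F.interpolate(gen_patch, size=(224, 224), mode="bilinear")
        # Ground-truth features serve as the alignment target.
        with torch.no_grad():
            gt_feat = feature_extractor(gt_patch)
        gen_feat = feature_extractor(gen_patch)
        losses.append(F.mse_loss(gen_feat, gt_feat))
    return torch.stack(losses).mean()

def combined_loss(diffusion_loss, gt_clip, gen_clip, boxes,
                  feature_extractor, lambda_sem=0.1):
    # Fine-tuning objective: standard diffusion loss plus the semantic
    # feature consistency term (the weight lambda_sem is an assumption).
    l_sem = semantic_consistency_loss(gt_clip, gen_clip, boxes, feature_extractor)
    return diffusion_loss + lambda_sem * l_sem
```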
Original Abstract
Controllable video generation has emerged as a versatile tool for autonomous driving, enabling realistic synthesis of traffic scenarios. However, existing methods depend on control signals at inference time to guide the generative model towards temporally consistent generation of dynamic objects, limiting their utility as scalable and generalizable data engines. In this work, we propose Localized Semantic Alignment (LSA), a simple yet effective framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Specifically, we compare the output of an off-the-shelf feature extraction model between the ground-truth and generated video clips localized around dynamic objects, inducing a semantic feature consistency loss. We fine-tune the base model by combining this loss with the standard diffusion loss. The model fine-tuned for a single epoch with our novel loss outperforms the baselines on common video generation evaluation metrics. To further test the temporal consistency of generated videos, we adapt two additional metrics from the object detection task, namely mAP and mIoU. Extensive experiments on the nuScenes and KITTI datasets show the effectiveness of our approach in enhancing temporal consistency in video generation without the need for external control signals during inference or any computational overhead.
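One way to read the adapted mIoU metric: run a detector on the generated frames and measure how well its boxes overlap the ground-truth annotations of dynamic objects across the clip. The abstract does not specify the detector or the matching strategy, so the sketch below (greedy best-overlap matching per frame) is an assumption.

```python
import numpy as np

def box_iou(a, b):
    """IoU between two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes_per_frame, gt_boxes_per_frame):
    """Mean IoU between detections on generated frames and ground-truth boxes.

    Each ground-truth box is matched greedily to its best-overlapping
    detection in the same frame (matching strategy is an assumption).
    """
    ious = []
    for preds, gts in zip(pred_boxes_per_frame, gt_boxes_per_frame):
        for gt in gts:
            if preds:
                ious.append(max(box_iou(p, gt) for p in preds))
            else:
                ious.append(0.0)  # a missed object counts as zero overlap
    return float(np.mean(ious)) if ious else 0.0
```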