Novel Semantic Prompting for Zero-Shot Action Recognition
AI Summary
The paper proposes SP-CLIP, a framework that augments vision-language models with semantic prompts to improve zero-shot action recognition performance.
Main Contributions
- Proposes a zero-shot action recognition method based on structured semantic prompts
- Designs semantic prompts at multiple levels of abstraction, covering intent, motion, and object interaction
- Demonstrates experimentally that semantic prompting substantially improves fine-grained and compositional action recognition
Methodology
Augments a frozen vision-language model with multi-level semantic prompts, aligning video and text representations through prompt aggregation and consistency scoring.
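The paper does not publish implementation details here, but the aggregation-and-scoring idea can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the `embed_text` stand-in, the prompt wording, and the 0.5 consistency weight are all assumptions; a real system would embed prompts and video clips with a frozen CLIP-style encoder.

```python
import zlib
import numpy as np

DIM = 512

def embed_text(text):
    # Stand-in for a frozen CLIP text encoder: a deterministic
    # hash-seeded unit vector. Real code would call the pretrained model.
    g = np.random.default_rng(zlib.crc32(text.encode()))
    v = g.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Multi-level semantic prompts per action class (intent / motion /
# object interaction). Wording is illustrative, not from the paper.
CLASS_PROMPTS = {
    "opening a bottle": [
        "a person intends to drink",            # intent level
        "twisting motion of the wrist",         # motion level
        "hand grips and rotates a bottle cap",  # object-interaction level
    ],
    "throwing a ball": [
        "a person intends to pass an object",
        "arm swings forward rapidly",
        "hand releases a ball mid-swing",
    ],
}

def class_score(video_emb, prompts):
    """Aggregate prompt embeddings, then combine cosine similarity
    with a consistency term (agreement across prompt levels)."""
    P = np.stack([embed_text(p) for p in prompts])  # (n_prompts, DIM)
    agg = P.mean(axis=0)
    agg /= np.linalg.norm(agg)
    sims = P @ video_emb          # per-prompt cosine similarities
    consistency = sims.min()      # weakest-level agreement
    return float(agg @ video_emb + 0.5 * consistency)

def classify(video_emb):
    scores = {c: class_score(video_emb, ps)
              for c, ps in CLASS_PROMPTS.items()}
    return max(scores, key=scores.get)

# Toy "video" embedding: align it with one class's prompts so the
# demo is self-consistent.
target = "throwing a ball"
v = sum(embed_text(p) for p in CLASS_PROMPTS[target])
v = v / np.linalg.norm(v)
print(classify(v))  # recovers the target class
```

No visual encoder is modified and no parameters are learned: all class information enters through the text prompts, which is what makes the approach zero-shot.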
Original Abstract
Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.