Multimodal Learning Relevance: 9/10

Novel Semantic Prompting for Zero-Shot Action Recognition

Salman Iqbal, Waheed Rehman
arXiv: 2603.08289v1 Published: 2026-03-09 Updated: 2026-03-09

AI Summary

The paper proposes SP-CLIP, a framework that augments vision-language models with semantic prompts to improve zero-shot action recognition.

Key Contributions

  • Proposes a zero-shot action recognition method based on structured semantic prompts
  • Designs semantic prompts at multiple levels of abstraction, covering intent, motion, and object interaction
  • Demonstrates experimentally that semantic prompting significantly improves fine-grained and compositional action recognition

Methodology

Augments a frozen vision-language model with multi-level semantic prompts, aligning video and text representations through prompt aggregation and consistency scoring.
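The aggregation-and-scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the mean-pooling of prompts within each level, the averaging of per-level cosine similarities, and all function and variable names (`sp_clip_score`, `classify`, `prompt_embs_per_level`) are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def sp_clip_score(video_emb, prompt_embs_per_level):
    """Score one action class against a video embedding.

    prompt_embs_per_level: list of arrays, one per abstraction level
    (e.g. intent, motion, object interaction), each of shape (n_prompts, d).
    Assumed scheme: mean-pool prompt embeddings within each level
    (prompt aggregation), then average the per-level cosine similarities
    (consistency scoring).
    """
    v = l2_normalize(video_emb)
    level_sims = []
    for embs in prompt_embs_per_level:
        level_proto = l2_normalize(embs.mean(axis=0))  # aggregate prompts of one level
        level_sims.append(float(v @ level_proto))      # cosine similarity for this level
    return float(np.mean(level_sims))                  # agreement across levels

def classify(video_emb, class_prompts):
    """Pick the class whose multi-level prompts best agree with the video."""
    scores = {name: sp_clip_score(video_emb, levels)
              for name, levels in class_prompts.items()}
    return max(scores, key=scores.get), scores
```

Because the visual encoder stays frozen and no parameters are learned, the method reduces at inference time to this kind of similarity computation over enriched text embeddings.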

Original Abstract

Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.

Tags

Zero-Shot Learning  Action Recognition  Vision-Language Models  Semantic Prompting

arXiv Categories

cs.CV