Novel Semantic Prompting for Zero-Shot Action Recognition
AI Summary
The paper proposes SP-CLIP, a framework that augments vision-language models with semantic prompts to improve zero-shot action recognition performance.
Main Contributions
- Proposes a zero-shot action recognition method based on structured semantic prompts
- Designs semantic prompts at multiple levels of abstraction, covering intent, motion, and object interaction
- Demonstrates experimentally that semantic prompting substantially improves fine-grained and compositional action recognition
Methodology
Augments a frozen vision-language model with multi-level semantic prompts, aligning video and text representations through prompt aggregation and consistency scoring.
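The paper does not publish implementation details here, but the aggregation-and-scoring idea can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the `embed_text` stand-in, the prompt wording, and the 0.5 consistency weight are all assumptions; a real system would embed prompts and video clips with a frozen CLIP-style encoder.

```python
import zlib
import numpy as np

DIM = 512

def embed_text(text):
    # Stand-in for a frozen CLIP text encoder: a deterministic
    # hash-seeded unit vector. Real code would call the pretrained model.
    g = np.random.default_rng(zlib.crc32(text.encode()))
    v = g.standard_normal(DIM)
    return v / np.linalg.norm(v)

# Multi-level semantic prompts per action class (intent / motion /
# object interaction). Wording is illustrative, not from the paper.
CLASS_PROMPTS = {
    "opening a bottle": [
        "a person intends to drink",            # intent level
        "twisting motion of the wrist",         # motion level
        "hand grips and rotates a bottle cap",  # object-interaction level
    ],
    "throwing a ball": [
        "a person intends to pass an object",
        "arm swings forward rapidly",
        "hand releases a ball mid-swing",
    ],
}

def class_score(video_emb, prompts):
    """Aggregate prompt embeddings, then combine cosine similarity
    with a consistency term (agreement across prompt levels)."""
    P = np.stack([embed_text(p) for p in prompts])  # (n_prompts, DIM)
    agg = P.mean(axis=0)
    agg /= np.linalg.norm(agg)
    sims = P @ video_emb          # per-prompt cosine similarities
    consistency = sims.min()      # weakest-level agreement
    return float(agg @ video_emb + 0.5 * consistency)

def classify(video_emb):
    scores = {c: class_score(video_emb, ps)
              for c, ps in CLASS_PROMPTS.items()}
    return max(scores, key=scores.get)

# Toy "video" embedding: align it with one class's prompts so the
# demo is self-consistent.
target = "throwing a ball"
v = sum(embed_text(p) for p in CLASS_PROMPTS[target])
v = v / np.linalg.norm(v)
print(classify(v))  # recovers the target class
```

No visual encoder is modified and no parameters are learned: all class information enters through the text prompts, which is what makes the approach zero-shot.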
Original Abstract
Zero-shot action recognition relies on transferring knowledge from vision-language models to unseen actions using semantic descriptions. While recent methods focus on temporal modeling or architectural adaptations to handle video data, we argue that semantic prompting alone provides a strong and underexplored signal for zero-shot action understanding. We introduce SP-CLIP, a lightweight framework that augments frozen vision-language models with structured semantic prompts describing actions at multiple levels of abstraction, such as intent, motion, and object interaction. Without modifying the visual encoder or learning additional parameters, SP-CLIP aligns video representations with enriched textual semantics through prompt aggregation and consistency scoring. Experiments across standard benchmarks show that semantic prompting substantially improves zero-shot action recognition, particularly for fine-grained and compositional actions, while preserving the efficiency and generalization of pretrained models.