Multimodal Learning 相关度: 7/10

Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Jian Sun, Mohammad H. Mahoor

arXiv: 2603.10965v1 发布: 2026-03-11 更新: 2026-03-11

下载 PDF arXiv 页面

AI 摘要

提出一种结合对比学习和视频质量评估的视频识别方法SSL-V3，提升低质量视频识别的准确率。

主要贡献

提出结合VQA的自监督学习视频识别框架SSL-V3
使用Combined-SSL机制将VQA融入视频分类
解决视频数据集中VQA标签稀缺问题

方法论

利用对比学习和视频视觉Transformer，结合视频质量评估作为因素调整视频分类特征图，使用监督分类任务调整VQA参数。

原文摘要

Video quality significantly affects video classification. We found this problem when we classified Mild Cognitive Impairment well from clear videos, but worse from blurred ones. From then, we realized that referring to Video Quality Assessment (VQA) may improve video classification. This paper proposed Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to fulfill the goal. SSL-V3 leverages Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersected point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3's effectiveness.

arXiv 分类

cs.CV

AI 摘要

主要贡献

方法论

原文摘要

标签

arXiv 分类