Multimodal Learning · Relevance: 7/10

Dual-Model Prediction of Affective Engagement and Vocal Attractiveness from Speaker Expressiveness in Video Learning

Hung-Yue Suen, Kuo-En Hung, Fan-Hsun Tseng
arXiv: 2603.18758v1 · Published: 2026-03-19 · Updated: 2026-03-19

AI Summary

The paper proposes a dual-model approach that predicts audience affective engagement and vocal attractiveness from speaker-side affective expressions.

Main Contributions

  • Proposes a speaker-centric Emotion AI approach that predicts audience feedback without requiring any audience-side information
  • Builds a large-scale corpus from Massive Open Online Courses (MOOCs)
  • Develops regression models that predict affective engagement and vocal attractiveness, achieving strong results on speaker-independent test sets

Methodology

The affective engagement model is built from facial dynamics, oculomotor features, speech prosody, and cognitive semantics, while the vocal attractiveness model is built from speaker-side acoustic features alone.
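As a rough illustration of this dual-model design, the sketch below trains two separate regressors on synthetic placeholder data: one on the concatenated multimodal features for affective engagement, and one on acoustic features alone for vocal attractiveness. The feature dimensions, variable names, and the choice of a gradient-boosting regressor from scikit-learn are assumptions made for illustration, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_clips = 500

# Hypothetical per-clip speaker-side feature blocks (synthetic placeholders).
facial   = rng.normal(size=(n_clips, 32))   # facial dynamics
oculo    = rng.normal(size=(n_clips, 8))    # oculomotor features
prosody  = rng.normal(size=(n_clips, 16))   # speech prosody
semantic = rng.normal(size=(n_clips, 64))   # cognitive-semantic embeddings
acoustic = rng.normal(size=(n_clips, 40))   # acoustic descriptors

# Hypothetical targets: aggregated audience ratings per clip.
y_engagement = rng.normal(size=n_clips)
y_vocal      = rng.normal(size=n_clips)

# Model 1: affective engagement from the full multimodal feature set.
X_engagement = np.hstack([facial, oculo, prosody, semantic])
engagement_model = GradientBoostingRegressor().fit(X_engagement, y_engagement)

# Model 2: vocal attractiveness from acoustic features only.
vocal_model = GradientBoostingRegressor().fit(acoustic, y_vocal)
```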

Original Abstract

This paper outlines a machine learning-enabled speaker-centric Emotion AI approach capable of predicting audience-affective engagement and vocal attractiveness in asynchronous video-based learning, relying solely on speaker-side affective expressions. Inspired by the demand for scalable, privacy-preserving affective computing applications, this speaker-centric Emotion AI approach incorporates two distinct regression models that leverage a massive corpus developed within Massive Open Online Courses (MOOCs) to enable affectively engaging experiences. The regression model predicting affective engagement is developed by assimilating emotional expressions emanating from facial dynamics, oculomotor features, prosody, and cognitive semantics, while incorporating a second regression model to predict vocal attractiveness based exclusively on speaker-side acoustic features. Notably, on speaker-independent test sets, both regression models yielded impressive predictive performance (R² = 0.85 for affective engagement and R² = 0.88 for vocal attractiveness), confirming that speaker-side affect can functionally represent aggregated audience feedback. This paper provides a speaker-centric Emotion AI approach substantiated by an empirical study discovering that speaker-side multimodal features, including acoustics, can prospectively forecast audience feedback without necessarily employing audience-side input information.
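The abstract reports R² on speaker-independent test sets; the sketch below shows one common way such an evaluation could be set up, holding out entire speakers with a group-wise split so no speaker appears in both training and test data. The GroupShuffleSplit choice, synthetic data, and variable names are assumptions for illustration, not the authors' protocol.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_clips, n_speakers = 500, 50

X = rng.normal(size=(n_clips, 120))                 # speaker-side feature matrix
y = rng.normal(size=n_clips)                        # aggregated audience rating
speaker_id = rng.integers(0, n_speakers, n_clips)   # one speaker id per clip

# Group-wise split: all clips from a held-out speaker go to the test set,
# making the evaluation speaker-independent.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=speaker_id))

model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])

# With synthetic random data this R² is meaningless; the point is the protocol.
print("speaker-independent R²:", r2_score(y[test_idx], model.predict(X[test_idx])))
```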

Tags

Emotion AI · Multimodal Learning · Video Analysis · Machine Learning · Education

arXiv Categories

cs.HC cs.CV cs.SD