Multimodal Learning (relevance: 9/10)

Beyond Hate: Differentiating Uncivil and Intolerant Speech in Multimodal Content Moderation

Nils A. Herrmann, Tobias Eder, Jingyi He, Georg Groh
arXiv: 2603.22985v1 Published: 2026-03-24 Updated: 2026-03-24

AI Summary

This paper differentiates uncivil from intolerant speech and proposes a fine-grained annotation scheme for multimodal content moderation.

Key Contributions

  • Proposes a fine-grained annotation scheme that distinguishes incivility from intolerance
  • Shows that combining fine-grained with coarse-grained labels improves model performance
  • Demonstrates that models trained with the fine-grained annotations exhibit more balanced error rates in content moderation

Methodology

On the Hateful Memes dataset, vision-language models are trained with the fine-grained annotations, then compared against and jointly trained with the coarse-grained labels.
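The joint learning setup can be sketched as a multi-task objective: a coarse hatefulness loss plus weighted losses for the two fine-grained dimensions. This is a minimal illustrative sketch, not the paper's implementation; the per-head probabilities, the function names, and the weight `alpha` are all hypothetical.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one example (p: predicted probability, y: 0/1 label)."""
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def joint_loss(p_hate, y_hate, p_incivil, y_incivil, p_intoler, y_intoler, alpha=0.5):
    """Joint objective: coarse hatefulness loss plus the two fine-grained losses.

    alpha weights the fine-grained terms; 0.5 is an arbitrary illustrative
    choice, not a value taken from the paper.
    """
    coarse = bce(p_hate, y_hate)
    fine = bce(p_incivil, y_incivil) + bce(p_intoler, y_intoler)
    return coarse + alpha * fine

# A meme predicted hateful (0.9) whose fine-grained labels say: civil in tone,
# but intolerant in content.
loss = joint_loss(0.9, 1, 0.2, 0, 0.8, 1, alpha=0.5)
```

Separating the two dimensions lets the model receive a training signal even when tone and content disagree (e.g. a politely worded but group-targeting meme), which a single binary label would collapse.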

Original Abstract

Current multimodal toxicity benchmarks typically use a single binary hatefulness label. This coarse approach conflates two fundamentally different characteristics of expression: tone and content. Drawing on communication science theory, we introduce a fine-grained annotation scheme that distinguishes two separable dimensions: incivility (rude or dismissive tone) and intolerance (content that attacks pluralism and targets groups or identities) and apply it to 2,030 memes from the Hateful Memes dataset. We evaluate different vision-language models under coarse-label training, transfer learning across label schemes and a joint learning approach that combines the coarse hatefulness label with our fine-grained annotations. Our results show that fine-grained annotations complement existing coarse labels and, when used jointly, improve overall model performance. Moreover, models trained with the fine-grained scheme exhibit more balanced moderation-relevant error profiles and are less prone to under-detection of harmful content than models trained on hatefulness labels alone (FNR-FPR, the difference between false negative and false positive rates: 0.74 to 0.42 for LLaVA-1.6-Mistral-7B; 0.54 to 0.28 for Qwen2.5-VL-7B). This work contributes to data-centric approaches in content moderation by improving the reliability and accuracy of moderation systems through enhanced data quality. Overall, combining both coarse and fine-grained labels provides a practical route to more reliable multimodal moderation.
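The FNR-FPR gap reported above (e.g. 0.74 to 0.42 for LLaVA-1.6-Mistral-7B) measures how skewed a moderation model's errors are toward under-detection. A minimal sketch of the metric, assuming binary labels where 1 marks harmful content (the toy data below is invented for illustration):

```python
def fnr_fpr_gap(y_true, y_pred):
    """FNR - FPR: positive values mean the model misses harmful content
    (false negatives) more often than it over-flags benign content
    (false positives); values near 0 indicate a balanced error profile."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fnr = fn / (fn + tp) if fn + tp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return fnr - fpr

# Toy example: 4 harmful memes (1) and 4 benign (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]  # misses 3 of 4 harmful, flags 1 benign
print(fnr_fpr_gap(y_true, y_pred))  # 0.75 - 0.25 = 0.5
```

A drop in this gap, as reported for both models, means the fine-grained training shifted errors away from silently letting harmful content through.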

Tags

Multimodal Learning  Content Moderation  Natural Language Processing  Vision-Language Models  Harmful Content Detection

arXiv Categories

cs.CL cs.CY