Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor
AI Summary
The paper introduces a multimodal, multilingual benchmark dataset for detecting and understanding harmful humor, and evaluates existing models on it.
Main Contributions
- Constructed a multimodal, multilingual harmful-humor dataset spanning text, images, and videos
- Proposed annotation guidelines that distinguish safe humor from explicitly and implicitly harmful humor
- Evaluated existing models on harmful-humor detection, revealing performance gaps across languages and between models
Methodology
The dataset is manually curated under defined annotation guidelines; existing SOTA models are then evaluated on it, comparing performance across models and languages.
Original Abstract
Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing *Safe* jokes from *Harmful* ones, with the latter further classified into *Explicit* (overt) and *Implicit* (covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.