Multimodal Learning Relevance: 9/10

Towards Training-free Multimodal Hate Localisation with Large Language Models

Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu
arXiv: 2602.09637v1 Published: 2026-02-10 Updated: 2026-02-10

AI Summary

LELA is the first training-free, LLM-based framework for localizing hateful content in videos, and it outperforms existing training-free baselines.

Key Contributions

  • Proposes LELA, the first training-free LLM framework for hate video localization
  • Combines modality-specific captioning with multi-stage prompting to achieve fine-grained temporal localization
  • Introduces a composition matching mechanism to strengthen cross-modal reasoning

Methodology

Decomposes each video into five modalities (image, speech, OCR, music, and video context), computes per-frame hateful scores via a multi-stage prompting scheme, and applies composition matching to strengthen cross-modal reasoning; a code sketch of this flow follows.
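The abstract gives the pipeline only at a high level, so the following is a minimal Python sketch of the described flow, assuming a generic chat-completion callable `query_llm`. The stage split, prompt wording, helper names (`stage1_describe`, `stage2_score`, `localize`), and the 0.5 threshold are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The five modalities named in the abstract.
MODALITIES = ["image", "speech", "ocr", "music", "video_context"]

@dataclass
class Frame:
    index: int
    captions: Dict[str, str]   # modality name -> caption text

def stage1_describe(frame: Frame, query_llm: Callable[[str], str]) -> str:
    # Stage 1 (assumed): fuse the per-modality captions into one description.
    lines = [f"{m}: {frame.captions.get(m, '(none)')}" for m in MODALITIES]
    prompt = ("Summarize this video frame from its per-modality captions:\n"
              + "\n".join(lines))
    return query_llm(prompt)

def stage2_score(description: str, query_llm: Callable[[str], str]) -> float:
    # Stage 2 (assumed): ask the LLM for a hatefulness score in [0, 1].
    prompt = ("Rate the frame described below from 0 (benign) to 1 "
              f"(clearly hateful). Answer with a single number.\n\n{description}")
    try:
        return min(max(float(query_llm(prompt).strip()), 0.0), 1.0)
    except ValueError:
        return 0.0   # unparseable reply: fall back to benign

def localize(frames: List[Frame], query_llm: Callable[[str], str],
             threshold: float = 0.5) -> List[int]:
    # Keep indices of frames whose score crosses the threshold; runs of
    # consecutive kept frames form the temporally localized segments.
    kept = []
    for frame in frames:
        description = stage1_describe(frame, query_llm)
        if stage2_score(description, query_llm) >= threshold:
            kept.append(frame.index)
    return kept
```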

Original Abstract

The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.
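The abstract names a composition matching mechanism but does not define it; one plausible reading is that per-modality hateful scores are cross-checked so that frames flagged by several modalities reinforce each other. The `composition_match` function below, along with its agreement threshold and bonus weight, is an assumption under that reading, not the paper's published mechanism.

```python
from typing import Dict

def composition_match(scores: Dict[str, float], agree_thresh: float = 0.5,
                      bonus: float = 0.1) -> float:
    # Fuse per-modality hateful scores into one frame-level score:
    # start from the strongest single cue, then nudge the score upward
    # for each additional modality that also crosses the agreement
    # threshold, capped at 1.0. All constants here are illustrative.
    base = max(scores.values())
    agreeing = sum(s >= agree_thresh for s in scores.values())
    return min(1.0, base + bonus * max(0, agreeing - 1))

# Example: image and OCR both flag the frame; the agreement of two
# modalities lifts the fused score from 0.8 to 0.9.
frame_scores = {"image": 0.8, "speech": 0.2, "ocr": 0.7,
                "music": 0.1, "video_context": 0.3}
print(composition_match(frame_scores))  # 0.9
```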

Tags

LLM · Multimodal · Hate Speech Detection · Video Analysis · Training-free

arXiv Categories

cs.CV cs.MM