MM-HSD: Multi-Modal Hate Speech Detection in Videos

Céspedes-Sarrias, B.; Collado-Capell, C.; Rodenas-Ruiz, P.; Hrynenko, O.; Cavallaro, A.

doi:10.1145/3746027.3754558

Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM '25), Dublin, Ireland, 2025 · Equal contribution: B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz.

arXiv:2508.20546

Abstract

MM-HSD is a multi-modal model for hate-speech detection in videos that integrates video frames, audio, and text (from speech transcripts and on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for this task, systematically comparing query/key configurations across modalities. On-screen text works best as the query, and on the HateMM dataset MM-HSD reaches state-of-the-art performance with a macro-F1 (M-F1) score of 0.874.
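
To make the idea of one modality "querying" another concrete, the following is a minimal sketch of scaled dot-product cross-attention in plain Python. It is not the authors' implementation: the function names, the feature dimension, the random toy features, and the choice of audio as the key/value modality are all assumptions made for illustration, and learned projections and multiple heads are left out. Only the use of on-screen text as the query follows the abstract.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_feats, context_feats):
    """Scaled dot-product attention where one modality attends to another.

    query_feats:   list of vectors for the query modality
    context_feats: list of vectors for the key/value modality (same dimension)
    Returns (attended, weights) with one attended vector per query item.
    Hypothetical helper: learned Q/K/V projections are omitted for brevity.
    """
    dim = len(query_feats[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for q in query_feats:
        # Similarity of this query item to every context item.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context_feats]
        w = softmax(scores)
        # Weighted sum of context vectors (the context doubles as the values).
        mixed = [sum(w_j * k[d] for w_j, k in zip(w, context_feats))
                 for d in range(dim)]
        attended.append(mixed)
        weights.append(w)
    return attended, weights


# Toy stand-ins for real embeddings (assumed shapes, random values).
random.seed(0)
DIM = 8
onscreen_text = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]  # query
audio = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]  # key/value

cma_feats, attn = cross_modal_attention(onscreen_text, audio)
print(len(cma_feats), len(cma_feats[0]))  # one DIM-sized vector per query item
```

In a fusion model of the kind the abstract describes, features such as `cma_feats` would be combined with the per-modality features before classification; how that combination is done here is left open, since the abstract does not specify it.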

Cite (BibTeX)

@inproceedings{cspedessarrias2025mmhsd,
  title = {{MM-HSD: Multi-Modal Hate Speech Detection in Videos}},
  author = {Céspedes-Sarrias, B. and Collado-Capell, C. and Rodenas-Ruiz, P. and Hrynenko, O. and Cavallaro, A.},
  year = {2025},
  month = oct,
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM '25), Dublin, Ireland},
  eprint = {2508.20546},
  archivePrefix = {arXiv},
  doi = {10.1145/3746027.3754558},
  url = {https://arxiv.org/abs/2508.20546},
  note = {Equal contribution: B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz.},
  abstract = {MM-HSD is a multi-modal model for hate-speech detection in videos that integrates video frames, audio, and text (from speech transcripts and on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for this task, systematically comparing query/key configurations across modalities. On-screen text works best as the query, and on the HateMM dataset MM-HSD reaches state-of-the-art performance with a macro-F1 (M-F1) score of 0.874.}
}