MM-HSD: Multi-Modal Hate Speech Detection in Videos
B. Céspedes-Sarrias*, C. Collado-Capell*, P. Rodenas-Ruiz*, O. Hrynenko, A. Cavallaro
Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM '25), Dublin, Ireland, 2025 · Equal contribution: B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz.
arXiv:2508.20546 · doi:10.1145/3746027.3754558
Abstract
MM-HSD is a multi-modal model for hate-speech detection in videos that integrates video frames, audio, and text (from speech transcripts and on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for this task, systematically comparing query/key configurations across modalities. On-screen text works best as the query, and on the HateMM dataset MM-HSD reaches state-of-the-art performance with an M-F1 score of 0.874, outperforming previous approaches.
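As a rough illustration of what using one modality as the query and another as the key means, here is a minimal sketch of generic scaled dot-product cross-attention in NumPy. It is not the MM-HSD implementation: the function name `cross_modal_attention`, the feature dimensions, the random projection matrices, and the choice of audio as the key modality are all assumptions made for this example. Only the idea of on-screen text acting as the query comes from the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(query_feats, key_feats, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper).

    query_feats: (n_q, d_query) features from the query modality
    key_feats:   (n_k, d_key) features from the key/value modality
    w_q, w_k, w_v: projections into a shared d-dimensional space
    """
    q = query_feats @ w_q                     # (n_q, d)
    k = key_feats @ w_k                       # (n_k, d)
    v = key_feats @ w_v                       # (n_k, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n_q, n_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # (n_q, d), (n_q, n_k)


# Toy inputs: all sizes and values are assumptions, not from the paper.
rng = np.random.default_rng(0)
d_text, d_audio, d = 16, 12, 8
onscreen_text = rng.normal(size=(5, d_text))  # query modality
audio = rng.normal(size=(7, d_audio))         # key/value modality

w_q = rng.normal(size=(d_text, d))
w_k = rng.normal(size=(d_audio, d))
w_v = rng.normal(size=(d_audio, d))

attended, weights = cross_modal_attention(onscreen_text, audio, w_q, w_k, w_v)
```

In this sketch, `attended` holds one audio-informed vector per on-screen text token. Swapping the roles of the two inputs gives a different query/key configuration, which is the kind of comparison the abstract describes.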
Cite (BibTeX)
@inproceedings{cspedessarrias2025mmhsd,
title = {{MM-HSD: Multi-Modal Hate Speech Detection in Videos}},
author = {Céspedes-Sarrias, B. and Collado-Capell, C. and Rodenas-Ruiz, P. and Hrynenko, O. and Cavallaro, A.},
year = {2025},
month = oct,
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM '25), Dublin, Ireland},
eprint = {2508.20546},
archivePrefix = {arXiv},
doi = {10.1145/3746027.3754558},
url = {https://arxiv.org/abs/2508.20546},
note = {Equal contribution: B. Céspedes-Sarrias, C. Collado-Capell, P. Rodenas-Ruiz.},
abstract = {MM-HSD is a multi-modal model for hate-speech detection in videos that integrates video frames, audio, and text (from speech transcripts and on-screen text) together with features extracted by Cross-Modal Attention (CMA). We are the first to use CMA as an early feature extractor for this task, systematically comparing query/key configurations across modalities. On-screen text works best as the query, and on the HateMM dataset MM-HSD reaches state-of-the-art performance with an M-F1 score of 0.874, outperforming previous approaches.}
}