CyberDTD: A Multimodal Benchmark Dataset for Cyberbullying Detection in Tunisian Dialect

AUTHORS

Bechir Sahar Ben
Mekki Asma
Badache Ismail
Ellouze Mariem
Belguith Lamia Hadrich

KEYWORDS

Natural Language Processing (NLP)

Cyberbullying

Tunisian Dialect

Multimodal Dataset

Document type

Conference papers

Abstract

Effective detection of cyberbullying requires understanding both textual and visual signals, including images with embedded text and user-generated comments. This need is even more evident in low-resource and multilingual environments such as Tunisia. In this context, this paper introduces CyberDTD (Cyberbullying Detection in Tunisian Dialect), a multimodal dataset designed to support research on cyberbullying detection in the Tunisian Dialect (TD). The dataset contains 10,802 images across five categories: humor, sarcasm, hate, violence, and neutral. To the best of our knowledge, it is the first cyberbullying dataset in TD. We provide a comprehensive description of the dataset, which covers a wide range of online harassment while also including neutral examples for balanced analysis. Key challenges such as class imbalance, multimodality, and cultural specificity are highlighted. CyberDTD represents an important resource for building and evaluating machine learning models in low-resource settings, supporting the development of more robust and culturally aware cyberbullying detection systems.
