Hashformers: Hashtag Segmentation Applications in Abusive Language Detection

Ruan Chaves Rodrigues
May 20, 2023

Abusive language detection, a critical area of modern NLP research, often suffers from models that generalize poorly across datasets. To investigate this problem, Nina Seemann and other researchers from the CODE Research Institute for Cyber Defence at the Bundeswehr University Munich recently published two papers: “Generalizability of Abusive Language Detection Models on Homogeneous German Datasets” and “The problem of varying annotations to identify abusive language in social media content”.

This research takes a fresh approach to improving model generalization across homogeneous datasets. As part of their methodology, the authors used the hashformers library to segment hashtags, that is, to split a hashtag into its component words.

The hashformers library is open source and available on GitHub. It comes with a Google Colab tutorial and a Hugging Face Space where you can try out some of its hashtag segmentation models.

Why Hashformers?

The hashformers library can segment hashtags in any language for which a GPT-2 model is available on the Hugging Face Model Hub. This allowed the research team to preprocess the German datasets used in their work on abusive language detection. Better preprocessing often leads to better machine learning results.
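To make the task concrete, here is a minimal toy sketch of hashtag segmentation as a preprocessing step. It is not the hashformers implementation and does not use its API: the vocabulary, its counts, the unknown-word penalty, and the function names are all made up for illustration, and a simple word-frequency score stands in for the language model score that a GPT-2 based segmenter would provide.

```python
import math
import re

# Hypothetical toy vocabulary with made-up counts. A real system would score
# candidate segmentations with a language model such as GPT-2 instead.
TOY_COUNTS = {"wir": 50, "sind": 40, "mehr": 30, "kein": 20, "hass": 10}
TOTAL = sum(TOY_COUNTS.values())


def word_score(word: str) -> float:
    """Log-probability of a known word; unknown words get a heavy penalty."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    return -10.0 * len(word)


def segment(hashtag: str) -> str:
    """Return the highest-scoring split of a hashtag into lowercase words."""
    text = hashtag.lstrip("#").lower()
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            candidate = text[start:end]
            score = best[start][0] + word_score(candidate)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [candidate])
    return " ".join(best[-1][1])


def expand_hashtags(post: str) -> str:
    """Replace every hashtag in a post with its segmented form."""
    return re.sub(r"#\w+", lambda match: segment(match.group()), post)


print(segment("#wirsindmehr"))  # wir sind mehr
print(expand_hashtags("Heute demonstrieren: #keinhass"))
```

After this step, a downstream classifier sees ordinary words instead of one opaque token per hashtag, which is the kind of preprocessing gain the paragraph above refers to.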

Ultimately, the researchers were able to conclude that generalizability depends solely on the combinations of training sets and remains consistent regardless of the underlying method used.

Other Use Cases

This is not the first time that the hashformers library has been used in scientific research. It was previously used in a paper published at the LREC 2022 conference by Prashant Kodali and other researchers from Indian Institutes of Technology: “HashSet — A Dataset For Hashtag Segmentation”.

In that paper, the authors describe a novel dataset that features many hashtags in Indian English. They evaluated the hashtag segmentation capabilities of several models and found that the hashformers library was up to 25% more accurate than the other open-source alternatives available for the task.

Conclusion

By integrating the hashformers library into their methodology, the researchers at the CODE Research Institute were able to address the problem of generalizability across German-language datasets. This use case shows that the library can handle language-dependent preprocessing challenges in machine learning, and it points to broader applications beyond English.

Ruan Chaves Rodrigues

Machine Learning Engineer. MSc student at the EMLCT programme. Personal website: https://ruanchaves.github.io/