Let's say that you are trying to train a model using labelled data. For your model to give accurate results, the training dataset must be labelled properly. The challenge in our research project is that the labelling process is especially crucial, as we ask the annotators to mark subjective responses. When working with a pool of annotators, we need to make sure everyone is on the same ground.
To achieve this, we conducted a workshop on 01/01/2021 (what a good way to start a new year!) for all 30 annotators of the project, covering the following:
What is hate speech?
How to identify hate speech?
How to distinguish the severity levels of hate speech as stated in UN policy documents.
The slide set used during the workshop is given in link
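One common way to check whether annotators really are "on the same ground" after such training is to measure inter-annotator agreement, for example with Cohen's kappa. The sketch below is illustrative only (the severity labels and data are hypothetical, not from our dataset):

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical severity labels from two annotators on six posts.
a = ["none", "offensive", "hate", "none", "hate", "offensive"]
b = ["none", "offensive", "hate", "offensive", "hate", "offensive"]
print(round(cohen_kappa(a, b), 3))  # → 0.75
```

A kappa close to 1 indicates the annotators apply the guidelines consistently; a low value suggests the training or the annotation guidelines need revisiting.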
This blog post is a summary of the research work we recently published as a poster.
With the rapid growth of social media use, the number of user-generated posts keeps growing exponentially. Social media platforms find it challenging to moderate all these posts before they reach a wider audience, as the posts are written in multiple languages and use different forms of multimedia. One type of content that social media platforms find particularly difficult to detect is hate speech written in local languages such as Sinhala or Singlish. Contextual knowledge, linguistic expertise, and social and cultural insights must all be considered to identify hate speech accurately, and social media platforms lack moderators with this knowledge.
Research is being carried out on detecting hate speech in English social media content using machine learning algorithms, with the help of crowdsourcing platforms to label and annotate data. However, a problem remains that needs further research: common crowdsourcing platforms such as Amazon Mechanical Turk do not recruit workers from Sri Lanka with Sinhala literacy, so the data cannot be labelled through them. Following this necessity, in this research we propose a suitable crowdsourcing approach to label and annotate social media content and to generate corpora of words and phrases, so that machine learning algorithms can use the annotated dataset and corpus to identify hate speech. This paper therefore focuses on only one sub-area of the ongoing research: the mechanism used to get the data annotated, generate the corpus, and ensure the trustworthiness of the participants.
With a well-implemented crowdsourcing platform, it will be possible to find more nuanced patterns through human judgment and filtering, and to take preventive measures towards creating a better cyberspace.
Because of the COVID-19 pandemic, the SLAAS Annual Sessions were conducted as an online educational session. The recorded poster presentation is given in