Learning New Things - Crowdsourcing, Social Media Content Moderation and Development Tools: 2021

Thursday, January 7, 2021

Workshop for Annotators

Let's say that you are trying to train a model using labelled data. If your model is to give accurate results, the training dataset must be labelled properly. The challenge in our research project is that the labelling process is crucial, because we ask the annotators to make subjective judgments. If we work with a pool of annotators, we need to make sure everyone is on the same ground.

To achieve this, we conducted a workshop on 01/01/2021 (what a good way to start a new year) for all the annotators of the project. It had 30 participants and covered the following:

What is hate speech?
How to identify hate speech?
How to distinguish the severity levels of hate speech, as stated in UN policy documents.

The slide set used during the workshop is given in this link.

The recording of the workshop is available in this link.

Hate Speech Corpus Generation using Crowdsourcing

This blog post is an abstract of the research work published recently as a poster.

With the rapid growth of social media use, the number of user-generated posts keeps growing exponentially. Social media platforms find it challenging to moderate all these posts before they reach a wider audience, because the posts are written in multiple languages and use different forms of multimedia. One type of content that platforms find difficult to detect is hate speech written in local languages such as Sinhala or Singlish. Identifying hate speech accurately requires contextual and linguistic expertise together with social and cultural insight, and the platforms lack moderators with this knowledge. Research is being carried out on detecting hate speech in English social media content using machine learning algorithms, with the help of crowdsourcing platforms to label and annotate data. A problem remains, however, and it needs further research: common crowdsourcing platforms such as Amazon Mechanical Turk do not recruit workers from Sri Lanka who are literate in Sinhala, so they cannot be used to get this data labelled. To address this need, we propose a suitable crowdsourcing approach to label and annotate social media content and to generate corpora of words and phrases, so that machine learning algorithms can use the annotated dataset and corpus to identify hate speech. This research paper therefore focuses on only one sub-area of the ongoing research: the mechanism used to get data annotated for hate speech identification, corpus generation, and ensuring the trustworthiness of the participants. With a well-implemented crowdsourcing platform, it will be possible to find more nuanced patterns through human judgment and filtering, and to take preventive measures to create a better cyberspace.
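To make the idea of combining several annotators' judgments concrete, here is a minimal sketch in Python. It is not the method used in the poster. It assumes a simple majority vote, and the function name `aggregate_labels`, the label names, the post ids and the `min_agreement` threshold are all hypothetical choices made for illustration.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.6):
    """Combine labels from several annotators by majority vote.

    annotations: dict mapping a post id to the list of labels it received.
    min_agreement: hypothetical threshold; posts whose winning label has a
    smaller share of the votes are flagged for manual review.
    """
    results = {}
    for post_id, labels in annotations.items():
        if not labels:
            continue  # nothing to aggregate for this post
        # most_common(1) returns the label with the most votes; on a tie,
        # the label that appeared first in the list wins.
        label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        results[post_id] = {
            "label": label,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,
        }
    return results


# Made-up example data: three posts labelled by two or three annotators.
sample = {
    "post-1": ["hate", "hate", "not_hate"],
    "post-2": ["not_hate", "not_hate", "not_hate"],
    "post-3": ["hate", "not_hate"],
}

for post_id, summary in aggregate_labels(sample).items():
    print(post_id, summary)
```

Posts where the annotators disagree strongly, such as the tied post in the sample, are flagged rather than silently labelled, which is one simple way to keep subjective cases visible for a second look.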

Because of the COVID-19 pandemic, the SLAAS Annual Sessions were conducted as an online educational session. The recorded poster presentation is given in

You can find the poster here: Link to access the poster