DEVELOPMENT OF COMPUTATIONAL LINGUISTIC RESOURCES FOR AUTOMATED DETECTION OF TEXTUAL CYBERBULLYING THREATS IN ROMAN URDU LANGUAGE

Automatic cyberbullying detection remains a very challenging task because social media content and conversations are usually posted as unstructured free text that ignores language norms. A major gap in formulating cyberbullying detection strategies is the scarcity of linguistic resources, particularly for newly evolved languages. Roman Urdu has emerged recently and is therefore a resource-poor language. Urdu is widely known as the national language of Pakistan. However, because of socio-cultural and multilingual factors, Roman Urdu is widely used on the Internet by Asians and, more specifically, Pakistanis. To fill this gap, this research work presents guidelines for the data annotation process and develops two linguistic resources. (i) An annotated corpus in Roman Urdu for cyberaggression and offensive language detection. Data annotation involved bilingual annotators instead of crowdsourcing, which has the benefit of correctly annotating instances that constitute clear cases of cyberbullying without compromising data quality. The developed corpus is highly balanced (with almost negligible skew), unlike most existing corpora, even those in mature languages. (ii) A list of domain-specific stop words. Processing textual information for NLP tasks involves stop-word elimination as a sub-phase; stop words carry the least semantic information and enlarge the feature space compared with the other tokens and index terms in corpora. We have developed domain-specific stop words for Roman Urdu that account for all lexical variants, specifically in the context of aggression detection and the collected data. The work was carried out using the Python programming language and the PyCharm IDE.


INTRODUCTION
The rapid advancement in technology and the compelling needs of users have made the Internet, and social networking sites (SNSs) in particular, an integral part of everyone's life, resulting in a huge amount of user-generated content, also known as big social media data. The escalation of social media has completely changed the way in which people view, create, or share information and ideas (Namdeo et al., 2017).
Undeniably, Web 2.0 plays a vital role in communication, relationships, and collaboration in today's society. Communities of different age groups (children, youngsters, and adults) interact with each other anytime and anywhere, in diverse ways (e.g., via laptops, smartphones, and tablets) and across a wide range of social networking platforms. The benefits of digital communication are evident, and most users' internet usage is harmless; however, anonymity and freedom of speech often lead young people to be offensive and leave them vulnerable, giving rise to an alarming threat, i.e., cyberbullying, cyberaggression, or hate speech (Van Hee et al., 2018). People, typically youngsters, have reported disturbing and distressing experiences, which has drawn the attention of researchers and made cyberbullying and its automatic detection a growing community need and a promising area of Natural Language Processing (NLP) (Huang et al., 2018).
Several studies show that formulating computational strategies for cyberaggression detection is extremely challenging. One of the major challenges is the scarcity of the required resources, particularly for newly emerged languages. Moreover, most of the datasets used for cyberbullying detection, even in mature languages, exhibit an extreme skew between hate speech and non-hate speech content (Emmery et al., 2020). This leads to inappropriate strategies, unreliable predictive performance (specifically for the minority class), and greater sensitivity to classification errors.
With the advent of Unicode encoding, Urdu language content written in Roman script is growing rapidly on social networking sites. Roman Urdu is a non-normative language: its written form does not follow any rigid set of grammatical rules or spelling standards. Survey statistics in Shahroz et al. (2020) indicate that about 300 million people speak Urdu and that approximately 11 million speakers are in Pakistan, most of whom have switched to Roman Urdu for textual communication, typically on social media. It is a linguistically rich and morphologically complex language.
Urdu orthography (aka imla) bears a resemblance to that of the Turkish, Arabic, and Persian languages. Moreover, it uses the cursive Arabic script in the Nastaliq writing style (Syed et al., 2010). Roman Urdu, in contrast, uses Roman script.
An example instance of Roman Urdu script and its equivalent Urdu and English scripts is depicted in the accompanying figure. Regardless of its huge prevalence worldwide (and more specifically in South Asia), Roman Urdu is an under-resourced language. Linguistic resources for Asian languages are the focus of some conferences and journals, such as ACM Transactions on Asian and Low-Resource Language Information Processing and the Conference of Central Asian Languages and Linguistics ("Central Asian Languages and Linguistics", n.d.), which support a vast number of NLP tasks related to phonology, morphology, named entity recognition (NER), language parsing, and word segmentation.
To support the development of NLP applications for Roman Urdu, particularly in the field of cyberaggression and hate speech, this paper presents annotation guidelines, the first-ever highly balanced Roman Urdu dataset, and the development of domain-specific stop words using the Python language.
The rest of this paper is organized as follows: cyberaggression and existing resources are discussed in Section II. Section III sheds light on data extraction from the Twitter social media platform. The preparation of data annotation guidelines and the kappa weighing scheme are given in Sections IV and V, respectively. Section VI discusses stop-word development. Finally, Section VII concludes the research work.

RELATED WORK
Researchers have widely used Natural Language Processing (NLP) and Machine Learning (ML) techniques to solve a variety of tasks based on unstructured text data (e.g., topic identification, opinion mining, document summarization, and text translation), but their applicability to the automatic detection of cyber-crime-related problems is relatively new and has encountered many challenges (Rosa et al., 2019).
The availability of appropriate data, the large data skew caused by the naturally uneven distribution of hate speech content on social media, and the scarcity of NLP resources are among the most significant issues in research on cyberbullying detection (Mahlangu et al., 2018; Gencoglu, 2020). A handful of studies have developed resources and cyberbullying detection strategies in different languages worldwide. In most studies, hateful instances make up only 2 to 5% of the data (Emmery et al., 2020).
Sprugnoli et al. (2018) developed a dataset from WhatsApp chats to study offensive language among Italian students. They also presented an annotation scheme and user roles. Fersini et al. (2018) collected misogynous and hateful tweets using a combined approach: they monitored prospective victims of hate accounts, downloaded the history of identified haters, and filtered the Twitter stream via keywords.
Fišer et al. (2017) developed a corpus from an online platform that collects spontaneous reports by internet users of any material containing child sexual abuse, a special category of cyber-aggressiveness. Validation of the corpus by experts revealed that only 3% was illegal content and more than 40% was non-disturbing content. An Indonesian hate speech corpus was contributed by Ibrohim and Budi (2018). That work used the Twitter platform, crowdsourced annotation, and a multi-level scheme to identify hate speech and non-hate speech categories along with their intensity levels. Bohra et al. (2018) present a dataset of Hindi-English code-mixed data. The tweets are annotated with the language at word level and with the class they belong to (Hate Speech or Normal Speech).
Van Bruwaene et al. (2020) formed an English-language dataset using multiple platforms from SafeToNet's VISR-branded child safety app for adolescents. In collaboration with expert annotators, they used crowdsourcing and machine learning techniques to enlarge the corpus and handle skew in an iterative manner. The work by Özel et al. (2017) is the first study performed in the Turkish language. It contributed a Turkish corpus prepared from the Instagram and Twitter social media platforms and also reported experiments using machine learning techniques.
Undeniably, English is the de facto common language among researchers at the international level; hence, as highlighted by a review study (Poletto et al., 2020), the greater number of computational resources are English corpora and datasets. Nevertheless, several other languages are represented too, which is immensely significant for an international community that seeks to address the worldwide social issue of cyberbullying and hate speech spread in many languages. Roman Urdu has become a contemporary trend as a language of communication for Pakistani, or more generally Asian, youth. To the best of our knowledge, this is the first study to develop a computational linguistic resource in Roman Urdu for cyberaggression. This research study presents our approach for collecting and annotating social media data to develop a cyberbullying corpus in Roman Urdu, together with domain-specific stop words. Because the natural distribution of social media data is heavily skewed, which results in a scarcity of bullying instances for training, data extraction was carried out as a multi-phase process to ensure high-quality data with minimum skew. The corpus encompasses a vast range of content inciting hatred. The content also covers a wide range of bullying grounds such as race, ethnic origin, religious affiliation, sexual orientation, caste, gender, identity, serious disease or disability, and intelligence. Moreover, this work used different weighting schemes for automatic identification, together with manual input from bilingual expert annotators, to develop stop words for the cyberbullying detection problem in Roman Urdu.

DATA EXTRACTION
Twitter is one of the most popular microblogging services, with 316 million monthly active users. Compared with other social media platforms, Twitter has attracted more academic researchers because it makes its data available for research purposes via an Application Programming Interface (API) (Ahmed et al., 2017). To develop the cyberbullying corpus, data was scraped from Twitter using the Python language, Tweepy, and the Twitter streaming API in multiple phases over a period of 3 months, as depicted in Figure 2. The reasons were twofold: (i) restrictions on data access imposed on the standard API, and (ii) the highly skewed natural distribution of content. Currently, no language code is available for Roman Urdu in the API, so the queries for data collection were first formed from geo-location information, taking coordinates from Google Maps for areas in Pakistan where a high concentration of Roman Urdu content was expected. Secondly, we extracted tweets based on insulting seed terms or curse words typically used for bullying in Roman Urdu. Thirdly, we used hashtags for aggression and trash talking on recent topics from regions in Pakistan. A substantial number of tweets were in English, and such content was filtered out, leaving 3K tweets. In order to retain the writing patterns of Roman Urdu users on social media, data with inherent English words (such as batting, topic, character, bowling, follow, design, ok, yes, no, music, video, free, hope, player, code, development, etc.) was retained.
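The filtering of English-only tweets while keeping mixed Roman Urdu content can be illustrated with a minimal Python sketch. The paper does not state how this filtering was implemented, so everything here is an assumption: the tiny `ENGLISH_WORDS` vocabulary, the function names `english_ratio` and `drop_english_only`, the threshold, and the sample sentences are hypothetical and serve only to show the idea.

```python
import re

# Hypothetical, tiny English vocabulary used only for illustration; a real
# filter would need a much larger word list or a language identifier.
ENGLISH_WORDS = {"the", "is", "a", "this", "was", "great", "match",
                 "topic", "ok", "yes", "no"}

TOKEN_RE = re.compile(r"[a-z']+")


def english_ratio(text, vocabulary=ENGLISH_WORDS):
    """Return the fraction of alphabetic tokens found in the English vocabulary."""
    tokens = TOKEN_RE.findall(text.lower())
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def drop_english_only(tweets, threshold=1.0):
    """Keep tweets whose English ratio is below the threshold.

    With threshold=1.0, only tweets made up entirely of known English
    words are dropped; tweets that merely contain some English words
    (e.g. "topic", "match") are kept.
    """
    return [tweet for tweet in tweets if english_ratio(tweet) < threshold]


# Illustrative (invented) inputs: one English tweet, one mixed Roman Urdu tweet.
sample = ["this is a great match", "yeh topic bohat acha hai"]
kept = drop_english_only(sample)
```

In this sketch the mixed tweet survives because only one of its five tokens is in the English vocabulary, which mirrors the stated goal of retaining Roman Urdu text with inherent English words.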

DATA ANNOTATION PROCESS
Data annotation is indeed a Human Intelligence Task (HIT). Crowdsourcing has obvious organizational advantages, especially for a time-consuming task such as the annotation of textual data, but annotation quality may be compromised by employing non-expert annotators, particularly for recently evolved languages and for a challenging task like cyberaggression detection (Schmidt & Wiegand, 2017). Moreover, many studies have found that a non-trivial percentage of the data collected on MTurk is "dubious", produced either by "non-respondents" (bots instead of humans) or by non-serious respondents (Dreyfuss et al., 2018; Ahler et al., 2019).
Instead of crowdsourcing, the data was annotated by linguistic experts with bilingual expertise (good knowledge of the Nastaliq script and of Roman Urdu patterns). Annotators were provided with guidelines on how to label social media documents for bullying. The main task of each annotator was to label each sample with one of three possible labels:
0 - the text certainly does not contain any form of online violence, hate speech or abusive language.
1 - the text certainly contains some form of online violence, hate speech or abusive language.
2 - indeterminate (doubtful) case, when the text cannot be identified with good certainty as either containing or not containing any form of online violence, hate speech or abusive language.
To provide the annotators with some context and preserve the original writing intentions, all posts were presented in their original form wherever possible, i.e., before applying major text preprocessing techniques. Social media data is considered microtext, which is extremely noisy. Deep comprehension of microtext is immensely important for effective understanding and further processing. Hence, preliminary text preprocessing was applied to handle Unicode characters, punctuation, hashtags, emojis, URLs, case conversion, date and time data, insignificant string literals, and @ user mentions. The preprocessing results are given in Figure 4. To avoid incorrect annotation, the linguistic experts were encouraged to use "2" whenever they had doubts about whether an instance contains any form of aggressive speech. An example of such doubtful instances is given in Figure 5. Because of the huge size of the dataset, the whole annotation process was split into multiple phases. The entire annotation process was conducted by linguistic experts according to the developed guidelines and categories. Based on these phases, the final labels were determined using the weighing scheme described in Section V. The main concern of the annotators was to analyze which types of phenomena in Roman Urdu can be considered cyberbullying or hate speech (e.g., attacks on personality characteristics, threats, curses, blackmail) or profanity, abusive language, and aggressiveness that might harm an individual's physical or mental health, lower self-esteem, or hurt feelings. Sample data instances for each category are given in Table 1.
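The preliminary preprocessing described above can be sketched in Python with regular expressions. This is a minimal illustration under assumptions, not the authors' pipeline: the exact rules and their order are not given in the paper, the function name `preprocess` and the patterns are hypothetical, and date/time handling and removal of insignificant string literals are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web links
MENTION_RE = re.compile(r"@\w+")                # @ user mentions
HASHTAG_RE = re.compile(r"#(\w+)")              # hashtags (word is captured)
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # punctuation and symbols
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace


def preprocess(text):
    """Apply a simple, assumed cleaning sequence to one post."""
    text = text.lower()                          # case conversion
    text = URL_RE.sub(" ", text)                 # drop URLs
    text = MENTION_RE.sub(" ", text)             # drop @ user mentions
    text = HASHTAG_RE.sub(r"\1", text)           # keep hashtag word, drop '#'
    # Drop emojis and other non-ASCII characters (assumption: Roman Urdu
    # text itself is written with ASCII letters).
    text = text.encode("ascii", "ignore").decode("ascii")
    text = NON_ALNUM_RE.sub(" ", text)           # drop punctuation
    return SPACE_RE.sub(" ", text).strip()       # normalize whitespace


# Invented example post, used only to exercise the function.
cleaned = preprocess("@user Yeh MATCH bohat acha tha!!! \U0001F600 #Cricket https://t.co/abc")
```

URLs are removed before mentions so that an "@" inside a link is not treated as a user mention; other orderings are possible and would need to be checked against real data.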

WEIGHING SCHEME
Since annotation is a subjective process, different annotators might judge the same textual comment slightly differently. To eliminate dubious data, ensure data quality, and further scrutinize and validate the dataset, Inter-Rater Reliability (IRR/IRA) was estimated using Cohen's kappa coefficient (κ) (definition 5.1). Cohen's kappa coefficient has been accepted as the de facto standard for measuring inter-annotator agreement. A threshold or cutoff value is set empirically on the kappa score. Usually, a kappa score of 0.67 is used as the cutoff in computational linguistics (Wang et al., 2019).

DEFINITION AND COMPUTATION OF COHEN'S KAPPA
Cohen's kappa is a function of p_o, the relative observed agreement, and p_e, the expected hypothetical agreement by chance. Mathematically it can be stated as in equation 1.
$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (1)$$
The chance agreement p_e is computed as
$$p_e = \frac{1}{N^2} \sum_{k} n_{k1}\, n_{k2} \qquad (3)$$
where N indicates the number of items, k ranges over the categories, n_{k1} is the number of times rater 1 selected category k, and n_{k2} is the number of times rater 2 selected category k.
The kappa score was computed using Statistical Software (n.d.). The standard error and 95% confidence interval are calculated according to Fleiss et al. (2013). The weighted kappa score as shown in Table 2 is significantly higher than the cutoff value.
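As an illustration of how Cohen's kappa follows from the observed agreement p_o and the chance agreement p_e, a minimal Python sketch is given here. It computes the unweighted coefficient for two raters and is not the statistical software used in this work, which reports a weighted score; the function name `cohen_kappa` and the label sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater1)
    # p_o: fraction of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # p_e: (1 / N^2) * sum over categories k of n_k1 * n_k2.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = counts1.keys() | counts2.keys()
    p_e = sum(counts1[k] * counts2[k] for k in categories) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Invented labels (0, 1, 2 as in the annotation scheme) for ten items.
labels_a = [0, 0, 1, 1, 2, 0, 1, 1, 0, 1]
labels_b = [0, 0, 1, 1, 2, 0, 1, 0, 0, 1]
kappa = cohen_kappa(labels_a, labels_b)
```

For these invented labels the raters agree on 9 of 10 items, so p_o = 0.9 and p_e = (4*5 + 5*4 + 1*1) / 100 = 0.41, giving κ = 0.49 / 0.59, which is about 0.83 and above the 0.67 cutoff mentioned earlier.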

IDENTIFICATION OF DOMAIN SPECIFIC STOP WORDS
In this era of big data and information retrieval, process optimization for text and data analytics systems is immensely significant. To achieve accuracy, it is therefore important to identify and filter out terms with minimal or no semantic meaning. Stop-word lists have been developed for many languages, such as English, Italian, Chinese, Arabic, Punjabi, and Hindi (Kaur & Saini, 2015, 2016; Hao & Hao, 2008; Alajmi et al., 2012), and are also part of NLTK, spaCy, and gensim.
Stop words vary not only from corpus to corpus and from language to language, but also from one problem domain to another. For example, in a corpus of news articles, which contain crucial time-sensitive and location-sensitive information, eliminating terms like "here" and "today" would affect the results of the related NLP application. This is because news articles link and relate a current event to similar events that happened in the past or at another location.
In this work, we have identified stop words from the developed corpus, specific to the domain of cyberbullying and hate speech, using statistical methods and human evaluation, i.e., direct Term Frequency (TF), Inverse Document Frequency (IDF), and the Term Frequency-Inverse Document Frequency (TF-IDF) weighting model, followed by human evaluation by bilingual experts. The methods are described below.
Let tf(t, d) denote the direct frequency of term t in document d. Mathematically, it can be defined as:

$$tf(t, d) = \frac{f_d(t)}{\max_{w \in d} f_d(w)}$$
where f_d(t) is the number of occurrences of term t in document d, and the denominator is the count of the most frequent word w in d. The term frequency metric highlights how common a term is within a collection of documents. Semantically least significant terms are expected to have high term frequency. Thirty samples, based on term frequencies, are depicted in Figure 6.
Figure 6. Samples based on term frequencies (n=30). Source: own elaboration.
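The term frequency definition above, in which a term's count is divided by the count of the most frequent word in the same document, can be sketched in Python as follows. The function name `term_frequencies`, the whitespace tokenization, and the sample sentence are assumptions made for illustration only.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return tf(t, d) = f_d(t) / max_w f_d(w) for every term in one document."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())   # count of the most frequent word
    return {term: count / max_count for term, count in counts.items()}


# Invented, whitespace-tokenized document.
document = "yeh match bohat acha tha yeh team bohat bohat".split()
tf = term_frequencies(document)
```

With this normalization the most frequent term of a document always gets a score of 1.0, so scores are comparable across documents of different lengths.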
Inverse document frequency, idf, can be defined mathematically as:
$$idf(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$
where D is the set of documents in the corpus and the denominator is the number of documents that contain term t. This metric is very significant since it penalizes the more frequently occurring terms and favors those occurring in only a few documents. The lower bound of this metric is 0, which corresponds to terms that appear in every single document in the corpus. Feature names sorted by their idf weights in ascending order are given in Figure 7.
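A minimal Python sketch of idf-based stop-word candidate selection is given here. It assumes the unsmoothed form idf(t) = log(N / df(t)), where N is the number of documents and df(t) the number of documents containing t; this form is consistent with the stated lower bound of 0, but library implementations often add smoothing and may differ. The function names, the `max_idf` cutoff, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequencies(documents):
    """Return idf(t) = log(N / df(t)) for every term in tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))   # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def stop_word_candidates(documents, max_idf):
    """Return terms with idf <= max_idf, sorted by idf in ascending order."""
    idf = inverse_document_frequencies(documents)
    selected = [term for term, weight in idf.items() if weight <= max_idf]
    return sorted(selected, key=lambda term: (idf[term], term))


# Invented, tokenized documents.
docs = [["yeh", "match", "acha"], ["yeh", "team", "acha"], ["yeh", "khel"]]
idf = inverse_document_frequencies(docs)
candidates = stop_word_candidates(docs, max_idf=0.5)
```

Terms at the low end of the idf ranking are only candidates; in line with the approach described in this section, such a list would still be reviewed by bilingual experts before being accepted as domain-specific stop words.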

CONCLUSIONS
Social networking sites have become a communal breeding ground for youth to aggress against one another.
During the pandemic, traffic in cyberspace increased significantly. Recent studies and published news reports highlight a great surge in the number of cyberbullying and harassment cases during the pandemic. This paper makes novel contributions and achieves an important landmark for NLP on the Roman Urdu language, which has recently been embraced on social media by youth. The corpus was developed for cyberbullying and hate speech by collecting data for over 3 months to obtain high-quality data with less skew. A well-defined set of data annotation guidelines was prepared and provided to experts for annotation. Since annotation is a subjective process, Inter-Rater Reliability (IRR) based on Cohen's kappa coefficient (κ) was estimated using the statistical software MedCalc for further scrutiny and validation of the corpus. Finally, the work developed an automatic stop-word identification strategy using statistical methods and a weighting model. Moreover, manual input from linguistic experts was taken to form a comprehensive list specific to the cyberbullying corpus. In future, we plan to conduct experiments using deep learning approaches and hand-engineered feature sets. As the field of cyberaggression and cyberbullying detection is still emerging, this research should greatly benefit the NLP and machine learning research community.