A LEXICON BASED APPROACH TOWARDS CONCEPT EXTRACTION

The emergence of digital media has tremendously increased the amount of unstructured data. Recently, 80% of the data generated over the web has been in an unstructured format. This immense amount of data is a great source for knowledge discovery and may thus be utilized for extracting purposeful information. This study adopted a lexicon-based approach for automatic concept extraction from online news stories and events. An application prototype has been developed to demonstrate the applicability and effectiveness of the adopted approach. The extracted knowledge about news stories, articles and blogs is essential for news analysts who need in-depth information. This knowledge plays a vital role in building societies, since media is considered an opinion maker for its audience.


INTRODUCTION
The digital age has provided an immense amount of data in the form of news articles, social media data, and other web data (LaValle, Lesser, Shockley, Hopkins & Kruschwitz, 2014; Gharehchopogh & Khalifelu, 2011). Every day, a large amount of data is published on news websites, micro-blogging websites and other information repositories (Lei, Rao, Li, Quan & Wenyin, 2014). The published news articles reveal the events happening around the world (Lei, et al., 2014). The challenge, specifically for textual data such as news articles, is to extract purposeful information. Interpreting a large collection of data manually is a hard task (Lee, Park, Kim & No, 2013). Moreover, information hidden in unstructured data is inherently difficult to process, because doing so involves natural language processing. Therefore, in the current era of information flow, media analysts and other researchers need an easily understandable, high-level summary of information. For instance, a media analyst may need to search for news on a certain topic, for events happening at a certain geo-location, and/or for news events along a timeline. Answering these and other such queries requires an efficient method.
Text analytics allows knowledge discovery and the purposeful finding of information in such a massive amount of data. The extracted knowledge can be used for better decision-making strategies and effective resource management. Extracting purposeful knowledge from large volumes of natural language data is nevertheless an open challenge, which requires sophisticated methods and algorithms. To this aim, this research study extracts concepts from a large number of news stories and articles. A concept refers to a meaningful sequence of words used to represent objects, events, activities, entities (real or imaginary), topics or ideas that are of interest to users (Parameswaran, Garcia-Molina & Rajaraman, 2010; Szwed, 2015). Concept extraction is an effective way of obtaining the useful and meaningful concepts from text documents. The extracted concepts may later be tagged as essential concepts and represented in an efficient form (Zhang, Mukherjee & Soetarman, 2013). The concepts, in particular, convey an understanding of the unstructured data. The coverage and patterns of such concepts help in gaining an in-depth understanding of news stories, news articles and the inclination of the author's mindset. This knowledge about news stories, articles and blogs is essential for news analysts and plays a vital role in building societies, because media acts as an opinion maker for the members of society.
An application prototype has been developed in this study to demonstrate automated concept extraction based on the lexicon approach. The machine-learning approach (i.e., supervised learning) faces inherent challenges with unstructured data, whereas the lexicon-based approach has produced comparatively better results. The developed prototype demonstrates the applicability and effectiveness of the considered approach. This paper is structured as follows: section 2 reports the existing scientific literature on concept extraction, and section 3 describes the architecture of the developed application. Results and discussion are reported in section 4, and section 5 concludes.

RELATED WORK
Concept extraction has remained a focus of the recent literature (Šilić, et al., 2012; Parameswaran, et al., 2010; Villalon, et al., 2009; Weichselbraun, et al., 2013; Termehchy, et al., 2014; Brin, 1998; Mahmood, et al., 2018). In particular, concept extraction in the context of online news has become a topic of interest. For instance, social emotions have been detected from news articles using a lexicon-based approach in Lei, et al. (2014). CatViz (Temporally-Sliced Correspondence Analysis Visualization) performs exploratory text analysis on large collections of textual data. CatViz is based on Correspondence Analysis (CA) and allows a visual analysis of different aspects of text data (Šilić, et al., 2012).
Extraction of concepts from a query log data repository has been carried out in Parameswaran, Garcia-Molina and Rajaraman (2010), where sub-concepts and super-concepts are pruned and the core concepts are retained on the basis of frequency and of how well they convey a meaning or idea. Similarly, automatic concept extraction from essays written by students, for the purpose of concept map mining, is reported in Villalon and Calvo (2009). The limitations faced by the machine-learning approach during model training have been addressed in Weichselbraun, Gindl and Scharl (2013). Two potentially efficient algorithms have been proposed in Termehchy, Vakilian, Chodpathumwan and Winslett (2014), namely: 1) Approximate Popularity Maximization (APM) and 2) Annotation-benefit Maximization (AAM). The patterns hidden in web documents have been explored in Brin (1998), where patterns are analyzed for concept determination.
The Dawn and The New York Times newspapers were the focus of Mahmood, Kausar and Khan (2018) for the purpose of textual analysis. This study also uses online news stories and events published by The Dawn as the data source in order to automatically extract concepts using a lexicon-based approach. The dictionaries used for understanding the meanings of terms and concepts are WordNet and DBpedia.

LEXICON BASED CONCEPT EXTRACTION APPROACH
An application prototype has been developed to extract key concepts from online news data. The prototype is developed using the C# (C sharp) programming language.
The application architecture of the developed prototype comprises three layers, as shown in Figure 1: Layer 1: Data Source, Layer 2: Middleware and Layer 3: News Mining. The purpose of each layer is reported in the subsequent sections.

LAYER 1: DATA SOURCE/PROVIDER
The Data Source/Provider layer crawls online news events and stories published on the official website of The Dawn newspaper. The application accepts the URL (Uniform Resource Locator) of any news website, although this study has focused on the news stories and articles of The Dawn. This layer traverses the given URL to crawl the news events and stories available on its webpages. The crawler uses existing APIs (Application Programming Interfaces) for traversing and retrieving data from the source website (The Dawn in this case).
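The traversal step of this layer can be illustrated with a minimal sketch. It is written in Python for brevity (the prototype itself is in C#), and the class and function names, the example host and the sample page are all hypothetical. It shows only how the links that stay on the source website might be collected from one already-downloaded page; fetching pages over HTTP and the crawler APIs actually used by the prototype are not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical page content; a real crawler would download this over HTTP.
SAMPLE_PAGE = """
<html><body>
  <a href="/news/1">Story one</a>
  <a href="/news/2">Story two</a>
  <a href="/news/1">Story one again</a>
  <a href="https://other.example.org/ad">External link</a>
</body></html>
"""

links = same_site_links("https://news.example.com/", SAMPLE_PAGE)
print(links)
```

Restricting the traversal to the host of the given URL is an assumption of this sketch; it keeps the crawl on the chosen newspaper rather than following advertisements or external references.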

LAYER 2: MIDDLEWARE
The Middleware layer takes the news stories and articles obtained by the first layer and parses them. In particular, an HTML (Hypertext Markup Language) parser and the DOM (Document Object Model) API have been used for processing the news stories. The parsed and processed data is stored in a relational database.
A relational database organizes data into tables that are logically related to each other. The news stories and articles comprise several tags, as represented in Figure 2.
The HTML parser is a library used for parsing text files formatted in HTML. Likewise, the DOM API, commonly accessed from JavaScript, provides an object representation of a webpage. The news stories and articles are provided as input to the third layer of the developed prototype application.
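As a rough illustration of this layer, the sketch below parses one story page and stores it in a relational table. It is in Python rather than the prototype's C#, and it makes several assumptions that do not come from the study: the headline is taken from an h1 tag and the body from p tags (the actual tags are those of Figure 2), and the table name, column names and sample story are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class StoryParser(HTMLParser):
    """Extracts the <h1> headline and <p> paragraphs of a story page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h1":
            self.title_parts.append(data)
        elif self._current == "p":
            self.paragraphs[-1] += data


def parse_story(html):
    """Return (title, body) extracted from the story's HTML."""
    parser = StoryParser()
    parser.feed(html)
    title = "".join(parser.title_parts).strip()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return title, body


def store_story(conn, url, html):
    """Parse a story and insert it once, keyed by its URL."""
    title, body = parse_story(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stories ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT, body TEXT)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO stories (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()


# Hypothetical story markup used only for this example.
SAMPLE_STORY = (
    "<html><body><h1>Budget approved</h1>"
    "<p>The assembly approved the budget.</p>"
    "<p>Debate lasted <b>two</b> days.</p></body></html>"
)

conn = sqlite3.connect(":memory:")
store_story(conn, "https://news.example.com/news/1", SAMPLE_STORY)
row = conn.execute("SELECT title, body FROM stories").fetchone()
print(row)
```

The unique constraint on the URL is a design choice of the sketch: it lets the crawler revisit a page without creating duplicate rows.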

LAYER 3: NEWS MINING
This is the key layer, which automatically extracts the concepts present in the collected news stories and articles. In particular, this layer comprises the Mining Manager, which performs the necessary text preprocessing steps to transform the collected and stored news stories and articles into a format suitable for further processing.
Mining Manager: it performs tokenization, stemming and stopword removal before the actual automatic concept extraction.
Tokenization: this operation breaks given textual data into its tokens (i.e., terms or words). For instance, consider a sentence 'This study aims at automatic concept extraction using lexicon-based approach.' The tokenization produces the following outcomes: 'This', '
Stopword Removal: this operation prunes unnecessary words present in the text. These unnecessary words are usually auxiliary verbs, grammatical articles and conjunctions, for example 'the', 'is', 'am', 'are', 'was', 'and', 'a', 'an' and many more.
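The two operations described above can be sketched as follows. This is an illustrative Python example, not the prototype's C# code: the regular expression, the decision to keep hyphenated words as single tokens, and the sample sentence are assumptions, and the stopword list contains only the examples named in the text. Stemming is not shown.

```python
import re

# Small illustrative stopword list taken from the examples in the text;
# a real system would use a much longer list.
STOPWORDS = {"the", "is", "am", "are", "was", "and", "a", "an"}


def tokenize(text):
    """Split text into word tokens, keeping hyphenated words together."""
    return re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text)


def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in STOPWORDS."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


# Hypothetical input sentence.
sentence = (
    "The crawler is fast and the parser was written "
    "for a lexicon-based approach."
)
tokens = tokenize(sentence)
kept = remove_stopwords(tokens)
print(tokens)
print(kept)
```

Comparison against the stopword list is done on the lowercase form, so a sentence-initial 'The' is removed as well.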
The processed tokens are then used as input for automatic concept extraction. The application uses the popular bag-of-words (BOW) model as the vector space representation of the processed tokens. The words that remain after stopword removal form the bag-of-words, and each word carries its frequency in a given news story or article. The BOW is supplied to the Concept Extraction module for determining concepts.
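A bag-of-words of this kind can be sketched in a few lines of Python. The token list and function names are hypothetical, and lowercasing before counting is an assumption of the sketch rather than a documented step of the prototype.

```python
from collections import Counter


def bag_of_words(tokens):
    """Count how often each (lowercased) token occurs in one document."""
    return Counter(t.lower() for t in tokens)


def to_vector(bow, vocabulary):
    """Project a bag-of-words onto a fixed, ordered vocabulary."""
    return [bow.get(term, 0) for term in vocabulary]


# Hypothetical tokens left after stopword removal.
tokens = ["budget", "assembly", "Budget", "debate", "budget", "assembly"]
bow = bag_of_words(tokens)
vocabulary = sorted(bow)
vector = to_vector(bow, vocabulary)
print(bow.most_common())
print(vocabulary, vector)
```

The vector form makes the "vector space" aspect explicit: every story is described by its term frequencies over the same ordered vocabulary.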
Concept Extraction: The concept extraction module determines the meaning of each term processed by the Mining Manager and whether the term constitutes a concept. As shown in Figure 3, the BOW is supplied to the concept extraction module, which is connected to the dictionaries WordNet, DBpedia and Linked Data in order to determine the meaning and concept for a given word of the BOW. Each word in the BOW undergoes the concept extraction process. The resulting concepts are later used for visualization. In particular, the frequency of the concepts is measured for a given article or news story. A word cloud is displayed for the concepts, and graphs represent the trends of the concepts present in the news.
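One possible reading of this lookup step is sketched below in Python. The small in-memory dictionary is a hypothetical stand-in for WordNet, DBpedia and Linked Data, and the rule that a term counts as a concept only if the dictionary has an entry for it is an assumption of the sketch; the study does not spell out the exact decision rule or the queries issued to the dictionaries.

```python
# Hypothetical stand-in for the external dictionaries: term -> short meaning.
LEXICON = {
    "budget": "a plan of expected income and spending",
    "assembly": "a body of elected representatives",
    "election": "a formal vote to choose office holders",
}


def extract_concepts(bow, lexicon):
    """Return (term, meaning, frequency) for every term found in the lexicon.

    Terms without a lexicon entry are treated as non-concepts and skipped.
    Results are ordered by descending frequency, then alphabetically.
    """
    found = [
        (term, lexicon[term], freq)
        for term, freq in bow.items()
        if term in lexicon
    ]
    found.sort(key=lambda item: (-item[2], item[0]))
    return found


# Hypothetical bag-of-words for one news story.
bow = {"budget": 3, "assembly": 2, "debate": 4, "yesterday": 1}
concepts = extract_concepts(bow, LEXICON)
for term, meaning, freq in concepts:
    print(f"{term} ({freq}): {meaning}")
```

The frequency attached to each concept is the quantity that a word cloud or trend graph would then scale by.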

RESULTS AND DISCUSSION
This section discusses the outcomes of the developed application prototype. Figure 4 represents the crawled data. A URL of the newspaper website is provided in order to crawl its data and store it in a database. The collected story is displayed in the user interface of the application, as in Figure 4. Figure 5 represents the extracted concepts and the term frequencies of the BOW that serves as input to the concept extraction module discussed in section 3.3. Since the BOW contains a large number of terms, the prototype allows increasing or decreasing the number of terms in the BOW based on their frequencies. Graphs and a word cloud of the concepts are presented in the developed prototype for a better understanding of the concepts present in the news stories and articles, as reported in Figure 6. The outcomes of the approach help in gaining an in-depth understanding of news stories and articles, which may be used as a baseline for decision-making strategies. The news media has been used widely for opinion-making purposes. Thus, the extracted concepts help in gaining insight into the news events, the articles and the mindset of the journalists.
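The frequency-based adjustment of the BOW size mentioned in this section could work along the following lines. This Python sketch is only an assumption about the mechanism: the function name, the two parameters (a minimum frequency and a maximum number of terms) and the sample counts are hypothetical and are not taken from the prototype's interface.

```python
from collections import Counter


def trim_bow(bow, min_freq=1, max_terms=None):
    """Keep terms with frequency >= min_freq, optionally only the top max_terms.

    Ties in frequency are broken alphabetically so the result is deterministic.
    """
    kept = [(term, freq) for term, freq in bow.items() if freq >= min_freq]
    kept.sort(key=lambda item: (-item[1], item[0]))
    if max_terms is not None:
        kept = kept[:max_terms]
    return Counter(dict(kept))


# Hypothetical term frequencies for one story.
bow = Counter({"budget": 5, "assembly": 3, "debate": 3, "tax": 1, "vote": 1})
frequent = trim_bow(bow, min_freq=2)
top_two = trim_bow(bow, max_terms=2)
print(frequent)
print(top_two)
```

Raising the minimum frequency or lowering the term limit shrinks the input to concept extraction, and relaxing them enlarges it.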

RESEARCH CHALLENGES AND LIMITATIONS
To acquire data for this study, The Dawn newspaper has been targeted due to its popularity and neutrality. Relying on a single source could be considered a limitation of the study, although the emphasis of the study remained on concept extraction using the lexicon approach. The developed approach may, however, also be applied to a dataset from any other newspaper.
The challenge encountered during the course of the research study is Pakistani English vocabulary. The injection of Urdu words into English has been referred to as Pakistani English. For instance, chai-wala, ziaism, Sahab and Naya Pakistan are part of the Pakistani English vocabulary. The challenge is to determine the concepts from this derived vocabulary. Pakistani English vocabulary has not been addressed in this study due to the lack of lexical chains and of the thorough grammatical resources that help in understanding such words.

CONCLUSION
This study reported a lexicon-based approach for concept extraction. In particular, a working prototype has been developed to demonstrate the applicability and effectiveness of the approach. The application automatically crawls news events and stories, which are stored in a relational database. The Dawn newspaper was chosen as the data source due to its neutrality and popularity in the region.
The collected news data has gone through the necessary text processing phases in order to transform it for further processing. The concepts have been extracted with the help of WordNet, DBpedia and Linked Data. The extracted concepts have then been displayed using visualization techniques such as word clouds and charts. Generally, media has an influential role over the minds of its audience. Thus, the extracted concepts may help in understanding the core concepts present in the news events and stories, which may lead to strategic decision-making. The outcomes of this study may help media analysts gain an in-depth understanding of media personnel and of general public opinion about news and facts on the ground.
The deviation of lexical chains in terms of Pakistani English words will be considered as future work. In particular, developing a Pakistani English corpus to tackle the limitations of this study would be a focus of future work.