DATA PREPROCESSING: A PRELIMINARY STEP FOR WEB DATA MINING

Recent years have seen immense growth in data, commonly called big data, promising a brighter and more optimized future. Big data demands large computational infrastructure with high-performance processing capabilities. Preparing big data for mining and analysis is a challenging task that requires the raw data to be preprocessed to improve its quality; the representation and quality of the data instances come first. Data preprocessing is a preliminary data mining practice in which raw data is transformed into a format suitable for further processing. It improves data quality by cleaning, normalizing, and transforming raw data and by extracting relevant features from it. Data preprocessing significantly improves the performance of machine learning algorithms, which in turn leads to more accurate data mining. Knowledge discovery from noisy, irrelevant, and redundant data is difficult, and both the precise identification of extreme values and outliers and the filling in of missing values pose challenges. This paper discusses various big data preprocessing techniques used to prepare data for mining and analysis tasks.


INTRODUCTION
Year after year, organizations have realized the benefits that big data analytics provides. Data scientists and researchers demand that current practices for processing raw data evolve. Automated information extraction from huge data repositories is hard because most of the data is unstructured. Cloud computing services, being cost-effective and easy to use, have also contributed to the growing rate of data on the web. This phenomenon undoubtedly poses a challenge for data scientists and analysts: Big Data, characterized by very high volume, velocity, and variety, requires new high-performance processing (Xindong, Xingquan, Gong-Qing & Ding, 2014). The process of extracting relevant and useful information from this data deluge is known as data mining, and it depends heavily on the quality of the data. Raw data is usually vulnerable to noise, is incomplete or inconsistent, and contains outlier values. Thus, the data has to be processed before data mining is applied (Alasadi & Bhaya, 2017). Data preprocessing transforms the raw dataset into an understandable format and is a fundamental stage in data mining. The preprocessing methods directly affect the outcomes of any analytic algorithm; however, the methods may vary with the area of application. According to a report by Aberdeen Group, data preparation refers to any action intended to increase the quality, usability, accessibility, or portability of data. The ultimate objective of data preparation is to supply analytical systems with clean and consumable data that can be transformed into actionable insights. Data preprocessing embraces numerous practices such as cleaning, integration, transformation, and reduction.
The preprocessing phase may consume a substantial amount of time, but its outcome is a final dataset that is expected to be correct and useful for subsequent data mining algorithms.

BACKGROUND
The raw data available in data warehouses, data marts, and database files (Jiawei, Micheline & Jian, 2012) is generally not organized for analysis: it may be incomplete or inconsistent, distributed across various tables, or represented in different formats. In short, it is dirty. The process of discovering knowledge from massive chronological data sources is called Knowledge Discovery in Databases (KDD) or Data Mining (Malley, Ramazzotti & Wu, 2016; Gupta & Gurpreet, 2009). This is the era of big data, and every field of life is generating data at a dramatic rate. The most challenging task is to obtain the right information from the available data sources.
The task of reorganizing data is known as data preparation, and it is used to discover the anticipated knowledge. It involves understanding the domain-specific problem under consideration and then collecting targeted data to achieve the anticipated goals (Gülser, İnci & Murat, 2011). Forrester estimates that up to 80 per cent of a data analyst's time is consumed in preparing data (Goetz, 2015). The selected data can then be preprocessed for data mining. Data preprocessing is an effective way to increase data quality. It includes data cleansing, normalization, transformation, feature extraction and selection, etc. The processed data serves as the training set for the machine learning algorithm.

DATA CLEANING
The first stage of data preprocessing is data cleaning, which identifies partial, incorrect, imprecise, or inappropriate parts of the data in datasets (Tamraparni & Theodore, 2003). Data cleaning may eliminate typographical errors. It may ignore tuples that contain missing values, or correct values by comparing them against a known list of entities. The data then becomes consistent with the other datasets available in the system. Specifically, data cleaning comprises the four basic steps described in Table 1.

Step | Description
1. Data Analysis | Detect dirty data by reviewing the dataset, the quality of the data, and the metadata.
2. Define Work Flow | Define the cleaning rules by considering the degree of heterogeneity among the diverse data sources, then set the order in which the rules are applied, such as which data type to clean, under what condition, and with which strategy.
3. Execute Defined Rules | Apply the defined rules to the source dataset and display the resulting clean data to the user.
4. Verification | Verify the accuracy and efficiency of the cleaning rules and whether they satisfy user requirements.

Steps 2 and 3 are executed repeatedly until all problems related to data quality are solved, and steps 1 to 4 are repeated until the cleaned data meets user requirements.
Handling missing values is difficult because improperly handled missing values may lead to poor knowledge being extracted (Hai & Shouhong, 2009). The Expectation-Maximization (EM) algorithm, imputation, and filtering are generally considered for handling missing values ("Expectation maximization algorithm"). Various data cleansing solutions clean dirty data by checking it against a validated dataset. Some tools use data enhancement techniques, which complete an incomplete dataset by adding related information. Binning methods can be used to smooth noisy data, and clustering techniques are used to detect outliers (Jiawei, et al., 2012). Data can also be smoothed by fitting it to a regression function; regression procedures such as linear, multiple, or logistic regression are used to determine that function.
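As an illustration of two of the techniques named above, the following Python sketch fills missing values by mean imputation and smooths noisy values by equal-frequency binning with bin means. It is a minimal example under stated assumptions, not a method taken from the cited works: the function names, the use of None to mark a missing value, and the sample numbers are all hypothetical.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins of
    bin_size items, and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical sample data; None marks a missing reading.
readings = [1, None, 3]
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

print(impute_mean(readings))
print(smooth_by_bin_means(prices, bin_size=3))
```

Mean imputation is only the simplest choice; the EM algorithm mentioned above estimates missing values from a fitted model instead of a single summary statistic.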

DATA INTEGRATION
Data integration is the process of merging data derived from different sources into a consistent dataset. Data on the web is expanding in size and complexity, and is either unstructured or semi-structured. Integrating data is an extremely cumbersome and iterative process. First, the considerations during integration mostly relate to the differing standards of heterogeneous data sources. Second, integrating new data sources into an existing dataset is time-consuming, which ultimately results in poor use of valuable information. ETL (Extract-Transform-Load) tools are used to handle larger volumes of data; they integrate diverse sources into a single physical location, provide uniform conceptual schemas, and provide querying capabilities.
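The idea of mapping heterogeneous sources onto a uniform schema can be sketched in a few lines of Python. This is a toy illustration, not an ETL tool: the two sources, their field names, the field maps, and the choice of "id" as the merge key are all assumptions made for the example.

```python
def to_common_schema(record, field_map):
    """Rename source-specific fields to the shared schema's field names."""
    return {common: record[source]
            for source, common in field_map.items()
            if source in record}

def integrate(sources):
    """Merge records from several sources into one dataset keyed by 'id'.

    Each item in sources is a (records, field_map) pair."""
    merged = {}
    for records, field_map in sources:
        for record in records:
            row = to_common_schema(record, field_map)
            merged.setdefault(row["id"], {}).update(row)
    return list(merged.values())

# Two hypothetical sources that describe the same customers differently.
crm = [{"cust_id": 1, "full_name": "Ada"}]
shop = [{"customer": 1, "mail": "ada@example.org"},
        {"customer": 2, "mail": "bob@example.org"}]

dataset = integrate([
    (crm, {"cust_id": "id", "full_name": "name"}),
    (shop, {"customer": "id", "mail": "email"}),
])
print(dataset)
```

A real integration step would also have to resolve conflicting values and differing units or formats, which this sketch ignores.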

DATA TRANSFORMATION
Raw data is usually transformed into a format suitable for analysis. Data can be normalized, for instance by transforming a numerical variable to a common range. Normalization can be achieved using the range (min-max) normalization technique or the z-score method. Categorical data can also be transformed using aggregation, which merges two or more attributes into a single attribute. Generalization can be applied to low-level attributes, which are transformed to a higher level.
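The two normalization techniques mentioned above can be sketched as follows. Range normalization maps each value v to new_min + (v - min) * (new_max - new_min) / (max - min), and the z-score method maps v to (v - mean) / standard deviation. The function names and sample values are hypothetical, and the use of the population standard deviation is an assumption; the source does not say which variant is intended.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

def z_score_normalize(values):
    """Center values on their mean and scale by the population
    standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [10, 20, 30]             # hypothetical attribute values
scores = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population std. dev. 2

print(min_max_normalize(incomes))
print(z_score_normalize(scores))
```

Range normalization is sensitive to extreme values because a single outlier sets min or max, whereas the z-score method is usually preferred when the true bounds of an attribute are unknown.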

DATA REDUCTION
Multifaceted exploration of huge data sources may consume considerable time or even be infeasible. When the number of predictor variables or the number of instances becomes large, mining algorithms suffer from dimensionality handling problems (Jiawei, et al., 2012). The last stage of data preprocessing is data reduction, which makes the representation of the input data more compact without losing its integrity. Data reduction may or may not be lossless. The resulting database may contain all the information of the original database in a well-organized format (Bellatreche & Chakravarthy, 2017). Encoding techniques, hierarchy distribution, and data cube aggregation can be used to reduce the size of the dataset. Data reduction complements the feature selection process. Instance selection (Vijayarani, Ilamathi & Nithya, 2015) and instance generation are two approaches used by data mining algorithms to reduce data size.
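Two of the reduction ideas above can be illustrated with a short Python sketch: aggregation, shown here as rolling quarterly figures up to yearly totals, and instance selection, shown here in its simplest form as uniform random sampling. The function names, the (year, quarter, amount) row layout, and the sales figures are hypothetical, and random sampling is only one assumed stand-in for the instance selection methods the cited work covers.

```python
import random
from collections import defaultdict

def aggregate_by_year(quarterly_sales):
    """Roll (year, quarter, amount) rows up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in quarterly_sales:
        totals[year] += amount
    return dict(totals)

def select_instances(instances, k, seed=0):
    """Keep k instances chosen uniformly at random without replacement.
    The fixed seed makes the selection reproducible."""
    rng = random.Random(seed)
    return rng.sample(instances, k)

# Hypothetical quarterly sales figures.
sales = [(2022, 1, 224), (2022, 2, 408), (2022, 3, 350), (2022, 4, 586),
         (2023, 1, 300), (2023, 2, 416), (2023, 3, 380), (2023, 4, 610)]

yearly = aggregate_by_year(sales)       # 8 rows reduced to 2 totals
subset = select_instances(sales, k=3)   # 8 rows reduced to 3 rows
print(yearly)
print(subset)
```

Aggregation here is lossy with respect to the quarterly detail, which matches the remark above that data reduction may or may not be lossless.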

WEB DATA PREPROCESSING FRAMEWORK
The World Wide Web is a huge repository containing an enormous amount of textual data, much of it created on a daily basis and ranging from structured to semi-structured to completely unstructured (Andrew, 2015). How can we utilize that data in a productive way? What can we do with it? The answer to these two questions depends entirely on our objective. To take advantage of all this data, it has to be preprocessed. Preprocessing entails various steps that may or may not apply to a given task, but they usually fall under the broad categories of tokenization, normalization, and substitution.
• Tokenization: in textual data preprocessing, tokenization is used to split long strings of text into smaller ones; for example, sentences can be tokenized into words. It is also known as text segmentation or lexical analysis.
• Normalization: this generally refers to a series of related tasks that place all words on an equal footing. Examples include stemming, lemmatization, changing case from upper to lower or lower to upper, removing punctuation, spaces, or stop words, and substituting numbers with their equivalent words.
• Substitution or Noise Removal: text data on websites is wrapped in HTML or XML tags; pattern matching or regular expressions can be used to extract the desired text by removing HTML, XML, etc. markup and metadata.

CASE STUDY
Our objective is to preprocess a predetermined body of text so that we are left with artifacts that will be more valuable and meaningful for any text mining algorithm. The approach proposed here is fully applicable to any web page content. We first remove noise, which in our case means HTML tags, and substitute English language contractions. The content is then tokenized, and finally the text is normalized. We used PHP as the scripting language to perform preprocessing on the text, and the PHP Natural Language Toolkit for tokenization. Figure 3 shows dummy HTML page content, but the steps for preprocessing this data are fully transferable.

NOISE REMOVAL AND SUBSTITUTION
The data preprocessing pipeline starts with noise removal, as it is not task dependent. The lines of code in Figure 4 read in the text file called sample.txt, which contains the dummy HTML data shown in Figure 3, and call a PHP built-in function to strip off the HTML tags. It is beneficial to replace English language contractions with their expansions before tokenization; otherwise the tokenizer will split a word such as "didn't" into "did" and "n't" rather than "did" and "not". We implemented contraction expansion by retrieving a list of contractions stored in a MySQL database and comparing it with our content. Every occurrence of a matched contraction is then replaced with its expansion.
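The two steps described above can be sketched in Python. This is not the PHP code of Figure 4, which is not reproduced in the text; it is a stand-in written under stated assumptions. The contraction list is a small hard-coded dictionary rather than a MySQL table, and the names strip_html, expand_contractions, and CONTRACTIONS are hypothetical.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the markup."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    """Remove HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())

# Assumption: a tiny in-memory list standing in for the database table.
CONTRACTIONS = {
    "didn't": "did not",
    "can't": "cannot",
    "it's": "it is",
    "we're": "we are",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace each known contraction with its expansion.
    Matching ignores case; the expansion is always lowercase."""
    return _CONTRACTION_RE.sub(
        lambda m: CONTRACTIONS[m.group(0).lower()], text)

sample = "<p>It didn't <b>work</b>.</p>"
clean = expand_contractions(strip_html(sample))
print(clean)
```

Because adjacent text fragments are joined without a separator, text from neighbouring block elements with no whitespace between them runs together; a production pipeline may want to insert a space at block boundaries.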

TOKENIZATION
For tokenization, we used the PHP Natural Language Processing (NLP) toolkit. The toolkit supports various kinds of tokenization under its tokenizers namespace; we use RegexTokenizer.
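Regular-expression tokenization can be sketched in Python with the standard re module. This does not reproduce the PHP toolkit's RegexTokenizer, whose configuration is not given in the text; the function name and the default pattern, which treats a run of word characters or a single punctuation mark as one token, are assumptions.

```python
import re

def regex_tokenize(text, pattern=r"\w+|[^\w\s]"):
    """Return every non-overlapping match of pattern as a token.

    The default pattern yields runs of word characters as words and
    each punctuation character as its own token; whitespace is dropped."""
    return re.findall(pattern, text)

tokens = regex_tokenize("Data mining, simplified.")
print(tokens)
```

With this pattern an apostrophe splits a word into separate tokens, which is one reason to expand contractions before tokenizing.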
Stemming: The aim of this step is to reduce the inflectional forms of a word to a common base form, for instance "cars" to "car".
Everything Else: This step transforms all words into lowercase, removes non-ASCII words, removes punctuation, replaces numbers, and removes stop words.
The results of this simple text data preprocessing process are shown in Figure 11.
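The normalization steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the case study: the stemmer is a naive suffix stripper and not an established algorithm such as Porter's, the stop-word list is a tiny hand-picked sample, numbers are replaced digit by digit, and all names are hypothetical.

```python
import string

# Assumption: a tiny sample list; real stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def naive_stem(word):
    """Strip one common suffix if at least three letters remain."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(tokens):
    """Lowercase, drop non-ASCII tokens, strip punctuation, spell out
    digits, drop stop words, and stem whatever is left."""
    result = []
    for token in tokens:
        token = token.lower()
        if not token.isascii():
            continue
        token = token.strip(string.punctuation)
        if not token:
            continue
        if token.isdigit():
            # Spell the number out digit by digit, e.g. "42" -> "four two".
            result.append(" ".join(DIGIT_WORDS[int(d)] for d in token))
            continue
        if token in STOP_WORDS:
            continue
        result.append(naive_stem(token))
    return result

words = normalize(["The", "3", "Cars", ",", "are", "parked", "café"])
print(words)
```

The order of the steps matters: lowercasing first lets the stop-word lookup work on capitalized tokens, and stop words are removed before stemming so that short function words are never stemmed.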

CONCLUSION
Any data analysis algorithm will fail to discover hidden patterns or trends in data if the dataset under observation is inadequate, irrelevant, or incomplete. Thus, data preprocessing is a central phase in any data analysis process. Preprocessing resolves numerous kinds of problems such as noise, redundancy, and missing values. High-quality results are achievable only with high-quality data, which in turn also reduces the cost of data mining. The foundation of the decision-making system in any organization is the three C's of data: Completeness, Consistency, and Correctness. Poor data quality affects the decision-making process, which eventually decreases customer satisfaction. Furthermore, a larger dataset affects the performance of any machine learning algorithm; instance selection therefore reduces the data and is an efficient way to make machine learning algorithms work effectively.