IMPLEMENTATION OF ENSEMBLE METHOD ON DNA DATA USING VARIOUS CROSS VALIDATION TECHNIQUES

As datasets grow to contain hundreds or thousands of features, feature selection has drawn the interest of many scholars in recent years. Usually, not all columns carry important values, so machine learning models may perform poorly because noisy or unnecessary columns confound the algorithms. To address this issue, various feature selection methods have been developed to evaluate high-dimensional datasets and identify subsets of pertinent features. The data, however, frequently biases individual feature selection algorithms. As a result, ensemble approaches have emerged as an alternative that combines the benefits of single feature selection algorithms and makes up for their drawbacks. In order to handle feature selection on high-dimensional datasets, this research aims to grasp the key ideas and links in the process of aggregating feature selection methods. The suggested idea is tested by creating a cross-validation implementation that combines a number of Python packages providing the feature selection techniques. The performance of the implementation is demonstrated by identifying pertinent features in human, chimpanzee, and dog DNA datasets.


INTRODUCTION
In recent years, datasets with a large number of attributes have become more common in several fields. Microarray classification serves as the best illustration. Numerous datasets containing this type of data have been produced as a result of improvements in DNA microarray technology. In the majority of these datasets the ratio of instances to features is low, with features ranging from 6 to 60 genes. However, most of the genes in these datasets do not carry information that helps a machine learning process. In order to classify microarray data efficiently, a pre-processing stage is therefore required. This article explains how to do so by choosing a representative subset of genes from the original set of genes (Mera-Gaona, López, Vargas-Canas, and Neumann, 2021) [16].

Two major factors determine how well an ensemble performs: the individual success of the ensemble's base learners (low error) and the independence of the base learners' results (great diversity). Diverse base learners can be built by using base learners of the same or of different types. When using the same type of base learners, diversity is produced by giving each base learner in the ensemble a different training set. Different training data sets can be created using a variety of techniques, including bagging, boosting, random subspaces, random forests, and rotation forests. An ensemble methodology combines a group of models, each of which addresses the same original problem, in order to create a superior composite global model with more precise and trustworthy estimates or conclusions than can be produced by a single model. The fact that different classifier types have distinct inductive biases is one of the key reasons why ensemble methods are so successful (Gopika and Azhagusundari, 2014) [9].

The major goal of this work is to find ways to enhance feature selection on datasets with high dimensionality and few examples. Additionally, cross-validation is used in the demonstration of ensemble methods to combine the benefits of several feature selection algorithms, avoid their biases, and make up for their shortcomings (Mera-Gaona et al., 2021) [16].

ENSEMBLE METHODS
Ensemble classification is founded on the idea that several experts can provide more accurate judgments than a single expert. Ensemble modelling combines a collection of classifiers to produce a single composite model with higher accuracy. According to research, predictions from a composite model provide better outcomes than predictions from a single model. Over the past few decades, research on ensemble techniques has gained popularity. According to a number of experimental tests carried out by machine learning experts, combining the outputs of many classifiers reduces generalisation error. The ensemble approaches are described in this section (Pandey and Taruna, 2014) [10][11].

(1) Bagging
The bagging technique is used to reduce variance. Its goal is to divide the dataset into several training subsets that are randomly chosen with replacement (Singh and Pal, 2020) [10]. The bootstrap sampling approach provides the basis for bagging. A distinct set of bootstrap samples is produced for each iteration of the procedure in order to build a unique classifier. During the sampling phase, data items are chosen at random with replacement, meaning that some instances of the original dataset may be repeated and some may be omitted. The next stage in the bagging process is to combine all of the classifiers built in the previous phase. To arrive at a final prediction, bagging combines the outputs of the classifiers through a voting process [11][12].
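The two stages described above (bootstrap sampling, then voting) can be illustrated with a minimal sketch. The base learner `fit_threshold` and the toy one-dimensional dataset are assumptions made for illustration only; they are not the classifiers or data used in this paper.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items at random with replacement."""
    return [rng.choice(data) for _ in data]

def fit_threshold(sample):
    """Toy base learner (hypothetical): cut halfway between the class means.

    Assumes class 1 tends to have larger x values than class 0.
    """
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not zeros or not ones:
        # Degenerate bootstrap sample: always predict the only class seen.
        only = 0 if zeros else 1
        return lambda x: only
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: int(x > cut)

def bagging_fit(data, n_models, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_threshold(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Combine the base learners by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

data = [(0.1, 0), (0.4, 0), (0.5, 0), (1.5, 1), (1.8, 1), (2.2, 1)]
models = bagging_fit(data, n_models=11)
```

An odd number of models is used here so that a binary vote cannot tie.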

(2) Boosting
Another crucial ensemble method is the boosting classifier, which is used to develop a collection of classifiers. In the boosting approach, classifiers are trained serially by fitting them to the data and then assessing their mistakes (Singh and Pal, 2020) [10]. Boosting improves the performance of a weak classifier to a strong level. It creates sequential learning classifiers by reweighting the data instances. All the instances are given initial weights that are equal and consistent. Each time a learning phase is completed, a new hypothesis is learned, and the examples are reweighted such that instances that were properly identified during that phase have a lower weight and the system may focus on instances that weren't. Instances that were incorrectly categorised are emphasised so they can be correctly categorised in the following learning stage. This procedure continues until the final classifier is built. To arrive at the final forecast, the output of each classifier is finally merged using majority voting. The boosting method has been generalised in AdaBoost (Breiman) [12].
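The reweighting step can be sketched as follows. The update rule shown is the standard AdaBoost one, used here as an assumption; the text above does not specify which rule was applied. The function name `reweight` is hypothetical.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: shrink the weights of correctly classified
    instances, grow the others, then renormalise so the weights sum to 1."""
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
    alpha = 0.5 * math.log((1 - err) / err)  # this classifier's voting weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

weights = [0.25] * 4                   # equal initial weights
correct = [True, True, True, False]    # outcome of the current weak classifier
new_weights, alpha = reweight(weights, correct)
```

After the update, the misclassified instance carries half of the total weight, so the next learner is pushed to focus on it.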

(3) Random Subspaces
The approach comes in two forms. In the first form, each base learner is trained using a distinct feature subspace of the initial training data set. In the second form, only decision trees may be used as the base learner (Gopika and Azhagusundari, 2014) [9].

(4) Random Forest
Breiman proposed Random Forest. It can be formulated as bagging plus the second form of random subspaces (Breiman) [12]: the bagging and random subspace methods are combined to induce each tree. Each model is a random tree. As in bagging, each tree is built from a bootstrap sample of the N training instances, but it differs from bagging in that each node is divided using a further random step. Instead of examining all potential splits, a limited subset of features is randomly picked, and the optimum split is determined from this subset. The majority vote across all trees determines the final classification [11].

(5) Rotation Forest
Rotation Forest is a more recent ensemble approach built on Principal Component Analysis (PCA) and decision trees. To create a training set for a base classifier, the attribute set F is randomly divided into K subsets and PCA is applied separately to each subset, which amounts to K axis rotations of the feature subsets. By keeping all of the principal components, Rotation Forest maintains all of the information. The base classifier for Rotation Forest is the decision tree (Pandey and Taruna, 2014) [11].
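The random feature-subset step that random subspaces and Random Forest share can be sketched as below. This is a minimal illustration, not a full tree learner; the helper names `random_subspace` and `project` are hypothetical.

```python
import random

def random_subspace(n_features, subspace_size, rng):
    """Pick a random subset of feature indices, without replacement."""
    return sorted(rng.sample(range(n_features), subspace_size))

def project(rows, feature_idx):
    """Keep only the selected feature columns of each row."""
    return [[row[j] for j in feature_idx] for row in rows]

rng = random.Random(0)
rows = [[1, 2, 3, 4], [5, 6, 7, 8]]

# One feature subset per base learner; each learner sees its own view.
subspaces = [random_subspace(4, 2, rng) for _ in range(3)]
views = [project(rows, idx) for idx in subspaces]
```

In the random subspace method the subset is drawn once per base learner, whereas in Random Forest a fresh subset is drawn at every node split.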

CROSS VALIDATION TECHNIQUES
Cross-validation is a statistical technique that estimates how well a trained model will perform on unseen data. The model's effectiveness is checked by training it on a subset of the input data and testing it on a different subset. Cross-validation helps to build a generalised model. Because modelling is an iterative process, cross-validation is helpful for both performance estimation and model selection. Cross-validation involves the following three steps:
i. Split the dataset into two sections: a training section and a testing section.
ii. Use the training dataset to train the model.
iii. Use the testing set to gauge the model's effectiveness. Check for problems if the model doesn't perform well with the testing set.
If a model can predict accurately for a variety of input data and does well on unknown data, it is stable and consistent. Cross-validation aids in evaluating the stability of machine learning models. The dataset has to be divided into three separate sections for training and testing the model:
• Training Data: Using the training data, the model is trained to discover the dataset's hidden characteristics and patterns. The model continually assesses the data to better understand its behaviour, and then it modifies itself to achieve its goal. Basically, it is employed to fit the models.

This paper discusses eight alternative cross-validation approaches, each with the advantages and disadvantages stated below.

(1) Leave p out cross-validation
Leave p-out cross-validation is an exhaustive cross-validation strategy that uses p observations as validation data while using the remaining data to train the model. This is repeated for every possible way of splitting the original sample into a validation set of p observations and a training set. Leave-pair-out cross-validation, a variation of leave p-out with p=2, has been suggested as a nearly unbiased way to estimate the area under the ROC curve of a binary classifier (Kumar, 2020) [14].

(2) Leave one out cross-validation
Leave-one-out cross-validation is also an exhaustive cross-validation method. It is the special case of leave p-out cross-validation with p=1. In the first iteration, the first row of a dataset of n rows is chosen for validation, and the remaining n-1 rows are used to train the model. In the following iteration, the second row is chosen for validation and the remainder is used to train the model. The procedure is repeated in the same way for n iterations. Cross-validation techniques such as leave p-out and leave-one-out, which learn and test in every conceivable way, are known as exhaustive cross-validation techniques. They share the advantages of being straightforward, understandable, and simple to use, and the disadvantages that the model might show a little bias and that a lot of computing time is needed [13][14].
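The two exhaustive schemes differ only in the value of p, which a short sketch makes concrete. The generator name `leave_p_out` is an assumption for illustration; it enumerates row indices only and leaves model fitting to the caller.

```python
from itertools import combinations

def leave_p_out(n, p):
    """Yield (train_idx, val_idx) for every way of holding out p of n rows."""
    indices = range(n)
    for val in combinations(indices, p):
        held_out = set(val)
        train = [i for i in indices if i not in held_out]
        yield train, list(val)

loo_splits = list(leave_p_out(4, 1))   # leave-one-out: n splits
lpo_splits = list(leave_p_out(4, 2))   # leave-pair-out: n*(n-1)/2 splits
```

The number of splits is the binomial coefficient C(n, p), which is why the computing time grows quickly for larger p.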

(3) Holdout cross-validation
In holdout cross-validation, the dataset is randomly divided into training and validation data. In general, the training portion is larger than the validation portion. The model is created using the training data, and the validation data is used to assess the model's effectiveness. The model becomes better as more data are used to train it, yet the holdout approach withholds a sizable amount of data from training. Its advantages are that it is straightforward, understandable, and simple to use; its disadvantages are that it is not suitable for an unbalanced dataset and that a lot of data is not used to train the model (Raschka, 2020) [5].
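A single random holdout split can be sketched as follows. The function name `holdout_split` and the 30% validation fraction are assumptions chosen for illustration.

```python
import random

def holdout_split(n, val_fraction=0.2, seed=0):
    """Shuffle the row indices once and cut them into train and validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n * val_fraction))
    return idx[n_val:], idx[:n_val]

train_idx, val_idx = holdout_split(10, val_fraction=0.3)
```

Because the cut is purely random, a rare class can end up under-represented in either part, which is the weakness on unbalanced data noted above.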

(4) Repeated random sub-sampling validation
In repeated random sub-sampling validation, commonly known as Monte Carlo cross-validation, the dataset is randomly divided into training and validation data. Unlike k-fold cross-validation, it separates the dataset into random splits rather than groups or folds. The number of iterations is not a set quantity; it is determined by the analysis. The results are then averaged over the splits. Its advantage is that no relation exists between the number of iterations or divisions and the fraction of train and validation splits. Its disadvantages are that some samples may never be used for either training or validation, and that it is not appropriate for an unbalanced dataset [5][14].
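A minimal sketch of the repeated random splits is given below; the function name `monte_carlo_splits` and the parameter values are assumptions. Model fitting and the averaging of the per-split scores are left to the caller.

```python
import random

def monte_carlo_splits(n, n_iterations, val_fraction, seed=0):
    """Yield n_iterations independent random (train_idx, val_idx) splits."""
    rng = random.Random(seed)
    n_val = int(round(n * val_fraction))
    for _ in range(n_iterations):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_val:], idx[:n_val]

splits = list(monte_carlo_splits(n=10, n_iterations=5, val_fraction=0.2))

# Rows that were never drawn into a validation set across all iterations.
never_validated = set(range(10)) - {i for _, val in splits for i in val}
```

The number of iterations and the validation fraction are chosen independently, and `never_validated` shows how some rows can be missed when the iterations are few.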

(5) k-fold cross-validation
In k-fold cross-validation, the original dataset is evenly divided into k subparts or folds. For each iteration, one of the k folds or groups is chosen as the validation data, while the remaining (k-1) groups are chosen as the training data.

(6) Stratified k-fold cross-validation
All the cross-validation methods mentioned above might not be effective with an unbalanced dataset. Stratified k-fold cross-validation addresses the unbalanced dataset issue. The dataset is divided into k groups or folds such that each fold contains approximately the same proportion of instances of each target class label. This makes sure that, especially when the dataset is unbalanced, one specific class is not overrepresented in the validation or training data. The final score is the average of the scores for each fold. As a benefit, it performs well for an unbalanced dataset (Refaeilzadeh, 2008) [3][14].
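The difference between plain and stratified k-fold splitting can be sketched on index lists. Both helpers below are illustrative assumptions: they assign rows to folds deterministically without shuffling, which a practical implementation would usually add.

```python
from collections import defaultdict

def k_fold(n, k):
    """Yield (train_idx, val_idx); fold sizes differ by at most one."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, folds[i]

def stratified_k_fold(labels, k):
    """Deal each class's rows round-robin so folds keep the class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for i in range(k):
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, sorted(folds[i])

labels = ["a"] * 6 + ["b"] * 3          # unbalanced toy labels
plain = list(k_fold(len(labels), 3))
stratified = list(stratified_k_fold(labels, 3))
```

With these toy labels every stratified validation fold holds two rows of class "a" and one of class "b", matching the 2:1 ratio of the full dataset.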

(7) Time Series cross-validation
When dealing with problems involving time series, the order of the data is crucial. For time-related datasets, dividing the data randomly or into k folds for training and validation might not produce the best results. The forward chaining method, also known as rolling cross-validation, is used to divide the time-series data into training and validation groups. For a given iteration, the instances that immediately follow the training data are used as validation data [13].
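Forward chaining can be sketched with an expanding training window. The function name `forward_chaining` and the equal block size are assumptions for illustration.

```python
def forward_chaining(n, n_splits):
    """Expanding-window splits: train on the past, validate on the next block."""
    block = n // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train = list(range(0, i * block))
        val = list(range(i * block, (i + 1) * block))
        yield train, val

splits = list(forward_chaining(n=8, n_splits=3))
```

Every validation index is later in time than every training index of the same split, so the model is never evaluated on data that precedes what it was trained on.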

BASIC PROCESS OF MACHINE LEARNING
The field of machine learning blends traditional statistical methods with computer science techniques.

CONCLUSION AND FUTURE WORK
Feature selection and feature extraction from DNA sequence data were successfully completed with this method. Here, we employed k-mer counting, one-hot encoding, and ordinal encoding, implemented with Python libraries, to derive features from DNA sequences. We have demonstrated the results obtained with these libraries in the form of a matrix, a vector, and a graph. In future work, we will also use the retrieved k-mers in the classification process.

(8) Nested cross-validation
When using k-fold and stratified k-fold cross-validation, we obtain a subpar estimate of the error in training and test data. In the prior techniques, hyper-parameter adjustment is done separately. Nested cross-validation is necessary when cross-validation is used to tune the hyper-parameters and to estimate the generalisation error at the same time. Nested cross-validation can be used with both the k-fold and the stratified k-fold variations [14].
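The two loops of nested cross-validation can be sketched as follows. The toy model (a shrunken mean), the `shrink` hyper-parameter, and the data are hypothetical and serve only to show where tuning and error estimation happen.

```python
def k_fold(indices, k):
    """Yield (train, val) index lists over k folds of the given indices."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def fit(y, train, shrink):
    """Toy model (hypothetical): a shrunken mean of the training targets."""
    return shrink * sum(y[i] for i in train) / len(train)

def mse(y, idx, prediction):
    """Mean squared error of a constant prediction on the given rows."""
    return sum((y[i] - prediction) ** 2 for i in idx) / len(idx)

def nested_cv(y, shrink_grid, outer_k=3, inner_k=2):
    outer_scores = []
    for outer_train, outer_test in k_fold(list(range(len(y))), outer_k):
        # Inner loop: choose the hyper-parameter on the outer training part only.
        def inner_error(shrink):
            errs = [mse(y, val, fit(y, tr, shrink))
                    for tr, val in k_fold(outer_train, inner_k)]
            return sum(errs) / len(errs)
        best = min(shrink_grid, key=inner_error)
        # Outer loop: refit with the chosen value and score on untouched data.
        outer_scores.append(mse(y, outer_test, fit(y, outer_train, best)))
    return sum(outer_scores) / len(outer_scores)

y = [2.0, 2.1, 1.9, 2.2, 1.8, 2.0]
estimate = nested_cv(y, shrink_grid=[0.5, 1.0])
```

The outer test rows never influence the choice of `shrink`, which is what keeps the outer error estimate from being optimistically biased.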

Fig. 1: Basic Process. Source: datavalley.
1. Data Collection: Gather all the information you need from the many systems that might contribute to your situation.
2. Data Pre-processing: Prior to processing and analysis, raw data must be cleaned and transformed. It is a crucial phase that frequently entails reformatting data, making adjustments to data, and merging data sets to enrich the data.
- Data Cleaning: The initial phase in data mining consists of removing incomplete or inconsistent data, since data sets frequently contain missing and inconsistent data. Low data quality will have a significant negative influence on the information extraction process.
- Data Integration: If the data to be examined come from many sources, they must be reliably aggregated.
3. Feature Engineering: This covers all modifications made to the data, from cleaning it up to feeding it into the machine learning model. In this stage you choose and prepare the features that will be used in your machine learning model, making sure they are in the format required by the model.
4. Selection of Model: Choose the best model for the situation and then make any necessary adjustments.

Fig 3 DNA Sequence with class.

Fig 4 Class distribution graph.
Feature Selection: Although a DNA sequence is represented by characters, machine learning algorithms need numerical values or feature matrices. In order to convert these characters into values, we use three general approaches: ordinal encoding, one-hot encoding and k-mer counting.
Ordinal Encoding: With this method, each nitrogen base is encoded as an ordinal value. "A, T, G, and C", for instance, becomes [0.25, 0.5, 0.75, 1.0]. Any additional base, like "Z", may be encoded as 0.
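The ordinal mapping described above can be written as a short sketch. It follows the values stated in the text; it is not the code shown in the original figure, and the function name `ordinal_encode` is an assumption.

```python
# Mapping taken from the text: A, T, G, C -> 0.25, 0.5, 0.75, 1.0.
ORDINAL = {"A": 0.25, "T": 0.5, "G": 0.75, "C": 1.0}

def ordinal_encode(sequence):
    """Encode each base as a float; any other character becomes 0.0."""
    return [ORDINAL.get(base, 0.0) for base in sequence.upper()]

encoded = ordinal_encode("ATGCZ")
```

Here `encoded` is `[0.25, 0.5, 0.75, 1.0, 0.0]`, with the unknown base "Z" mapped to 0.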

Fig 5 Python Code for ordinal encoding.

Fig 7 Python code for One-hot encoder.
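One possible one-hot encoder is sketched below. It is not the code from the original figure; the column order in `BASES` and the all-zero row for unknown bases are assumptions.

```python
BASES = "ACGT"  # column order is an assumption

def one_hot_encode(sequence):
    """Return one row per base; unknown bases become an all-zero row."""
    return [[int(base == b) for b in BASES] for base in sequence.upper()]

matrix = one_hot_encode("GATN")
```

Each row of `matrix` has a single 1 in the column of its base, so a sequence of length n becomes an n x 4 feature matrix.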

Fig 9 Python code for K-mers counting.
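K-mer counting can be sketched as follows. It is not the code from the original figure; the helper names and the choice of k are assumptions for illustration.

```python
from collections import Counter

def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k):
    """Count how often each k-mer occurs in the sequence."""
    return Counter(kmers(sequence, k))

words = kmers("ATGCATGCA", 6)
counts = kmer_counts("ATGCATGCA", 3)
```

The counts can then be arranged into a fixed-length feature vector, one entry per possible k-mer, so that sequences of different lengths become comparable.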

• Validation Data: This is used to confirm that the model's training results were accurate. It aids in adjusting the hyper-parameters and settings of the model appropriately. The prediction error for model selection is estimated using the validation data. Validation data helps prevent overfitting.
• Test Data: Following training, the test data confirms that the trained model is capable of making precise predictions. It is used to evaluate the generalisation error of the last model chosen (Hulu and Sihombing, 2020) [1] (Jung, 2015) [7][8] (Wu) [13][14].

In k-fold cross-validation, with (k-1) groups used as the training data in each iteration, the procedure is repeated k times until each group has been used as validation data and the rest as training data. The model's final accuracy is calculated as the mean accuracy of the k models on their validation data. Its advantages are that the model exhibits little bias, the time complexity is low, and the complete dataset is used for both training and validation (Raschka, 2020).

3C Tecnología. Glosas de innovación aplicadas a la pyme. ISSN: 2254-4143 Ed. 42 Vol. 11 N.º 2 August - December 2022 https://doi.org/10.17993/3ctecno.2022.v11n2e42.59-69