COMPARATIVE ANALYSIS OF SUPERVISED MACHINE LEARNING ALGORITHMS FOR HEART DISEASE DETECTION

This paper describes the most prominent Supervised Machine Learning (SML) algorithms, their characteristics, and how they compare in the way they treat data. The Heart Disease dataset obtained from Kaggle was used to test the algorithms and determine which reaches the highest accuracy. To this end, the selected algorithms were implemented with the Python sklearn library and evaluated to determine which obtains the best results; decision tree algorithms achieved the best prediction results.


INTRODUCTION
Machine learning is one of the fastest-growing areas of computer science (Srivastava et al., 2014), with far-reaching applications. It refers to the automatic detection of meaningful patterns in data, and machine learning tools give programs the ability to learn and adapt.
Machine learning has become one of the pillars of information technology and, with that, a reasonably central, though generally hidden, part of our life. With the increasing amount of data available, there is a good reason to believe that intelligent data analysis will be even more widespread as a necessary ingredient for technological progress.
Machine Learning (ML) has several applications, one of the most important being data mining (Bustamante, Rodríguez, & Esenarro, 2019). When handling large amounts of data, people are more likely to make mistakes during analyses or when trying to establish relationships between multiple features.
Data mining and machine learning go hand in hand: with appropriate learning algorithms, several insights can be derived from data. There has been significant progress in data mining and machine learning as a result of the evolution of nanotechnology, which generated curiosity to find hidden patterns in the data to obtain results. The fusion of mathematics and statistics, machine learning and artificial intelligence, information theory and big data, and high-performance computing has created a reliable science with a firm mathematical base and compelling tools. This paper focuses on the classification of ML algorithms and on determining the most efficient algorithm, the one with the best accuracy and precision. It also aims to establish the performance of different algorithms on large and small datasets, with a view to classifying them correctly and providing information on how to build supervised machine learning models.
typically the number of counts of a word in a report. However, the rate of convergence between the variables in the data set depends on the margin. In general terms, the margin quantifies how linearly separable a collection of data is and, therefore, how easy it is to solve a given classification problem.
2) Naive Bayesian Networks: These are very simple Bayesian networks composed of directed acyclic graphs with a single parent (representing the unobserved node) and several children (corresponding to the observed nodes), with a strong assumption of independence among the child nodes given their parent.
Thus, the independence model (Naive Bayes) is based on the estimate. Bayes classifiers tend to be less accurate than other, more sophisticated learning algorithms (such as Artificial Neural Networks). However, a large-scale comparison of the naive Bayes classifier with state-of-the-art algorithms for decision tree induction, instance-based learning, and rule induction on standard benchmark data sets found that it is sometimes superior to the other learning schemes, even on data sets with substantial feature dependencies. The attribute independence problem of the Bayes classifier was addressed with averaged one-dependence estimators.
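For reference, the estimate behind the independence model can be written in its standard textbook form. The notation is assumed here and does not come from the original text: C is the class variable, and x_1, ..., x_n are the observed feature values of one instance.

```latex
\[
\hat{c} \;=\; \arg\max_{c} \; P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)
\]
```

The product over features is exactly where the independence assumption enters: each feature is treated as conditionally independent of the others given the class, so only the prior and the per-feature conditional probabilities have to be estimated.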
3C Tecnología. Glosas de innovación aplicadas a la pyme. ISSN: 2254-4143 Edición Especial Special Issue Abril 2020
3) Support Vector Machines: This is the most recent supervised machine learning technique. Support vector machine (SVM) models are closely related to classical multilayer perceptron neural networks. SVMs revolve around the notion of a "margin" on each side of a hyperplane that separates two classes of data. It has been shown that maximizing the margin, and thereby creating the largest possible distance between the separating hyperplane and the instances on each side of it, reduces an upper bound on the expected generalization error.
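The margin idea can be made concrete for the linearly separable case. The following is the standard hard-margin formulation, given as a sketch; the symbols are assumed notation (w is the weight vector, b the bias, x_i the training instances, y_i their labels in {-1, +1}, and m the number of instances).

```latex
\[
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1, \qquad i = 1, \dots, m
\]
\[
\text{margin width} \;=\; \frac{2}{\lVert w \rVert}
\]
```

Minimizing the norm of w is therefore equivalent to maximizing the distance between the separating hyperplane and the closest instances on each side.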

4) K-means:
It is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and straightforward way to partition a given data set into a certain number of clusters (say, k clusters) fixed a priori. The K-means algorithm is used when labeled data is not available (Bhavsar & Ganatra, 2012).
Boosting is a general method for converting approximate general rules into a highly accurate prediction rule. Given a "weak" learning algorithm that can consistently find classifiers ("general rules") at least slightly better than random, say 55% accuracy, a boosting algorithm can, with sufficient data, build a single classifier with very high accuracy, say 99%.
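Returning to K-means, the procedure of grouping data into k clusters fixed a priori can be sketched in plain Python. This is a minimal illustration of the usual assign-and-update loop, not the implementation used in this study; the function name `k_means`, the toy points, and the initial centroids are all assumptions made for the example.

```python
import math


def k_means(points, centroids, max_iter=100):
    """Cluster 2-D points; `centroids` holds the k initial centers."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


# Toy data (made up): two visually separate groups of three points each.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
```

Because no labels are used at any point, the algorithm only groups the data; it does not by itself predict a class such as the presence of heart disease.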

5) Decision Tree:
Decision trees (DT) are trees that classify instances by sorting them according to feature values. Each node in a decision tree represents a feature of an instance to be classified, and each branch represents a value that the node can assume. Instances are classified starting at the root node and are sorted based on their feature values. Decision tree learning, used in data mining and machine learning, uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value.
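The way an instance travels from the root node to a conclusion can be illustrated with a small hand-built tree. The tree below is invented purely for illustration: its feature names, branch values, and labels are hypothetical and were not learned from the heart disease data.

```python
# Hypothetical hand-built tree: internal nodes test one feature,
# branches are the values that feature can take, leaves hold a class label.
tree = {
    "feature": "chest_pain",
    "branches": {
        "typical": {"label": "present"},
        "none": {
            "feature": "exercise_angina",
            "branches": {
                "yes": {"label": "present"},
                "no": {"label": "absent"},
            },
        },
    },
}


def classify(node, instance):
    """Walk from the root, following the branch that matches each feature value."""
    while "label" not in node:
        value = instance[node["feature"]]
        node = node["branches"][value]
    return node["label"]


print(classify(tree, {"chest_pain": "none", "exercise_angina": "no"}))
```

A learning algorithm builds such a structure automatically by choosing, at each node, the feature that best separates the training instances.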

6) Neural Networks:
They can perform several regression and classification tasks at the same time, although commonly each network performs only one (Sethi et al., 2019).
Therefore, in the vast majority of cases, the network will have a single output variable.
However, in classification problems with many classes, this may correspond to several output units (the post-processing stage is responsible for mapping output units to output variables) (Mureșan & Oltean, 2018). There is general agreement that the K nearest neighbor algorithm is very sensitive to irrelevant features; this can be explained by the way the algorithm works. Likewise, the presence of irrelevant features can make the training of a neural network very inefficient, even impractical. Most decision tree algorithms do not work well on problems that require diagonal partitioning (Sathya & Abraham, 2013): each split of the instance space is orthogonal to the axis of one variable and parallel to all other axes, so the resulting regions are all hyperrectangles. Artificial neural networks and support vector machines work well when multicollinearity is present and there is a non-linear relationship between the input and output features.
Naive Bayes (NB) requires little storage space during both the training and classification stages: the strict minimum is the memory needed to store the prior and conditional probabilities. The basic kNN algorithm uses a large amount of storage space for the training phase (Cao et al., 2019), and its execution space is at least as large as its training space. In contrast, for all non-lazy learners, the execution space is usually much smaller than the training space, since the resulting classifier is often a very condensed summary of the data. Besides, Naive

METHODOLOGY
The methodology for determining the best supervised algorithm for the heart disease dataset begins with the interpretation of the data, continues with data preprocessing, and ends with the application of the algorithms to determine which achieves the best accuracy.

A. Dataset
The dataset used for this research is "Heart Disease", found in the Kaggle repository. This database contains 76 attributes, but all published experiments refer to the use of a subset of 14 of them. In particular, the Cleveland database is the only one that ML researchers have used to date. The "goal" field refers to the presence of heart disease in the patient; it takes an integer value from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on the simple attempt to distinguish presence (values 1, 2, 3, 4) from absence (value 0) (Ray, 2018; Sethi et al., 2019; Agarwal & Sagar, 2019).
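The reduction of the "goal" field to a presence/absence label can be sketched as follows. This is only an illustration of the mapping described above; the function name `binarize_goal` and the sample values are assumptions, not part of the original study.

```python
def binarize_goal(goal):
    """Map the 0-4 'goal' value to 0 (absence) or 1 (presence)."""
    if goal not in (0, 1, 2, 3, 4):
        raise ValueError(f"unexpected goal value: {goal!r}")
    return 0 if goal == 0 else 1


# Made-up sample of goal values, for illustration only.
goals = [0, 2, 0, 4, 1, 3]
binary = [binarize_goal(g) for g in goals]
print(binary)  # [0, 1, 0, 1, 1, 1]
```

Collapsing values 1 to 4 into a single class turns the task into binary classification, which is the setting evaluated by the algorithms in this paper.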

B. Interpretation of the data
Next, the data extracted from the empirically chosen database is interpreted. Visualizing Figures 1, 2, 3, 4, and 5 by category shows how the data are distributed, which makes it possible to detect whether there is a probability of heart disease.

C. Application of algorithms
After understanding the data and interpreting the information to be generated, the following algorithms will be applied.

K Nearest Neighbors (KNN)
Because the KNN algorithm classifier predicts the class of a given test observation by identifying the observations that are closest to it, the scale of the variables is essential.
Any variable on a large scale will have a much greater effect on the distance between observations, and therefore on the KNN classifier, than variables on a small scale (Sethi et al., 2019; Agarwal & Sagar, 2019; Cao et al., 2019; Manzoor & Singla, 2019).
After the training and test sets have been prepared during preprocessing, the elbow method is used to choose a good value of K. Based on the error rate, K = 13 is chosen, and the model is retrained with this value.
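The scaling and elbow-method steps can be sketched with sklearn as follows. This is a minimal sketch under stated assumptions rather than the code used in this study: the data are a synthetic stand-in generated with `make_classification` (13 features to mirror the 14-attribute subset; the 303 rows, the 30% test split, and the range of K from 1 to 30 are all assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease data (sizes are assumptions).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

error_rates = {}
for k in range(1, 31):
    # The scaler is fitted on the training split only, so every feature
    # contributes on a comparable scale to the distance computation.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    model.fit(X_train, y_train)
    error_rates[k] = 1.0 - model.score(X_test, y_test)

# The "elbow" is usually read off a plot of error rate against K;
# here the K with the lowest error is simply reported.
best_k = min(error_rates, key=error_rates.get)
print(best_k, round(error_rates[best_k], 3))
```

Selecting K on the test split, as in this sketch, tends to give an optimistic error estimate; a separate validation split or cross-validation would be a more cautious choice.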

Decision trees:
The data is divided into a training set and a test set; a single decision tree is then trained with the sklearn library and evaluated.

Random Forest:
The data is preprocessed, and the training and test sets are separated to train the model.

Neural Network:
The sklearn library will be used to preprocess the data to prepare for training.

Support Vector Machines:
The data is preprocessed to apply the algorithm, and the training and test sets are separated; the model is then trained using the sklearn library.
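The split-train-evaluate workflow described for the decision tree, random forest, and support vector machine can be sketched with sklearn as below. This is an illustrative sketch, not the study's code: the data are synthetic (`make_classification` with assumed sizes), the hyperparameters are library defaults except for the 100 estimators of the random forest, and scaling the inputs before the SVM is an assumption about the preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart disease data (sizes are assumptions).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # SVMs are distance-based, so the features are standardized first.
    "svm": make_pipeline(StandardScaler(), SVC()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Using the same split for every model keeps the comparison fair, although the scores on synthetic data say nothing about the results reported for the real dataset.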

RESULTS
After applying the selected supervised learning algorithms to the dataset chosen for comparison, the following algorithm results are obtained.

A. K Nearest Neighbors (KNN)
To evaluate the model, the test data was used to compute the confusion matrix, from which the accuracy, precision, recall, and f1-score metrics can be calculated. Table 1 shows a weighted average of 0.91. Applying the accuracy formula, the sum of the true positives and true negatives divided by the total population, gives an accuracy of 45,614, together with the confusion matrix.
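As a sketch of how these metrics follow from a binary confusion matrix, the function below computes them from the four cell counts. The function name `binary_metrics` and the counts in the example are made up for illustration; they are not the values obtained in this study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only.
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
print({name: round(value, 3) for name, value in m.items()})
```

Accuracy defined this way always lies between 0 and 1, so it can be compared directly with the weighted averages reported in the tables.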

B. Decision Trees
Applying the decision tree, we get the following results. Table 2 shows a weighted average of 0.85, and the confusion matrix is as follows:

C. Random Forest
We evaluate the random forest model on the preprocessed data, trained with 100 estimators. It has a weighted average of 0.81, and the confusion matrix is as follows:

D. Neural Network
The training and test data are separated and the model is trained on the dataset using Keras; the model is then evaluated. Figures 7 and 8 show the models. Table 4 shows a weighted average accuracy of 0.81.
The following confusion matrix and classification report are obtained:

E. Support Vector Machines (SVM)
The model is evaluated on the preprocessed data, yielding the following classification report and confusion matrix. Table 5 shows a weighted average accuracy of 0.85.

CONCLUSION
As observed in the results, the k nearest neighbors model obtained the best results, with an average accuracy of 0.91 on the heart disease dataset. For future work, other types of classification or segmentation can be applied to achieve better predictions on the chosen dataset.

ACKNOWLEDGMENTS
This research was made possible by the need to obtain and generate knowledge from different professionals. The authors wish to thank their university mentors for their support and guidance.