NOVEL FRAMEWORK FOR HANDWRITTEN DIGIT RECOGNITION THROUGH NEURAL NETWORKS

A major challenge for natural language processing systems is to accurately 
identify and classify handwritten characters. Accurate handwritten 
character recognition is challenging even for humans, as style, size 
and other handwriting parameters vary from person to person. Though it is a 
relatively straightforward machine vision task, improved accuracy compared 
to existing implementations is still desirable. This manuscript proposes 
a novel neural network based framework for handwritten character recognition. 
The proposed framework transforms the raw data set into a 
NumPy array to achieve image flattening and feeds the result into a pixel vector 
before passing it to the network. In the neural network, the activation function 
transfers the resultant value to the hidden layer, where the error is 
minimized through the mean squared error and back-propagation 
algorithms before stochastic gradient descent is applied on the resultant mini-batches. 
After a detailed study, an optimal algorithm for effective handwritten character 
recognition is proposed. Initially, the framework has been simulated only on 
digits, using 50,000 training samples, a 10,000-sample validation set and a 
10,000-sample test set, achieving an accuracy of 96.08%. 
This manuscript aims to give the reader insight into how the proposed neural 
network based framework has been applied to handwritten digit recognition. It 
highlights successful applications of the framework while laying out directions 
for possible enhancements.


INTRODUCTION
Many literate humans effortlessly recognize the decimal digit set (0-9); a sample is depicted in Figure 1 below. This natural ability of mankind is due to the seemingly simple human brain. The human brain is a supercomputer in itself, as each of its hemispheres has a visual cortex containing about 140 million neurons with tens of billions of connections between them. This supercomputer, tuned by evolution over hundreds of millions of years on this earth, is superbly adapted to understand this complex, colourful visual world.
This masterpiece, the human brain, can solve a tough problem such as recognizing any entity in the world in a fraction of a second. The difficulty bubbles up when we attempt to automate the same task by writing a computer program and applying computer vision for character/digit recognition (Bottou, et al., 1994; Nielsen, 2018). Recognizing digits, in particular, looks like a simple task, as the input is plain black-and-white pixels with only 10 well-defined outputs. However, accurate recognition of handwritten shapes in different styles, fonts, etc. is a complex task in itself. A simple 6 has a loop on the bottom and a vertical or curved stroke on top, can be written in varied styles, and is difficult to express algorithmically to a machine that, after every power-off, is effectively a newborn (Bottou, et al., 1994; Nielsen, 2018; Hegen, Demuth, & Beale, 1996). Neural networks solve the above problem in a much simpler way (Bottou, et al., 1994; Nielsen, 2018; Hegen, et al., 1996; Widrow, Rumelhart, & Lehr, 1994; Mishra & Singh, 2016). The strategy is to take a huge data set of black-and-white handwritten digits from real people and build a neural network that trains on those data sets and learns to recognize the digits (Shamim, Miah, Sarker, Rana, & Jobair, 2018; Patel, Patel, & Patel, 2011; Ganea, Brezovan, & Ganea, 2005). In neural networks, each node performs some simple computation on its input and conveys a signal to the next node through a connection having an associated weight and bias, which amplifies or diminishes the signal (Nielsen, 2018; Shamim, et al., 2018). Different choices of weights and biases result in different functions evaluated by the network, so an appropriate learning algorithm must be used to determine their optimal values (Widrow, et al., 1994; Knerr, Personnaz, & Dreyfus, 1992).

The Processing Units
The processing units in a neural network are the smallest units, just like neurons in a brain. These nodes work in a similar fashion and operate simultaneously; there is no master procedure to coordinate them all (Cardoso & Wichert, 2013). Each unit computes a scalar function of its input and broadcasts the result to the units connected to its output. The result is called the activation value, and the scalar function is called the activation function (Widrow, et al., 1994).
There are three types of units:
• The input unit, which receives data from the environment as input.
• The hidden unit, which transforms the internal data of the network and broadcasts it to the next set of units.
• The output unit, which represents a decision as the output of the whole system (Widrow, et al., 1994).

The Connections
The connections are essential to determine the topology of a neural network.
• Layered networks for pattern association.
• Modular networks for building complex systems.
The topology used in this paper for the proposed system is layered networks (Widrow, et al., 1994).

A Computing Procedure
Computation begins by feeding input vectors to the processing units of the input layer (Sakshica & Gupta, 2015). The activation values of the remaining units are then computed synchronously or asynchronously. In a layered network, this is done by the feedforward propagation method. The activation functions are mathematical functions; the most common is a sigmoidal function (Mishra & Singh, 2016).
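The feedforward step described above can be sketched in a few lines of NumPy. This is an illustrative example, not the paper's actual code; the layer sizes and values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Sigmoidal activation: squashes each component of z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_activation(w, x, b):
    """One feedforward step: weighted sum plus bias, then the activation."""
    return sigmoid(w @ x + b)

x = np.array([0.5, -0.2, 0.1])          # inputs from the previous layer
w = np.array([[0.4, 0.3, -0.6],
              [0.2, -0.1, 0.5]])        # 2 units, 3 inputs each
b = np.array([0.0, 0.1])
a = layer_activation(w, x, b)           # activation values of this layer
print(a.shape)  # (2,)
```

Every activation value lands strictly between 0 and 1, which is why the sigmoid is a convenient choice for the layered networks used here.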

A Training Procedure
Training a network implies adapting its connections according to the input environment so that the network exhibits optimized computational behaviour for all input patterns (Arel, Rose & Karnowski, 2010). The process used in this paper modifies the weights and biases with respect to the desired output. The error cost is calculated using the mean squared error method (Mishra & Singh, 2016).
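The mean squared error cost used for training can be sketched as follows; this is an illustrative implementation of the standard quadratic cost, with made-up outputs and targets.

```python
import numpy as np

def mse_cost(outputs, targets):
    """Quadratic (mean squared error) cost over n training examples:
    C = 1/(2n) * sum over examples of ||y(x) - a(x)||^2."""
    n = len(outputs)
    return sum(np.sum((y - a) ** 2) for a, y in zip(outputs, targets)) / (2 * n)

# Two toy examples: network outputs vs. desired (one-hot style) targets.
outputs = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(mse_cost(outputs, targets))  # 0.025
```

A cost near zero means the obtained outputs are close to the ideal outputs, which is exactly the condition training drives toward.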

DATA COLLECTION
A handwritten digit data set is vague in nature because the digits are not always sharp, straight lines of pixels. The main goal of feature extraction in digit recognition is to remove this ambiguity from the data (Bottou, et al., 1994; Cireşan, Meier, Gambardella & Schmidhuber, 2010). It deals with extracting the essential information from normalized images of isolated digits that serve as raw data in the form of vectors (Cireşan, et al., 2010). The numbers in the images can be of different sizes, styles and orientations (Patel, et al., 2011). In this study, a subset of the MNIST data set is used, containing tens of thousands of scanned images of handwritten digits from 250 people. This data is divided into three parts: the first part contains 50,000 images to be used as training data, the second part contains 10,000 images to be used as testing data, and the third part contains 10,000 images for validation data. The images are 28x28-pixel grayscale images. The training, validation and testing sets are kept distinct so that the neural network learns from the training set, validates the results on the validation set and generates output from the test set (Liu, Nakashima, Sako & Fujisawa, 2003; LeCun, et al., 1995). These are examples of MNIST digits collected in different handwritings; for example, a digit 2 can be written in different orientations, with or without a loop at the bottom, or with a straight or curved stroke at the bottom.
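The flattening and splitting described above can be sketched as follows. This is a scaled-down stand-in with random arrays in place of real MNIST scans (700 images split 500/100/100, mirroring the paper's 50,000/10,000/10,000 split); the variable names are illustrative.

```python
import numpy as np

# Each 28x28 grayscale scan is flattened into a 784-component pixel
# vector, the form the input layer of the network expects.
images = np.random.rand(700, 28, 28)          # stand-in for MNIST scans
vectors = images.reshape(len(images), 784)    # image flattening

# Distinct training / validation / test partitions.
train_set = vectors[:500]
valid_set = vectors[500:600]
test_set = vectors[600:]
print(train_set.shape, valid_set.shape, test_set.shape)
```

Keeping the three partitions disjoint is what lets the validation and test accuracies measure generalization rather than memorization.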

PROPOSED DETECTION ALGORITHM
We now discuss the feed-forward neural network applied in this work to achieve the highest possible accuracy in handwritten digit recognition (LeCun, et al., 1989). The steps of the proposed framework used in this study are elaborated above; the next section details the simulation. Figure 3 describes the architecture of the proposed neural network. It consists of an input layer, a hidden layer and an output layer. Each layer contains a number of neurons represented by a sigmoid function, so the output of each neuron lies in the range [0,1]. Every neuron's output is determined by the weighted sum Σ_j w_j x_j, where w_j is the weight of the jth input x_j; the sum of this weighted sum and the bias value determines the output value. The input layer consists of 784 neurons, one per pixel, with the intensity of each pixel forming the input to the neurons of the first layer. The activation function is applied to that input, and the resulting activation value is passed to every neuron in the next layer, where the same procedure is repeated. Since each digit lies between 0 and 9, the output layer consists of 10 neurons, represented by a matrix (Lauer, Suen & Bloch, 2007); the neuron with the highest activation value gives the result. A drawback of the sigmoid function is that for negative input values (the second quadrant of its graph) the neural network becomes slow. Figure 5 represents the output matrix, where each row and column represents the output of a node in the output layer.
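A complete forward pass through the architecture just described might look like the sketch below. The 65 hidden neurons match the experiments reported later; the weights here are random and untrained, so the predicted digit is meaningless and serves only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 784-65-10 network: untrained random weights and biases.
w1, b1 = rng.standard_normal((65, 784)), rng.standard_normal(65)
w2, b2 = rng.standard_normal((10, 65)), rng.standard_normal(10)

x = rng.random(784)                 # one flattened 28x28 image
hidden = sigmoid(w1 @ x + b1)       # activations of the 65 hidden neurons
output = sigmoid(w2 @ hidden + b2)  # 10 output neurons, one per digit

digit = int(np.argmax(output))      # neuron with the highest activation wins
print(digit)
```

The `argmax` over the output layer is the decision rule from the text: the neuron with the highest activation value is taken as the recognized digit.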

B. Gradient descent back propagation algorithm (LeCun, et al., 1990)
This algorithm helps in finding weights and biases such that the output of the network approximates the ideal output y(x) for every training input x. This is done through the cost function. The cost function is a measure of how wrong the model is in terms of its ability to estimate the relationship between weights and biases (Widrow, et al., 1994; Patel, et al., 2011; Ganea, et al., 2005; Lee, 1991). The quadratic cost (equation 2), also called the mean squared error, is

C(w,b) = (1/2n) Σ_x ||y(x) - a||²

where n is the number of training inputs and a is the network's output for input x. When C(w,b) is approximately 0, the obtained output is approximately equal to the ideal output, so the algorithm is doing a good job. The algorithm therefore minimizes the cost function, which can be represented by a graph (Shamim, Miah, Sarker, Rana & Jobair, 2018; LeCun, et al., 1989). The minimum cost is found by partially differentiating the cost with respect to the weights and biases, which gives the two gradients ∂C/∂w and ∂C/∂b of the cost with respect to the weights and biases (Bottou, et al., 1994) (equations 3 and 4).

C. Stochastic Gradient Descent
Stochastic gradient descent speeds up the gradient descent algorithm by dividing the training input into mini-batches, each of size m.
Step 1: Initialize the parameters. The cost function of this algorithm depends on two sets of parameters: the weights and the biases. Random values are therefore assigned to the weight and bias vectors for each neuron input in each layer. The weights and biases are not initialized to 0, because that would be equivalent to deleting the connections between the nodes.
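Step 1 can be sketched as follows; the layer sizes match this paper's 784-65-10 network, and the Gaussian initialization is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [784, 65, 10]  # input, hidden, output layer sizes

# Random (not zero) initialization of weights and biases. All-zero
# weights would make every neuron in a layer compute the same thing,
# which is as good as deleting the connections between the nodes.
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(n_out) for n_out in sizes[1:]]

print([w.shape for w in weights])  # [(65, 784), (10, 65)]
```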
Step 2: Feed-forward algorithm. In this algorithm, the output from one layer is the input to the next layer; there are no loops, and information is fed forward, not backward. This is achieved by equation 5.

Equation 5: Feed-forward equation

a' = σ(w·a + b)

Here, a is the vector of activations of one layer of neurons. To obtain the activations a' of the next layer, a is multiplied by the weight matrix w, the bias vector b is added, and the sigmoid function σ is applied element-wise to the vector w·a + b. This is how inputs are fed forward; applying it layer by layer gives the following result.

Equation 6: Output equation

a^l = σ(w^l · a^(l-1) + b^l), for each layer l
Step 3: Calculate the gradient. The gradients for the output and hidden layers are obtained by updating the weights and biases. This is done by shuffling the training data, dividing it into mini-batches, and then updating the weights and biases for each mini-batch.
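The shuffle-and-split part of this step can be sketched as follows, on toy data (100 samples, mini-batch size m = 10; the real experiments use 50,000 samples and the same batch size).

```python
import numpy as np

rng = np.random.default_rng(2)

# Shuffle the training data, then cut it into mini-batches of size m.
training_data = list(range(100))   # stand-ins for (image, label) pairs
rng.shuffle(training_data)

m = 10
mini_batches = [training_data[k:k + m]
                for k in range(0, len(training_data), m)]
print(len(mini_batches), len(mini_batches[0]))  # 10 10
```

Shuffling before each pass ensures every mini-batch is a roughly unbiased sample of the training set, which is what makes the mini-batch gradient a usable estimate of the full gradient.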
Step 4: Update the weights. In the beginning, the weights and biases hold random values, so the model's output is far from the ideal output, giving a large error. To reduce this error, the cost function is minimized to obtain the desired values of the weights and biases; this is how back-propagation works. The update rule of equation 7 is used:

w → w - (η/m) Σ_j ∂C_j/∂w,    b → b - (η/m) Σ_j ∂C_j/∂b

where η is the learning rate and the sums run over the m training inputs of the current mini-batch.
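The mini-batch update rule of step 4 can be sketched directly; the gradient values below are made up, since computing real gradients requires the full back-propagation pass.

```python
import numpy as np

def update_mini_batch(w, b, grads_w, grads_b, eta):
    """Apply w -> w - (eta/m) * sum of per-example weight gradients,
    and likewise for b, where m is the mini-batch size."""
    m = len(grads_w)
    w = w - (eta / m) * sum(grads_w)
    b = b - (eta / m) * sum(grads_b)
    return w, b

w, b = np.array([1.0, 2.0]), np.array([0.5])
grads_w = [np.array([0.2, -0.4]), np.array([0.0, 0.2])]   # toy gradients
grads_b = [np.array([0.1]), np.array([0.3])]
w2, b2 = update_mini_batch(w, b, grads_w, grads_b, eta=3.0)
print(w2, b2)  # [0.7 2.3] [-0.1]
```

The learning rate η = 3.0 here matches the value used in the experiments; dividing by m averages the per-example gradients over the mini-batch.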

EXPERIMENTAL RESULTS
The number of neurons in the hidden layer and the learning rate are called the hyper-parameters of the neural network (Gattal, et al., 2016). The number of neurons is therefore changed and the network retrained repeatedly until the highest accuracy is achieved. Table 1 shows the analysis of the data when the network is trained with different numbers of neurons in the hidden layer and the accuracy each achieved. Figure 6 shows the graph of accuracy versus the number of hidden neurons. With 65 hidden neurons, a mini-batch size of 10 and a learning rate of 3, the highest accuracy achieved is 96.30%. The learning rate describes how quickly or slowly the network learns. A low learning rate slows down the learning process but converges smoothly; a larger learning rate speeds up the learning but may fail to converge. Hence a decaying learning rate is desirable. Table 2 shows the analysis of different learning rates and their accuracy. The learning rate should be neither too low nor too high: if it is too low, the neural network takes a long time to train, and if it is too high, the network quickly forgets its previous training and adapts to new training faster; neither is useful (Lee, 1991).
This research provides experimental support for the effectiveness and efficiency of neural networks in recognizing handwritten digits. The main objective of this paper is to show the highest possible accuracy achievable for handwritten digit recognition using the feed-forward and back-propagation techniques in neural networks. Research is still ongoing in this field in many forms of the LeNet architecture, such as LeNet-1, LeNet-4 and Boosted LeNet-4; the main aim of this paper is to implement one of these various methods. The broader goal of this branch of artificial intelligence is to develop a better network for all kinds of data sets with better performance.

FUTURE WORK
Neural networks have been applied to many applications such as character recognition, signature recognition, and leaf decoding. More training samples yield higher network accuracy. Changing the model from a feed-forward to a convolutional network can make the process faster and improve accuracy, although different activation functions should also be tried. The dropout technique, which randomly turns off parts of the network's layers, increases regularization and decreases overfitting; it is useful when the accuracy on training data is much higher than on testing data. The framework can also be extended to coloured digits and used to read the codes on postcards for sorting letters. The proposed framework in this manuscript can further be applied to well-defined digit sets to validate them accurately; for example, it can be applied to zip-code digits, university enrollment numbers or mobile numbers of a given set in a given setting to automate their machine recognition or identification.