DEEP ARCHITECTURES FOR HUMAN ACTIVITY RECOGNITION USING SENSORS

Human activity recognition (HAR) has become a prominent research field in recent years owing to applications such as physical fitness monitoring, assisted living, elderly care, and biometric authentication. The ubiquitous nature of sensors makes them a good choice for activity recognition. The latest smart gadgets are equipped with many of the sensors used in wearables, e.g. accelerometer, gyroscope, GPS, compass, camera, and microphone. These sensors measure various aspects of an object, are easy to use, and are inexpensive. The use of sensors in HAR opens new avenues for machine learning (ML) researchers to recognize human activities accurately. Deep learning (DL) is becoming popular among HAR researchers because it often outperforms conventional ML techniques. In this paper, we review recent research studies on deep models for sensor-based human activity recognition. The aim of this article is to identify recent trends and challenges in HAR.


INTRODUCTION
Recent years have shown significant progress in the use of smart gadgets and sensor-enabled devices. The reduced cost and ease of use of these devices make them a natural choice for human activity recognition (HAR). HAR is a trending research field with applications including smart homes, sports, health monitoring, emergency services, and lifelogging (Chan, Estève, Fourniols, Escriba & Campo, 2012; Lara & Labrador, 2013). Initially, activity recognition was carried out successfully using video recordings, but video-based systems are location specific and interfere to some extent with one's personal life. For this reason, sensor-based activity recognition is gaining widespread acceptance. In sensor-based HAR systems, low-cost wearable sensors are deployed, which reduces interference in daily activities. Another recent development in HAR is the use of smartphones, as the latest phones are equipped with many sensors. The unobtrusive nature of smartphones makes them appropriate for HAR.
Activity recognition systems most often use classification algorithms to assign class labels to activities. As with other time-series data, the first step in sensor-based HAR is to segment the data into time frames and then to extract time- and frequency-domain features from those segments. In conventional machine learning, feature extraction is often done manually using heuristic methods; in contrast, deep learning provides automatic feature extraction. It also helps in mining complex knowledge from massive amounts of unlabeled data. Plötz, Hammerla, and Olivier (2011) used deep learning for the first time for feature extraction and compared the results with principal component analysis. After that, a number of researchers worked on deep learning approaches for automatic feature extraction in human activity recognition (Twomey, et al., 2018; Ronao & Cho, 2015; Alsheikh, et al., 2016). The main contribution of this research is to review the latest trends in human activity recognition using deep architectures. This paper reviews and analyses recent research articles on deep learning based HAR using sensors.
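The segmentation and hand-crafted feature extraction step described above can be sketched as follows. This is only an illustrative sketch on synthetic data: the window length, overlap, sampling rate, and the particular features (mean, standard deviation, dominant frequency, spectral energy) are assumptions chosen for the example, not values taken from the reviewed studies.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Split a (n_samples, n_channels) array into overlapping windows."""
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

def extract_features(window, sampling_rate):
    """Return simple time- and frequency-domain features for each channel."""
    mean = window.mean(axis=0)
    std = window.std(axis=0)
    # Magnitude spectrum of the mean-removed window (real-input FFT).
    spectrum = np.abs(np.fft.rfft(window - mean, axis=0))
    freqs = np.fft.rfftfreq(window.shape[0], d=1.0 / sampling_rate)
    dominant_freq = freqs[spectrum.argmax(axis=0)]
    energy = (spectrum ** 2).sum(axis=0) / window.shape[0]
    return np.concatenate([mean, std, dominant_freq, energy])

# Synthetic 3-axis signal: 500 samples at 50 Hz with a 2 Hz component
# on two axes and a constant third axis (all values are made up).
sampling_rate = 50
t = np.arange(500) / sampling_rate
signal = np.stack([np.sin(2 * np.pi * 2 * t),
                   np.cos(2 * np.pi * 2 * t),
                   np.ones_like(t)], axis=1)

windows = sliding_windows(signal, window_size=128, step=64)  # 50% overlap
features = np.stack([extract_features(w, sampling_rate) for w in windows])
print(windows.shape, features.shape)  # (6, 128, 3) (6, 12)
```

Each row of `features` could then be passed to a conventional classifier; a deep model would instead consume the raw `windows` array and learn its own features.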
The rest of the paper is organized as follows: Section 2 discusses some of the deep learning architectures, and Section 3 describes some publicly available datasets used for HAR. Section 4 presents recent studies on deep learning based HAR. Section 5 presents research challenges in the activity recognition field. Finally, Section 6 concludes the article.

DEEP LEARNING ARCHITECTURES
Deep learning (DL) is an established field of research that has been applied successfully to image and voice recognition problems. Generally, deep models fall into three categories: generative, discriminative and hybrid models (Deng & Jaitly, 2016). A generative model learns the true distribution of the training data and introduces variations to generate new samples that follow the same probability distribution. Generative DL methods include the Restricted Boltzmann Machine (RBM), Deep Autoencoders, and Sparse Coding. A discriminative method directly estimates the probability of the output given an input, i.e. p(y|x), by approximating the posterior distribution over classes. The most commonly used discriminative models in activity recognition are the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN) (McDaniel & Quinn, 2018). Many research studies have combined discriminative and generative methods to extract more effective features. The combination of generative and discriminative models is known as a hybrid model. In most studies, CNN is used along with other generative or discriminative methods for HAR. This section explores some of the deep learning models used in sensor-based HAR.

CONVOLUTIONAL NEURAL NETWORK
The Convolutional Neural Network (CNN) learns internal representations of raw sensor data without domain expertise in feature engineering (Ronao & Cho, 2015). For this reason, it is the most widely used method for data analysis and activity recognition. In a CNN, the convolution operation is applied to sensor data through many hidden layers. The components of a CNN include the convolutional layer, pooling layer, dense (fully connected) layer and softmax layer (Ignatov, 2018). A convolutional layer detects distinct features in its input by performing the convolution operation on the data. The first convolutional layer identifies low-level features, whereas subsequent convolutional layers detect higher-level features (Namatēvs, 2017). The convolutional layers introduce nonlinearities to the model through activation functions such as tanh, sigmoid and the rectified linear unit (ReLU) (Albelwi & Mahmood, 2017). The pooling layer downsamples the feature map. It compresses features and reduces the network's computational complexity (Affonso, Rossi, Vieira & Ferreira, 2017). The most frequently used pooling algorithm is max pooling, which is robust to small changes (Kautz, et al., 2017). The last component of a CNN is the dense, or fully connected, layers. These layers are combined with a softmax classifier to perform classification on the extracted features. So far, CNN is the most widely used deep model in activity recognition and feature learning. Zeng, et al. (2014) proposed a CNN-based approach for HAR which automatically extracts discriminative patterns and captures local dependencies of a sensor signal. They applied a partial weight sharing method to accelerometer data to improve performance. Yang, Nguyen, San, Li, and Krishnaswamy (2015) also presented a CNN model for multichannel time-series data for HAR. The convolution and pooling layers of the proposed model capture the salient features, which are systematically unified among multiple channels and then mapped into activity classes.
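To make the sequence of components described above concrete (convolution, activation, pooling, dense layer, softmax), the following is a minimal forward-pass sketch in NumPy. It is not the architecture of any cited study: the weights are random and untrained, and the filter width, number of filters, pooling size, and the six output classes are assumptions made for illustration.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1D convolution (cross-correlation, as in most DL libraries).
    x: (length, in_channels); kernels: (width, in_channels, out_channels)."""
    width = kernels.shape[0]
    out_len = x.shape[0] - width + 1
    out = np.empty((out_len, kernels.shape[2]))
    for i in range(out_len):
        # Contract the (width, in_channels) patch against every filter.
        out[i] = np.tensordot(x[i:i + width], kernels, axes=([0, 1], [0, 1])) + bias
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the time axis."""
    n = x.shape[0] // size
    return x[:n * size].reshape(n, size, x.shape[1]).max(axis=1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
window = rng.normal(size=(128, 3))              # one window, 3 sensor channels
kernels = rng.normal(size=(5, 3, 8)) * 0.1      # 8 filters of width 5 (assumed)
dense_w = rng.normal(size=(62 * 8, 6)) * 0.01   # 6 hypothetical activity classes

feature_map = relu(conv1d(window, kernels, np.zeros(8)))  # shape (124, 8)
pooled = max_pool1d(feature_map, size=2)                  # shape (62, 8)
probs = softmax(pooled.reshape(-1) @ dense_w)             # shape (6,)
print(feature_map.shape, pooled.shape, probs.shape)
```

In practice the kernels and dense weights are learned by backpropagation, and several convolution and pooling stages are stacked so that later layers see higher-level features.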

RESTRICTED BOLTZMANN MACHINE
Restricted Boltzmann Machine (RBM) is a stochastic generative model, which learns a probability distribution over its input dataset using a layer of binary hidden units. Meaningful features are automatically extracted from labelled and unlabelled input data. It is most commonly used for dimensionality reduction and complex feature learning problems. An RBM on its own is a shallow, two-layer neural network that learns to reconstruct its input in an unsupervised manner. Two deep architectures are built from RBMs: the Deep Belief Network (DBN) and the Deep Boltzmann Machine (DBM).
The concept of deep belief networks was first introduced by Hinton, Osindero, and Teh (2006) as an alternative to training deep networks by backpropagation alone. In terms of network structure, a DBN is very similar to a multilayer perceptron, but their training processes are entirely different. In fact, the difference in training method is the key factor that enables a DBN to outperform its shallow counterpart. A deep belief network consists of multiple hidden layers. Adjacent layers are connected to each other, but the units within a layer are not; restricting the connectivity in this way makes learning easier. A DBN can be divided into two major parts. The first consists of multiple layers of RBMs that pre-train the network, while the second is a feed-forward backpropagation network that further refines the results from the RBM stack. Alsheikh, et al. (2016) proposed a DBN-based model trained by greedy layer-wise training of RBMs. The proposed model provides better recognition accuracy of human activities and avoids the expensive design of handcrafted features. Bhattacharya and Lane (2016) used an RBM-based pipeline for activity recognition and showed that their approach outperforms other modeling alternatives.
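The text above does not name the rule used to train each RBM. As an assumption, the sketch below uses one-step contrastive divergence (CD-1), a common choice for RBM training, on a toy binary dataset. The layer sizes, learning rate, and data are made up for illustration and do not reproduce any cited model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b_vis, b_hid, lr, rng):
    """One CD-1 update (in place) on a batch of binary vectors v0 (batch, n_vis)."""
    # Positive phase: hidden probabilities and binary samples given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    p_h1 = sigmoid(p_v1 @ W + b_hid)
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_vis += lr * (v0 - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return np.mean((v0 - p_v1) ** 2)  # reconstruction error, for monitoring only

rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)       # toy batch of 20 binary vectors
n_vis, n_hid = data.shape[1], 4              # illustrative layer sizes
W = rng.normal(size=(n_vis, n_hid)) * 0.01
b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hid)

errors = [cd1_step(data, W, b_vis, b_hid, lr=0.1, rng=rng) for _ in range(500)]
print(errors[0], np.mean(errors[-10:]))
```

In greedy layer-wise pre-training, the hidden probabilities of one trained RBM would serve as the visible data for the next RBM in the stack, before the whole network is fine-tuned with backpropagation.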

AUTOENCODERS
Autoencoders are neural networks that perform learned data compression. An autoencoder learns a compressed, distributed representation of the input data for dimensionality reduction (Nweke, Teh, Al-Garadi & Alo, 2018). It is trained with backpropagation, with the target output values set equal to the input. Principal Component Analysis (PCA) does the same for linear functions, whereas an autoencoder can perform non-linear transformations. An autoencoder also provides a representation at the output of each layer, and having multiple representations of different dimensions is useful. An autoencoder can also use pre-trained layers from other models, applying transfer learning to prime the encoder or the decoder. An autoencoder has three components: the encoder, the code, and the decoder. The encoder compresses the input data into a latent space representation. The code is the compressed input that is fed to the decoder. The decoder reconstructs the input from the latent space representation. Variations of the autoencoder include the Sparse Autoencoder (SAE) and the Denoising Autoencoder (DAE) (Nweke, et al., 2018). Almaslukh, AlMuhtadi and Artoli (2017) proposed a stacked autoencoder based model for better recognition accuracy along with reduced recognition time.
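The encoder, code, and decoder structure, and the idea of training against the input itself, can be sketched with a very small autoencoder in NumPy. This is an illustrative sketch under assumed settings (synthetic data, a two-unit code, a tanh encoder, a linear decoder, plain gradient descent); it is not the stacked model of any cited study.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data lying on a 2-D subspace of a 6-D space, then standardized.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)

n_in, n_code = X.shape[1], 2                  # code size is an assumption
W_enc = rng.normal(size=(n_in, n_code)) * 0.1
W_dec = rng.normal(size=(n_code, n_in)) * 0.1
lr = 0.02

def forward(X):
    code = np.tanh(X @ W_enc)      # encoder: input -> latent code
    return code, code @ W_dec      # decoder: code -> reconstruction

losses = []
for _ in range(1000):
    code, X_hat = forward(X)
    err = X_hat - X                # the target output is the input itself
    losses.append(np.mean(err ** 2))
    # Backpropagate the squared reconstruction error
    # (gradients are correct up to a constant factor absorbed by lr).
    grad_dec = code.T @ err / len(X)
    grad_code = (err @ W_dec.T) * (1.0 - code ** 2)
    grad_enc = X.T @ grad_code / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

print(losses[0], losses[-1])
```

After training, `np.tanh(X @ W_enc)` gives the compressed two-dimensional representation, which could be used as input features for an activity classifier.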

RECURRENT NEURAL NETWORK
Recurrent Neural Network (RNN) is a deep model with cyclic connections, which enable it to capture temporal correlations in time series data. RNNs have been used successfully in handwriting recognition and speech recognition applications (Wang, Chen, Hao, Peng & Hu, 2018). An RNN is a network with a loop in it, allowing information to persist. The iterative nature of an RNN enables data to be passed from one step of the network to the next. An RNN can be viewed as multiple copies of the same network, each passing information to its successor. The RNN is a flexible and powerful network which does not require additional data labelling and works well for modelling short-term memory. This makes it a good choice for sequence learning or time-related problems, where the output of one time step acts as an input to the next. Two widely used variations of recurrent neural networks are Long Short Term Memory (LSTM) and the Gated Recurrent Unit (GRU). Such networks make use of gates and memory cells to store time series sequences (Graves, 2013). Murad and Pyun (2017) used unidirectional, bidirectional and cascaded deep RNNs on five public datasets. They proposed three novel LSTM-based deep RNN architectures which extract discriminative features using deep layers and provide performance improvements. M. Inoue, S. Inoue and Nishida (2018) proposed an RNN-based approach to provide better recognition accuracy with reduced recognition time.
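The loop that lets information persist can be illustrated with the forward pass of a simple (Elman-style) recurrent layer. This is a minimal sketch with random, untrained weights and assumed sizes; LSTM and GRU cells add gates and memory cells on top of this basic recurrence and are not shown here.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Run a simple RNN over x_seq of shape (time, n_features).
    Returns the hidden state at every time step, shape (time, n_hidden)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:
        # The same weights are reused at every step; h carries information forward.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_features, n_hidden = 3, 8                     # illustrative sizes
W_xh = rng.normal(size=(n_features, n_hidden)) * 0.5
W_hh = rng.normal(size=(n_hidden, n_hidden)) * 0.5
b_h = np.zeros(n_hidden)

x_seq = rng.normal(size=(5, n_features))        # a short synthetic sequence
states = rnn_forward(x_seq, W_xh, W_hh, b_h)

# Changing only the first input changes the final state,
# which shows that earlier inputs influence later steps.
x_changed = x_seq.copy()
x_changed[0] += 1.0
states_changed = rnn_forward(x_changed, W_xh, W_hh, b_h)
print(states.shape, np.abs(states[-1] - states_changed[-1]).max())
```

For classification, the final hidden state (or a pooled summary of all states) would typically be passed to a dense layer with a softmax output.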

HYBRID MODELS
Hybrid models are a combination of generative and discriminative models. Many researchers have implemented hybrid models in activity recognition as well as in other fields. For instance, Murahari and Plötz (2018) used a deep convolutional LSTM model to explore the temporal context in activity recognition. Lee, Grosse, Ranganath, and Ng (2009) proposed a convolutional deep belief network that used the probabilistic max-pooling technique for visual recognition tasks. In some studies, RNN and CNN are combined, where the CNN captures spatial relationships and the RNN captures temporal relationships. Ordóñez and Roggen (2016) presented a deep convolutional LSTM recurrent neural network for multimodal wearable sensors. The deep CNN is used for automated feature extraction and the LSTM recurrent units capture the temporal dynamics of activities. Yao, Zhao, Hu, and Abdelzaher (2018) also introduced a CNN and RNN based framework which includes a self-attention module for estimating input quality by exploiting its temporal dependencies. More research on these models is expected in the future. Figure 1 shows a pie chart of the percentage of deep learning methods used in activity recognition. These percentages represent only the deep models used in the studies presented in this article.
The most frequently used deep model here is CNN, at 40%, which is due to the success of CNN in the image processing field. CNN also performs very well in sensor-based HAR due to its discriminative feature extraction capabilities. Other models also perform well in activity recognition and are gaining popularity.

DATASETS
Validating a new human activity recognition approach on a new or self-created dataset is a challenging task. The effectiveness of such approaches can be established by testing them on standard datasets on which other researchers have already reported results. This section gives a brief description of some publicly available benchmark datasets which have made a remarkable contribution to HAR research. All these datasets are sensor based and are summarized in Table 1.

DISCUSSION
Although conventional machine learning algorithms have shown remarkable performance in recognizing human activities, they require domain expertise to develop robust features for high-dimensional, complex real-world data, which is a time-consuming and expensive task. This has drawn researchers towards deep learning. In deep architectures, layers of feature representations are stacked together to extract more complex features from the data. Recent studies have shown substantial performance improvements from deep learning in HAR. Feature extraction plays a significant role in the recognition process, as it extracts features from sensor data, which helps reduce computational complexity and improves classification accuracy (Abidine, Fergani, Fergani & Oussalah, 2018). Conventional approaches use hand-crafted feature engineering, whereas in deep learning features are automatically learned by the deep network. Another challenge is that most ML algorithms require a good amount of labelled data for model training, but the data in real-time applications is mostly unlabeled. Deep learning works well with unlabeled data too (Almaslukh, et al., 2017). Table 2 provides recent research on deep learning based human activity recognition models.

Table 2. Recent research on deep learning based human activity recognition models.
Reference | Datasets | Deep model | Application | Description
(Murahari, et al., 2018) | OPPORTUNITY, PAMAP2, Skoda | DeepConvLSTM | ADL | An attention model is proposed as a data-driven approach to explore temporal context in activity recognition. Attention layers are added to the DeepConvLSTM model.
 | | | | A semi-supervised model using a DeepLSTM based approach with temporal ensembling for activity recognition using inertial sensors.
(Xi, et al., 2018) | OPPORTUNITY, PAMAP2 | CNN, RNN | ADL | They used dilated convolutional layers to automatically extract inter-sensor and intra-sensor features. They also proposed a novel dilated SRU (Simple Recurrent Unit) approach to capture the latent time dependencies among features.
(Ignatov, 2018) | WISDM, UCI HAR | CNN | ADL | A CNN-based approach to provide user-independent human activity recognition with small recognition intervals (1 s) and almost no preprocessing or feature engineering required.
(Inoue, et al., 2018) | HASC corpus, UCI HAR | RNN | ADL | They used an RNN-based approach to provide better recognition accuracy with reduced recognition time.
(McDaniel & Quinn, 2018) | UCI HAR | LSTM | ADL | They proposed an LSTM-based pipeline which can directly process raw data without extensive preprocessing and gives outstanding performance.

RESEARCH CHALLENGES
Human activity recognition is a trending research field with many challenges that need to be addressed. Although HAR is well researched, these challenges need further investigation for the effective realization of HAR systems. They include the following:
Sensor placement: The position of the sensor plays an important role in recognition accuracy. Placement positions include the right or left arm, ankle, foot, hip and chest. Sensor signal readings vary across positions for the same activity.
Sensor modalities: Sensor modalities can be classified into wearable sensors, ambient sensors and object sensors. Most HAR systems successfully use wearable sensors, but only a few research studies combine these modalities to improve recognition accuracy and to infer high-level activities, such as having coffee, detected with an RFID tag on the cup.
Compatibility with real-world data: Real-world data often differs from laboratory data collected in a constrained environment. Most real-world data arrives in streams and is unlabeled; therefore HAR systems should be robust enough for real-world scenarios.
Context awareness: As HAR systems are designed to analyse a user's activity and behavior, the system must be aware of the user's behavior, age, gender, physical condition and environment. For example, the running signal of a 75-year-old patient might be equivalent to the walking signal of a young user. In such situations context information is vital.
Overlapping activities: Most HAR systems recognize a single activity at a time, such as walking, standing, sitting or brushing teeth, but activities can overlap, for example having coffee while watching TV or walking while drinking water. Little research has been done in this direction, so good research opportunities remain.
Hyper-parameter setting: The accuracy of deep models relies heavily on the adjustment of network hyper-parameters such as learning rate, dropout, filter size, kernel reuse, number of units and layers, and regularization. In most research, these parameters are set using heuristic methods (Liu, et al., 2017), so there is a need to use optimization algorithms to adjust them.

Sensor fusion: In sensor-based HAR systems, it is crucial to choose which sensors should be fused together to improve the recognition process. Münzner, et al. (2017) presented four data fusion techniques: early fusion, sensor-based late fusion, channel-based late fusion and shared filters hybrid fusion. Chowdhury, Tjondronegoro, Chandran and Trost (2017) also presented a fusion technique, namely posterior adapted class based fusion.
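The fusion techniques named above are not defined in this article. The sketch below assumes only the generic meanings of the terms: early fusion combines the sensor inputs before a single model, and late fusion combines the outputs of separate per-sensor models. It does not reproduce the specific variants of Münzner, et al. (2017) or Chowdhury, et al. (2017). The feature sizes and the random, untrained linear classifier are hypothetical placeholders.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def linear_classifier(features, weights):
    """Placeholder for any trained model: features -> class probabilities."""
    return softmax(features @ weights)

rng = np.random.default_rng(0)
n_classes = 4
acc = rng.normal(size=(10, 6))    # 10 windows, 6 accelerometer features (made up)
gyro = rng.normal(size=(10, 6))   # 10 windows, 6 gyroscope features (made up)

# Early fusion: concatenate the sensor features, then apply a single model.
early_in = np.concatenate([acc, gyro], axis=1)
early_probs = linear_classifier(early_in, rng.normal(size=(12, n_classes)))

# Late fusion: one model per sensor, then average the class probabilities.
acc_probs = linear_classifier(acc, rng.normal(size=(6, n_classes)))
gyro_probs = linear_classifier(gyro, rng.normal(size=(6, n_classes)))
late_probs = (acc_probs + gyro_probs) / 2.0

print(early_probs.shape, late_probs.shape)  # (10, 4) (10, 4)
```

Averaging is only one way to combine per-sensor outputs; weighted or learned combinations are also possible, and the best choice depends on which sensors carry complementary information.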

CONCLUSION
This article discussed recent developments in sensor-based human activity recognition using deep architectures. The goal of this article is to identify recent trends and challenges in HAR. Recent research studies on HAR are compared with respect to sensor type, the dataset used, the deep learning models and their applications. Some basic deep models that have been successfully implemented in HAR are also discussed. The paper also presented some publicly available sensor-based datasets for activity recognition. Finally, various research challenges are discussed which may be addressed to make HAR systems more robust and implementable in real-world scenarios.