Human Activity Recognition at the Edge
25 Sep
Human Activity Recognition (HAR) using smart electronic devices with microcontrollers continues to gain momentum across many industries. These devices include wearables, fashion electronics, smartphone sensors, and more. HAR is used in healthcare, education, entertainment, surveillance, sports, security systems, smart homes, and other sectors to analyze human activity and behavior.
Edge computing reduces communication latency and network traffic by moving computation closer to data sources and reducing dependence on cloud servers. However, resource constraints limit the computational capacity of edge devices, so algorithms that combine low latency with low computational cost are better suited to edge deployment for HAR applications.
HAR Input Sources
HAR systems are deployed using either external devices or wearable sensors.
External devices, such as fixed monitoring equipment, are installed at a set location and expected to pick up user interaction. Vision-based inputs are another example of external input for HAR; they require infrastructure support, such as the installation of video cameras. In addition to being cost-prohibitive, cameras cannot capture data when the user is outside their field of view.
Wearable sensors, such as accelerometers, gyroscopes, and magnetometers, support HAR by translating human motion into signal patterns. Recent advances in embedded sensor technology make it possible for today’s smart devices to effectively monitor the user’s activity.
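To illustrate how raw wearable signals become model inputs, the sketch below segments a simulated 3-axis accelerometer stream into overlapping windows and extracts simple per-window statistics. The sampling rate, window length, and feature set here are arbitrary assumptions for the example, not parameters taken from any specific HAR system.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Segment a (time, channels) sensor stream into overlapping windows."""
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

def window_features(windows):
    """Per-window mean and standard deviation of each channel,
    plus the mean magnitude of the 3-axis vector."""
    mean = windows.mean(axis=1)
    std = windows.std(axis=1)
    mag = np.linalg.norm(windows, axis=2).mean(axis=1, keepdims=True)
    return np.hstack([mean, std, mag])

# Example: 10 s of simulated 3-axis accelerometer data at 50 Hz
rng = np.random.default_rng(0)
acc = rng.normal(size=(500, 3))
win = sliding_windows(acc, window_size=100, step=50)  # 2 s windows, 50% overlap
feats = window_features(win)
print(win.shape, feats.shape)  # (9, 100, 3) (9, 7)
```

A classifier running on the device would then map each feature vector (or raw window) to an activity label; overlapping windows keep latency low because a prediction is produced every step, not every full window.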
Scaled Algorithms for Edge-Level Deployment
Deep learning algorithms have demonstrated high performance in HAR systems; however, they are computationally expensive, which makes them inefficient to deploy on edge devices. For action recognition, both spatial cues and temporal dynamics need to be considered. Spatial information includes pixel intensities, patterns, and similar features; temporal dynamics capture the relationship between past and present states of a scene across video frames.
Deep networks such as ConvNets cannot model long-term temporal variations on their own and usually rely on Recurrent Neural Networks (RNNs) or Long Short-Term Memory (LSTM) networks to encode the temporal information present in the scene. This approach typically requires a great deal of computational power for even 10 or 15 frames of a video sequence. Moreover, most existing HAR architectures are designed for trimmed videos, in which an action lasts only 5 to 10 seconds; working with trimmed sequences is not compatible with real-life use cases.
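To make the temporal-encoding cost concrete, here is a minimal NumPy LSTM cell run over per-frame feature vectors (random stand-ins for CNN frame features); the feature dimension, hidden size, and frame count are illustrative assumptions. Each frame requires a full pass through all four gates, so compute grows linearly with the number of frames encoded.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: gates computed from current input x and previous state h."""
    z = W @ x + U @ h + b                                  # (4*hidden,)
    H = h.size
    i, f, g, o = z[:H], z[H:2*H], z[2*H:3*H], z[3*H:]
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))
    g = np.tanh(g)
    c = f * c + i * g                                      # update cell memory
    h = o * np.tanh(c)                                     # new hidden state
    return h, c

def encode_sequence(frames, W, U, b, hidden):
    """Run the LSTM over per-frame features; the final h summarizes the clip."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in frames:                                       # cost scales with frame count
        h, c = lstm_step(x, h, c, W, U, b)
    return h

rng = np.random.default_rng(1)
feat_dim, hidden, n_frames = 16, 8, 15                     # e.g. 15 sampled frames
W = rng.normal(scale=0.1, size=(4*hidden, feat_dim))
U = rng.normal(scale=0.1, size=(4*hidden, hidden))
b = np.zeros(4*hidden)
frames = rng.normal(size=(n_frames, feat_dim))             # stand-in for CNN features
clip_embedding = encode_sequence(frames, W, U, b, hidden)
print(clip_embedding.shape)  # (8,)
```

In a real pipeline the CNN feature extraction dominates this cost, since it must run once per frame before the recurrent step even begins.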
Overcoming these Challenges
A robust, edge-based framework is required to overcome these challenges. Work in this area is still in its infancy, but several research studies have reported solutions that address the computational burden by running variations of deep networks on the edge and using sparse sampling techniques to aggregate information from different parts of a video.
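A sketch of sparse sampling in the style of Temporal Segment Networks: divide the clip into equal temporal segments and draw one frame from each, so the network sees information from the whole video at a fixed, small cost. This deterministic center-frame variant (TSN itself samples a random frame per segment during training) uses made-up frame and segment counts.

```python
def sparse_sample_indices(n_frames, n_segments):
    """Split a video into equal segments and take the center frame of each."""
    seg_len = n_frames / n_segments
    return [int(seg_len * k + seg_len / 2) for k in range(n_segments)]

# A 300-frame clip reduced to 8 frames, one drawn from each temporal segment
idx = sparse_sample_indices(300, 8)
print(idx)  # [18, 56, 93, 131, 168, 206, 243, 281]
```

Because the number of sampled frames is fixed regardless of clip length, the per-video compute budget stays constant even for long, untrimmed recordings.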
Temporal Segment Networks can also be built on smaller base architectures. Lightweight versions of deep networks can reduce memory usage at test time while maintaining comparable accuracy. Some research studies show that a shallow Recurrent Neural Network (RNN) combined with Long Short-Term Memory (LSTM) layers can perform well in terms of accuracy, precision, recall, and F-measure, as reflected in the confusion matrix.
For human activity recognition, lightweight versions of deep networks such as Inflated 3D ConvNet (I3D) and Temporal Segment Networks can serve as smaller base models while maintaining acceptable accuracy. These algorithms succeed in reducing memory usage, which is crucial for running them on edge devices.
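One concrete way lightweight models cut memory is weight quantization. The sketch below applies symmetric per-tensor int8 quantization to a toy float32 weight matrix, shrinking it fourfold while keeping the reconstruction error bounded; this is a generic illustration, not the specific compression scheme used by I3D or TSN.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization of float32 weights to int8."""
    scale = np.abs(w).max() / 127.0            # one step of the int8 grid
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(256, 256)).astype(np.float32)  # a toy weight matrix
q, scale = quantize_int8(w)

print(w.nbytes, q.nbytes)                      # 262144 65536: 4x smaller
err = np.abs(dequantize(q, scale) - w).max()   # rounding error <= half a step
print(err <= scale / 2 + 1e-6)
```

On a microcontroller-class device, the saved memory and the faster int8 arithmetic are often what make on-device inference feasible at all.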
Running HAR on edge devices reduces communication latency, cost, and network traffic. The deep edge networks being developed today can also support multi-sensor data, a common requirement for edge-based solutions.