Multi-Stream Human Action Recognition in Videos using Segmentation with Temporal Convolutional Network
Detecting humans and recognizing the actions they perform in videos has become a common task for computer vision and deep learning algorithms. Action recognition and localization in videos have applications such as multimedia content analysis, content-based video retrieval, human-computer interaction, and video surveillance. We propose a multi-stream architecture for action localization and recognition. The first stream uses Mask R-CNN to segment the action regions, generating a binary mask that indicates whether each pixel belongs to an object and thereby isolating the appearance information. The second stream blurs the background using a median smoothing technique. The third stream extracts motion information from the video in the form of optical flow images. All three streams are fused into a final video descriptor using a network fusion method. This framework avoids the long-range spatio-temporal modeling difficulties of recurrent neural networks by using a multi-stage temporal convolutional network that exploits present and past features of the video sequence. As a result, the method recognizes human actions more accurately.
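As a rough illustration of the second stream, the sketch below median-smooths the background of a frame while leaving the segmented foreground untouched. This is a minimal NumPy-only sketch under our own assumptions: `median_blur` and `blur_background` are hypothetical helper names, the mask is assumed to come from a segmenter such as Mask R-CNN, and a real pipeline would use a standard image-processing library rather than this hand-rolled filter.

```python
import numpy as np

def median_blur(img, k=3):
    # Hand-rolled k x k median filter for a 2-D grayscale image
    # (illustrative stand-in for a library median-smoothing call).
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    # Stack every shifted view of the image, then take the per-pixel median.
    windows = np.stack([
        padded[i:i + h, j:j + w]
        for i in range(k) for j in range(k)
    ])
    return np.median(windows, axis=0)

def blur_background(frame, mask, k=3):
    # Keep pixels where the binary mask marks the actor (foreground);
    # replace everything else with its median-smoothed value.
    blurred = median_blur(frame, k)
    return np.where(mask.astype(bool), frame, blurred)

# Tiny usage example with a synthetic frame and a one-pixel foreground mask.
frame = np.arange(16, dtype=float).reshape(4, 4)
mask = np.zeros((4, 4))
mask[1, 1] = 1
out = blur_background(frame, mask)
```

Foreground pixels pass through unchanged, so the appearance of the actor is preserved while background texture is suppressed before the streams are fused.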
Keywords - Action Recognition in Videos, Deep Learning, Computer Vision, Mask R-CNN, Temporal Convolutional Network (TCN)