Abstract: Action recognition has important application value in video retrieval. Convolutional-neural-network-based action recognition methods suffer from limited ability to model long temporal sequences, difficulty in extracting multi-scale features, and interference from illumination changes and complex backgrounds. To address these problems, a long-video action recognition method based on multi-modal feature fusion is proposed. First, because adjacent frames in a long action sequence differ only slightly, dense sampling produces highly redundant frames; a uniform sparse sampling strategy is therefore used to model the temporal structure of the entire video segment, fully retaining long-range information while reducing frame redundancy. Second, multi-column convolutions extract multi-scale spatial and temporal features, weakening the interference that viewpoint changes introduce into video images. Third, optical-flow data are introduced, and deep optical-flow features are extracted by a feature extraction network guided by a spatial attention mechanism; the complementary strengths of the different data modalities improve the accuracy and robustness of the network across scenarios. Finally, the multi-scale spatio-temporal features and the optical-flow features are fused in the fully connected layer of the network, realizing end-to-end long-video action recognition. Experimental results show that the proposed method achieves average accuracies of 97.2% on UCF101 and 72.8% on HMDB51, outperforming the compared methods and demonstrating the effectiveness of the approach.
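The uniform sparse sampling the abstract describes resembles TSN-style segment sampling: the video is split into equal temporal segments and one frame is drawn per segment, so a fixed frame budget covers the whole clip. A minimal sketch, assuming this interpretation (function and parameter names here are illustrative, not from the paper):

```python
import numpy as np

def sparse_sample_indices(num_frames, num_segments, train=False, rng=None):
    """Uniform sparse sampling: split the video into `num_segments` equal
    segments and pick one frame index per segment. At test time the segment
    centre is used; during training a random frame per segment adds
    temporal jitter while still covering the full clip."""
    edges = np.linspace(0, num_frames, num_segments + 1)
    starts, ends = edges[:-1], edges[1:]
    if train:
        rng = rng or np.random.default_rng()
        idx = np.array([rng.integers(int(s), max(int(s) + 1, int(e)))
                        for s, e in zip(starts, ends)])
    else:
        idx = ((starts + ends) / 2).astype(int)  # segment centres
    return np.clip(idx, 0, num_frames - 1)

# 8 indices spread evenly over a 300-frame clip
print(sparse_sample_indices(300, 8))  # → [ 18  56  93 131 168 206 243 281]
```

Compared with dense consecutive sampling, this keeps one representative frame per segment, which matches the abstract's goal of retaining long-range temporal information while cutting redundancy.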
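The fusion step, fusing the RGB-stream features with the optical-flow features in the fully connected layer, is a standard late-fusion pattern: concatenate the per-stream feature vectors and classify with one linear layer plus softmax. A minimal sketch under that assumption (dimensions and names are illustrative; the paper's exact layer sizes are not given in the abstract):

```python
import numpy as np

def late_fuse(rgb_feat, flow_feat, w, b):
    """Concatenate the multi-scale spatio-temporal (RGB) feature vector with
    the optical-flow feature vector, then apply a fully connected layer and
    softmax to obtain class probabilities."""
    fused = np.concatenate([rgb_feat, flow_feat])  # shape (d_rgb + d_flow,)
    logits = w @ fused + b                         # shape (num_classes,)
    exp = np.exp(logits - logits.max())            # numerically stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(0)
num_classes, d_rgb, d_flow = 101, 512, 512  # e.g. UCF101 has 101 action classes
probs = late_fuse(rng.standard_normal(d_rgb),
                  rng.standard_normal(d_flow),
                  rng.standard_normal((num_classes, d_rgb + d_flow)) * 0.01,
                  np.zeros(num_classes))
print(probs.shape)  # → (101,)
```

Training end to end simply means the fusion layer's gradient flows back into both stream backbones, so the modalities learn complementary features rather than being trained separately.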