Long Video Action Recognition Method Based on Multimodal Feature Fusion
DOI:
Authors: Wang Ting, Liu Guanghui, Zhang Yumin, Meng Yuebo, Xu Shengjun
Affiliation:

Xi'an University of Architecture and Technology

Author biography:

Corresponding author:

CLC number:

Fund projects:

General Program of the Shaanxi Provincial Natural Science Foundation (2020JM-473, 2020JM-472); Basic Research Fund of Xi'an University of Architecture and Technology (JC1703); Natural Science Fund of Xi'an University of Architecture and Technology (ZR19046)


Abstract:

Action recognition technology has important application value in video retrieval. To address the shortcomings of convolutional-neural-network-based action recognition methods, namely limited ability to recognize actions over long temporal ranges, difficulty in extracting multi-scale features, and susceptibility to illumination changes and complex background interference, a long-video action recognition method based on multimodal feature fusion is proposed. First, because the differences between frames of a long action sequence are small and consecutive frames are therefore highly redundant, a uniform sparse sampling strategy is used to model the temporal structure of the entire video, fully preserving long-range temporal information while reducing frame redundancy. Second, multi-column convolution is used to extract multi-scale spatio-temporal features, weakening the interference that viewpoint changes introduce into video images. Optical flow data are then introduced, and a feature extraction network guided by a spatial attention mechanism extracts deep features from the optical flow, exploiting the complementary strengths of the different data modalities to improve the accuracy and robustness of the network across scenarios. Finally, the multi-scale spatio-temporal features and the optical flow features are fused in the fully connected layer of the network, realizing end-to-end long-video action recognition. Experimental results show that the proposed method achieves average accuracies of 97.2% and 72.8% on the UCF101 and HMDB51 datasets respectively, outperforming the compared methods and demonstrating its effectiveness.
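To make the sampling step concrete, the following is a minimal sketch of TSN-style uniform sparse sampling consistent with the strategy described above; the function name sparse_sample_indices and its parameters (e.g. num_segments = 8) are illustrative assumptions, not code from the paper.

import numpy as np

def sparse_sample_indices(num_frames, num_segments=8, training=False):
    """Split a video into num_segments equal spans and pick one frame per span.

    Illustrative assumption: a random in-span offset at training time adds
    temporal jitter, while the span centre at test time gives a deterministic,
    low-redundancy summary that still covers the whole video.
    """
    span = num_frames / num_segments
    if training:
        offsets = np.random.uniform(0.0, span, size=num_segments)
    else:
        offsets = np.full(num_segments, span / 2.0)
    starts = np.arange(num_segments) * span
    return np.minimum((starts + offsets).astype(int), num_frames - 1)

# Example: summarize a 300-frame clip with 8 frames (test-time centres).
print(sparse_sample_indices(300))  # -> [ 18  56  93 131 168 206 243 281]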

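For the remaining components, i.e. multi-column convolution for multi-scale spatio-temporal features, spatial attention on the optical-flow stream, and fusion at the fully connected layer, the PyTorch sketch below shows one plausible arrangement; the class names (MultiColumnBlock, SpatialAttention, FusionHead), the 3/5/7 kernel sizes, and all layer widths are hypothetical, not the authors' released architecture.

import torch
import torch.nn as nn

class MultiColumnBlock(nn.Module):
    """Parallel convolution columns with different kernel sizes; concatenating
    their outputs yields features at several receptive-field scales."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.columns = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (3, 5, 7)]
        )

    def forward(self, x):
        return torch.cat([col(x) for col in self.columns], dim=1)

class SpatialAttention(nn.Module):
    """Gates each spatial location of the optical-flow feature map with a
    learned sigmoid score, emphasising motion-relevant regions."""
    def __init__(self, ch):
        super().__init__()
        self.score = nn.Conv2d(ch, 1, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.score(x))

class FusionHead(nn.Module):
    """Pools the RGB and flow feature maps, concatenates them, and classifies
    with a single fully connected layer (late fusion)."""
    def __init__(self, rgb_ch, flow_ch, num_classes):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(rgb_ch + flow_ch, num_classes)

    def forward(self, rgb_feat, flow_feat):
        r = self.pool(rgb_feat).flatten(1)
        f = self.pool(flow_feat).flatten(1)
        return self.fc(torch.cat([r, f], dim=1))

In this arrangement the two streams stay separate until the classifier, mirroring the abstract's description of fusing the multi-scale spatio-temporal features and the optical-flow features at the fully connected layer.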

Cite this article:

Wang Ting, Liu Guanghui, Zhang Yumin, Meng Yuebo, Xu Shengjun. Long video action recognition method based on multimodal feature fusion [J]. Computer Measurement & Control, 2021, 29(11): 165-170.

History
  • Received: 2021-04-08
  • Revised: 2021-05-11
  • Accepted: 2021-05-12
  • Published online: 2021-11-22
  • Publication date: