Video Continuous Dynamic Sign Language Recognition Algorithm Based on SlowFast Network

Affiliation: Xi'an Fanyi University (西安翻译学院)

Fund Project: 2024 General Special Scientific Research Program of the Shaanxi Provincial Department of Education (Project No. 24JK0457)

    Abstract:

    Continuous dynamic sign language contains rich spatial information (such as hand shape and position) and temporal information (such as the order and speed of movements). The resulting spatial and temporal features are highly redundant, which reduces recognition accuracy. To address this, a video continuous dynamic sign language recognition algorithm based on the SlowFast network is proposed. Sign language videos are captured with a dual-camera stereo vision system and the images are corrected; joint-point features are then detected using a Hough gradient method with optimized sorting. A Mahalanobis distance algorithm based on affine transformation matches corresponding stereo points, and a dynamic linear model method with pyramid optical flow tracks the joint points continuously. An enhanced SlowFast network architecture is designed in which two pathways capture the spatial semantics and the temporal dynamics of the sign language video, respectively, and the spatiotemporal information is then fused. Attention mechanisms and keyframe extraction, combined with an improved loss function, complete the dynamic recognition of continuous sign language. With its dual-pathway architecture and spatiotemporal feature extraction, the SlowFast network can effectively reduce the redundancy of spatiotemporal features. Experimental results show that the SlowFast-based method performs best in the mean endpoint error (EPE mean) test, with low error, and that its spatiotemporal feature redundancy never exceeds 0.50, indicating higher accuracy and stability in video continuous dynamic sign language recognition.
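
The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: the dual-rate frame sampling that the standard SlowFast design uses to feed its two pathways, and the mean endpoint error (EPE) metric. The function names `sample_pathways` and `mean_epe` and the values `fast_stride=2` and `alpha=8` are illustrative assumptions, not the authors' settings, and the enhanced architecture, attention mechanism, and loss function described above are not reproduced here.

```python
import math


def sample_pathways(num_frames, fast_stride=2, alpha=8):
    """Return (slow_indices, fast_indices) for a clip of num_frames frames.

    The fast pathway keeps every fast_stride-th frame; the slow pathway
    keeps every alpha-th frame of the fast pathway's selection.
    fast_stride and alpha are assumed values, not taken from the paper.
    """
    if num_frames <= 0 or fast_stride <= 0 or alpha <= 0:
        raise ValueError("num_frames, fast_stride and alpha must be positive")
    fast = list(range(0, num_frames, fast_stride))
    slow = fast[::alpha]
    return slow, fast


def mean_epe(pred, truth):
    """Mean Euclidean distance between predicted and reference 2-D vectors.

    pred and truth are equal-length sequences of (u, v) pairs, for example
    tracked joint displacements and their ground-truth values.
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and equal in length")
    total = sum(
        math.hypot(pu - tu, pv - tv)
        for (pu, pv), (tu, tv) in zip(pred, truth)
    )
    return total / len(pred)


if __name__ == "__main__":
    slow, fast = sample_pathways(64)
    print(len(slow), len(fast))
    print(mean_epe([(3.0, 4.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]))
```

With these assumed defaults, a 64-frame clip yields 32 frames for the fast pathway and 4 for the slow pathway. The slow pathway thus sees few frames and can spend its capacity on spatial detail, while the fast pathway sees many frames to follow motion.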

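Cite this article

包艳艳, 尤国强. Video Continuous Dynamic Sign Language Recognition Algorithm Based on SlowFast Network[J]. 计算机测量与控制 (Computer Measurement & Control), 2026, 34(5): 223-231.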

History
  • Received: 2025-05-23
  • Last revised: 2025-07-15
  • Accepted: 2025-07-17
  • Published online: 2026-05-26