Research on Spoken Language Identification Method Based on Improved ECAPA-TDNN
Authors: Li Wei, Yao Lei, Liu Cheng

Affiliation: Shanghai University

Fund Project: Lingang Laboratory Research Task (LG-GG-202402-05-02)




Abstract:

Traditional speech recognition models struggle to effectively address linguistic variations across different languages, and incomplete feature extraction leads to suboptimal recognition accuracy. To tackle this issue, we propose an improved ECAPA-TDNN-based method for language identification and classification. Building on ECAPA-TDNN, we design an encoder module centered on a multi-head self-attention mechanism, in which parallel attention heads perform weighted aggregation of feature sequences across time steps, capturing global dependencies between them. We also design a front-end convolutional module built from multi-layer residual blocks: the convolutional layers combine stacked convolutions with residual connections to strengthen multi-level feature extraction, and, to match the characteristics of speech signals, downsampling is applied separately along the time and frequency dimensions to generate richer feature representations. After preprocessing the input audio to obtain FBank features, data augmentation is applied before the features are fed into the optimized language identification model. Experiments show that the improved model achieves recognition accuracies of 92.91% and 90.39% on the Common Voice and CSS10 datasets, respectively, improvements of 5.23% and 4.78% over the original ECAPA-TDNN. Ablation studies on the front-end convolutional module and the attention encoder module validate the effectiveness of each component. These results indicate that the proposed approach strengthens the extraction of both local and global speech features relative to the baseline ECAPA-TDNN and improves performance on language identification tasks.
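The abstract does not give the encoder's implementation details, but the weighted aggregation that a multi-head self-attention layer performs over time steps can be illustrated with a minimal NumPy sketch. The sequence length, feature dimension, head count, and random projection weights below are invented for illustration and are not the paper's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, n_heads):
    """Aggregate a feature sequence X (T x d) across time steps.

    Each head computes a T x T attention map and uses it to take a
    weighted sum of the value vectors, so every output frame can draw
    on every input frame (the "global dependencies" in the abstract).
    """
    T, d = X.shape
    dh = d // n_heads                          # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # each T x d
    heads = []
    for h in range(n_heads):
        s = slice(h * dh, (h + 1) * dh)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(dh))  # T x T weights
        heads.append(A @ V[:, s])                        # T x dh
    return np.concatenate(heads, axis=1)                 # back to T x d

rng = np.random.default_rng(0)
T, d, H = 50, 64, 4        # 50 frames, 64-dim features, 4 heads (illustrative)
X = rng.standard_normal((T, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
Y = multi_head_self_attention(X, *W, n_heads=H)
print(Y.shape)  # (50, 64)
```

Because every head attends over the full sequence, the output at each time step mixes information from all frames, unlike a convolution whose receptive field is bounded by its kernel size.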
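Similarly, the separate time- and frequency-axis downsampling applied by the front-end convolutional module can be sketched with a toy average-pooling example. The pooling factors and feature-map size below are assumptions for illustration; the paper's module uses strided convolutions inside residual blocks, which this sketch does not reproduce:

```python
import numpy as np

def pool2d(x, kt, kf):
    """Average-pool a (T, F) feature map with window/stride (kt, kf)."""
    T, F = x.shape
    T2, F2 = T // kt, F // kf
    return x[:T2 * kt, :F2 * kf].reshape(T2, kt, F2, kf).mean(axis=(1, 3))

def downsample_time_then_freq(x):
    # Downsample the two axes in separate steps rather than with one
    # joint 2x2 pooling, mirroring the idea of treating the time and
    # frequency dimensions of speech features independently.
    x = pool2d(x, kt=2, kf=1)   # time-only downsampling
    x = pool2d(x, kt=1, kf=2)   # frequency-only downsampling
    return x

fbank = np.random.default_rng(1).standard_normal((100, 80))  # 100 frames x 80 mel bins
out = downsample_time_then_freq(fbank)
print(out.shape)  # (50, 40)
```

Splitting the reduction into per-axis steps lets a network insert different convolutions between the two stages, so temporal and spectral structure are compressed by separately learned filters rather than a single shared one.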

Cite this article:

Li Wei, Yao Lei, Liu Cheng. Research on Spoken Language Identification Method Based on Improved ECAPA-TDNN [J]. Computer Measurement & Control, 2026, 34(4): 137-143.

History
  • Received: 2025-03-31
  • Revised: 2025-05-13
  • Accepted: 2025-05-15
  • Published online: 2026-04-15