Abstract: Traditional speech recognition models struggle to handle linguistic variation across languages, and incomplete feature extraction leads to suboptimal recognition accuracy. To address these issues, we propose an improved language identification and classification method based on ECAPA-TDNN. Building on ECAPA-TDNN, we design an encoder module centered on a multi-head self-attention mechanism, in which parallel attention heads perform weighted aggregation of feature sequences across time steps, capturing global dependencies among them. We also design a front-end convolutional module based on multi-layer residual blocks, whose stacked convolutions and residual connections strengthen multi-level feature extraction. To match the characteristics of speech signals, the module downsamples the time and frequency dimensions separately, producing richer feature representations. Input audio is preprocessed into FBank features and further augmented before being fed into the optimized language recognition model. Experimental results show that the enhanced model achieves recognition accuracies of 92.91% and 90.39% on the Common Voice and CSS10 datasets, respectively, improvements of 5.23% and 4.78% over the original ECAPA-TDNN. Ablation studies on both the front-end convolutional module and the attention encoder module validate the effectiveness of each component. These results indicate that the proposed approach extracts both local and global speech features more effectively than the baseline ECAPA-TDNN, significantly improving performance on language identification tasks.
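The abstract's attention encoder, where parallel heads aggregate frame-level features across time steps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the dimensions (80-dimensional frame features, 4 heads) and the projection-matrix names are assumptions chosen for the example, and a real model would use learned weights inside a deep network.

```python
# Minimal sketch of multi-head self-attention over a sequence of
# frame-level speech features (all names/dimensions are illustrative).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """x: (T, D) frame features; w_q/w_k/w_v: (D, D) projections (assumed)."""
    T, D = x.shape
    head_dim = D // num_heads
    # Project, then split the feature dimension into parallel heads: (H, T, d)
    q = (x @ w_q).reshape(T, num_heads, head_dim).transpose(1, 0, 2)
    k = (x @ w_k).reshape(T, num_heads, head_dim).transpose(1, 0, 2)
    v = (x @ w_v).reshape(T, num_heads, head_dim).transpose(1, 0, 2)
    # Scaled dot-product scores between every pair of time steps: (H, T, T)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    attn = softmax(scores, axis=-1)   # each head's weights over time steps
    out = attn @ v                    # weighted aggregation per head
    # Concatenate heads back into the feature dimension: (T, D)
    return out.transpose(1, 0, 2).reshape(T, D)

rng = np.random.default_rng(0)
T, D, H = 100, 80, 4                  # 100 frames, 80-dim features, 4 heads
x = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
y = multi_head_self_attention(x, w_q, w_k, w_v, H)
print(y.shape)  # (100, 80)
```

Because the attention weights span the full (T, T) score matrix, every output frame mixes information from all time steps, which is the global-dependency property the abstract attributes to the encoder module.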