



CLC number: TN912.3    Document code: A    Article number: 1674-2605(2024)04-0006-07

DOI: 10.3969/j.issn.1674-2605.2024.04.006                    Open Access

Speech Emotion Recognition Model Based on Multi-scale Convolution and Multi-head Self-attention

ZHONG Shanji  ZHANG Xuexi  CHEN Chujia  GAO Xueqiu  TAO Jie

(Guangdong University of Technology, Guangzhou 510006, China)

Abstract: A speech emotion recognition model based on multi-scale convolution and multi-head self-attention (MCNN-MHA) is proposed to address the inability of traditional convolutional neural networks to fully capture time-domain and frequency-domain details in speech emotion recognition. First, a multi-scale convolutional neural network convolves the input at different scales to extract features across different time and frequency ranges. Then, a multi-head self-attention mechanism is introduced to automatically learn the relevant and important features of the speech signal, attending to different feature subspaces to enhance the perception of important features. Finally, the frequency-domain and time-domain masks of SpecAugment are used to augment the data samples, improving the generalization and robustness of the model. Experimental results show that the MCNN-MHA model achieves an accuracy of 90.35% on the RAVDESS dataset.
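The multi-head self-attention step described in the abstract (splitting features into per-head subspaces and applying scaled dot-product attention in each) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the projection-weight arguments (wq, wk, wv, wo) and their shapes are assumptions for the sketch.

```python
import numpy as np

def multi_head_self_attention(x, wq, wk, wv, wo, num_heads):
    """Scaled dot-product self-attention over num_heads subspaces.

    x: (seq_len, d_model) feature sequence.
    wq, wk, wv, wo: (d_model, d_model) projection matrices (assumed shapes).
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project to queries/keys/values, then split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head)
    q = (x @ wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Per-head attention scores, scaled by sqrt(d_head)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the key axis
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    # Weighted sum of values, heads concatenated back to d_model
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ wo
```

Each head attends over the full sequence but only within its own d_head-dimensional subspace, which is what lets different heads emphasize different aspects of the features.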

Keywords: speech emotion recognition; multi-scale convolutional neural network; multi-head self-attention mechanism; SpecAugment
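The SpecAugment-style augmentation mentioned in the abstract and keywords masks random frequency bands and time spans of a spectrogram. Below is a minimal NumPy sketch of that idea, not the authors' code; the function name, default mask widths, and the choice of zero-filling masked regions are assumptions.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=1, freq_mask_width=8,
                 num_time_masks=1, time_mask_width=10, rng=None):
    """Apply frequency- and time-masking to a (freq_bins, time_steps)
    spectrogram. Masked regions are set to zero (an assumed fill value)."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Zero out num_freq_masks random horizontal bands (frequency masks)
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Zero out num_time_masks random vertical spans (time masks)
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Because the masks are drawn fresh for every training sample, the model never sees exactly the same spectrogram twice, which is the mechanism behind the claimed gain in generalization and robustness.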
