
Utterance-Level Speech Emotion Recognition using Parallel Convolutional Neural Network with Self-Attention Module


Abstract
Automated speech emotion recognition (SER) through efficient long-term temporal context modeling is a challenging task in the digital audio signal processing domain. Typically, a recurrent neural network (RNN) is employed to incorporate temporal dependencies and investigate the relationships among sequences and features. In this study, we instead design a parallel convolutional neural network (PCNN) for SER by combining a squeeze-and-excitation network (SEnet) with a self-attention module. Additionally, we adopt a residual learning strategy in both modules, SEnet and self-attention, which further improves the performance of the network. Our proposed SER system takes a speech spectrogram as input and extracts utterance-level discriminative features using the PCNN model. We experimentally evaluated the proposed system on the standard speech corpus, interactive emotional dyadic motion capture (IEMOCAP). The prediction results demonstrate the significance and robustness of the proposed PCNN system, which achieved a recognition rate of 72.01%, surpassing state-of-the-art (SOTA) methods.
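The SEnet module mentioned in the abstract recalibrates channel responses in three steps: squeeze (global average pooling per channel), excitation (a small bottleneck network ending in a sigmoid that produces one weight per channel), and scale (multiplying each channel by its weight). As a rough illustration of this squeeze-excitation-scale idea only — not the authors' implementation; the toy feature maps and weight matrices below are hypothetical — a minimal pure-Python sketch:

```python
import math

def se_reweight(feature_maps, w1, w2):
    """Squeeze-and-excitation channel reweighting on toy 2D feature maps.

    feature_maps: list of C channels, each a list of rows of floats.
    w1: bottleneck weights of shape (C/r x C); w2: shape (C x C/r).
    Both weight matrices are hypothetical stand-ins for learned parameters.
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_maps]
    # Excitation: bottleneck FC + ReLU, then FC + sigmoid -> channel weights.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    weights = [1.0 / (1.0 + math.exp(-x)) for x in scores]
    # Scale: multiply every value in a channel by that channel's weight.
    return [[[v * wt for v in row] for row in ch]
            for ch, wt in zip(feature_maps, weights)]

# Two 2x2 channels; reduction ratio r=2, so the bottleneck has one unit.
fm = [[[1.0, 1.0], [1.0, 1.0]],
      [[2.0, 2.0], [2.0, 2.0]]]
reweighted = se_reweight(fm, w1=[[0.5, 0.5]], w2=[[1.0], [1.0]])
```

In the paper's PCNN, such recalibrated maps would then be combined with the self-attention branch via residual connections; this sketch shows only the channel-attention step.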

Table of Contents

Abstract
I. INTRODUCTION
II. PROPOSED PCNN-BASED SER SYSTEM
A. SEnet Module
B. Self-Attention
III. RESULTS & DISCUSSION
IV. CONCLUSION & FUTURE DIRECTION
ACKNOWLEDGEMENT
REFERENCES

Authors

  • Mustaqeem [ Interaction Technology Laboratory, Department of Software, Sejong University ]
  • Muhammad Ishaq [ Interaction Technology Laboratory, Department of Software, Sejong University ]
  • Guiyoung Son [ Interaction Technology Laboratory, Department of Software, Sejong University ]
  • Soonil Kwon [ Interaction Technology Laboratory, Department of Software, Sejong University ] (Corresponding Author)


    Journal Information

    • Publication
      한국차세대컴퓨팅학회 학술대회 (Conference of the Korean Society for Next Generation Computing)
    • Frequency
      Semiannual
    • Coverage
      2021~2025
    • Decimal Classification
      KDC 566, DDC 004