

Human-computer interaction requires the ability to recognize, interpret, and respond to emotions expressed in speech. To date, very little research has addressed speech emotion recognition (SER) for Indonesian, largely because Indonesian corpora for SER are limited. In this study, an SER system was built using a dataset drawn from an Indonesian TV series. The system classifies emotions into four classes: angry, happy, neutral, and sad. It is implemented with deep learning, specifically a convolutional neural network (CNN). The system input is a combination of three features: MFCC, fundamental frequency, and RMSE. In the experiments, the best result for the Indonesian SER system was obtained with MFCC + fundamental frequency as input, which reached an accuracy of 85%. The lowest accuracy, 72%, was obtained with MFCC + RMSE. This initial study is intended to give researchers in the SER field an overview of how to select speech signal features as input for testing, and to make the next steps of their own research easier.
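The feature combinations compared above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that RMSE denotes per-frame root-mean-square energy, that fundamental frequency is estimated with a simple autocorrelation peak search, and that MFCC rows are precomputed elsewhere and passed in as plain lists. All function names and parameters (`frames`, `rms_energy`, `f0_autocorr`, `combine`, `fmin`, `fmax`) are hypothetical.

```python
import math

def frames(signal, frame_len, hop):
    """Split a list of samples into overlapping frames (ragged tail dropped)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def rms_energy(frame):
    """Root-mean-square energy of one frame (assumed meaning of RMSE)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def f0_autocorr(frame, sr, fmin=75.0, fmax=400.0):
    """Crude F0 estimate: lag of the largest autocorrelation value
    among lags that correspond to pitches between fmin and fmax."""
    lag_min = int(sr / fmax)
    lag_max = min(int(sr / fmin), len(frame) - 1)
    best_lag, best_val = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        val = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    # Return 0.0 when no positive peak was found (e.g. silence).
    return sr / best_lag if best_lag else 0.0

def combine(mfcc_rows, signal, sr, frame_len, hop, use_f0=True, use_rms=False):
    """Append the selected per-frame features to each precomputed MFCC row."""
    out = []
    for row, frame in zip(mfcc_rows, frames(signal, frame_len, hop)):
        vec = list(row)
        if use_f0:
            vec.append(f0_autocorr(frame, sr))
        if use_rms:
            vec.append(rms_energy(frame))
        out.append(vec)
    return out
```

Calling `combine(..., use_f0=True, use_rms=False)` would correspond to the MFCC + fundamental frequency input, and `use_f0=False, use_rms=True` to the MFCC + RMSE input. The resulting per-frame vectors would then be stacked into the array fed to the CNN classifier.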


Speech Emotion Recognition (SER), CNN, deep learning


How to Cite
Aini, Y. K., Santoso, T. B., & Dutono, T. (2021). Pemodelan CNN Untuk Deteksi Emosi Berbasis Speech Bahasa Indonesia. Jurnal Komputer Terapan, 7(1), 143–152.

