MUSIC EMOTION RECOGNITION AND CLASSIFICATION USING HYBRID CNN-LSTM DEEP NEURAL NETWORK

  • Jumpi Dutta Research Scholar & Corresponding Author, Department of Electrical Engineering, Assam Engineering College, Guwahati, Assam, India https://orcid.org/0009-0008-8265-9242
  • Dipankar Chanda Professor, Department of Electrical Engineering, Assam Engineering College, Guwahati, Assam, India https://orcid.org/0009-0007-3704-716X
Keywords: Emotion Recognition, Assamese, Feature Extraction, Mel-Frequency Cepstral Coefficient, Deep Learning, Long Short-Term Memory Network, Convolutional Neural Network.

Abstract

In music information retrieval (MIR), emotion-based classification is a complex and challenging task. In modern information technology, understanding musical emotion through human-computer interaction plays a vital role and has captured the attention of both researchers and the music industry. This paper presents a learning framework that adopts decision tree, random forest, k-nearest neighbors, multi-layer perceptron (MLP), long short-term memory (LSTM) neural network, convolutional neural network (CNN), and hybrid CNN-LSTM deep learning approaches together with relevant feature extraction techniques. Emotion recognition in music has been studied for languages such as English, Chinese, Spanish, Turkish, and Hindi; however, languages like Assamese have drawn very little attention in music emotion recognition (MER) research. This work presents a novel approach to emotion recognition in Assamese songs. Two datasets are used: a newly created Assamese dataset of 200 song samples labeled with four emotions, and the RAVDESS emotional song database of 1,012 song samples covering six emotions. Relevant features, namely mel-frequency cepstral coefficients (MFCC), mel spectrogram, and chroma features, are extracted from the song samples to evaluate the proposed method. A comparative analysis of the classifiers shows that the CNN-LSTM model achieves the best accuracy on both datasets: 85.00% on the Assamese dataset and 89.66% on the RAVDESS dataset.
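The feature pipeline named in the abstract (a mel spectrogram followed by cepstral coefficients) can be sketched in plain NumPy. The frame length, hop size, filter count, and coefficient count below are illustrative assumptions for a sketch, not the authors' actual settings, and a pure sine tone stands in for a real song clip.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def mfcc(signal, sr=22050, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    # Frame the signal with a Hann window, take the power spectrum per frame.
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2          # power spectrogram
    mel_spec = spec @ mel_filterbank(n_mels, n_fft, sr).T    # mel spectrogram
    log_mel = np.log(mel_spec + 1e-10)
    # DCT-II over the mel axis yields the cepstral coefficients.
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ dct.T                                   # (frames, n_mfcc)

# One second of a 440 Hz tone as a stand-in for a song sample.
sr = 22050
t = np.arange(sr) / sr
coeffs = mfcc(np.sin(2 * np.pi * 440 * t), sr=sr)
print(coeffs.shape)
```

In practice a library such as librosa would compute these features; the sketch only makes explicit what the abstract's "MFCC" and "mel spectrogram" features are derived from.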
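The hybrid CNN-LSTM classifier can likewise be sketched as a forward pass in NumPy: a 1-D convolution over the frame axis of an MFCC sequence to learn local feature maps, an LSTM over time to summarize the sequence, and a softmax over four classes (matching the Assamese dataset's four emotions). The layer sizes, kernel width, and random weights are assumptions for illustration only; the abstract does not specify the trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conv1d(x, w, b):
    # x: (T, C_in); w: (k, C_in, C_out); valid convolution over time.
    k = w.shape[0]
    return np.array([np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
                     for t in range(x.shape[0] - k + 1)])

def lstm_last(x, Wx, Wh, b, H):
    # Run one LSTM layer over the sequence, return the final hidden state.
    h, c = np.zeros(H), np.zeros(H)
    for t in range(x.shape[0]):
        z = x[t] @ Wx + h @ Wh + b             # (4H,) gate pre-activations
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

T, C, H, K, n_classes = 85, 13, 32, 5, 4        # frames, MFCCs, hidden, kernel, emotions
x = rng.standard_normal((T, C))                 # one MFCC sequence
w = rng.standard_normal((K, C, 16)) * 0.1       # conv kernel -> 16 filters
feat = np.maximum(conv1d(x, w, np.zeros(16)), 0.0)   # ReLU feature maps, (T-K+1, 16)
Wx = rng.standard_normal((16, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
h = lstm_last(feat, Wx, Wh, np.zeros(4 * H), H)
logits = h @ (rng.standard_normal((H, n_classes)) * 0.1)
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # softmax over the four emotions
print(probs.shape)
```

The design point the hybrid exploits is the division of labor: the convolution captures short-span spectral patterns, while the LSTM models how those patterns evolve across the song, which neither component does well alone.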

JEL Classification Codes: C880, Y100, Y800.

References

Agga, A., Abbou, A., Labbadi, M., Houm, Y. E., & Ali, I. H. O. (2022). CNN-LSTM: An efficient hybrid deep learning architecture for predicting short-term photovoltaic power production. Electric Power Systems Research, 208, 107908. https://doi.org/10.1016/j.epsr.2022.107908

Aksan, F., Li, Y., Suresh, V., & Janik, P. (2023). CNN-LSTM vs. LSTM-CNN to predict power flow direction: A case study of the high-voltage subnet of northeast Germany. Sensors, 23(2), 901. https://doi.org/10.3390/s23020901

Aljanaki, A., Yang, Y. H., & Soleymani, M. (2017). Developing a benchmark for emotional analysis of music. PloS One, 12(3), e0173392. https://doi.org/10.1371/journal.pone.0173392

Aziz, M. N. (2020). A review on artificial neural networks and its applicability. Bangladesh Journal of Multidisciplinary Scientific Research, 2(1), 48-51. https://doi.org/10.46281/bjmsr.v2i1.609

Bhatkar, A. P., & Kharat, G. U. (2015, December). Detection of diabetic retinopathy in retinal images using MLP classifier. In 2015 IEEE international symposium on nanoelectronic and information systems (pp. 331-335). IEEE. https://doi.org/10.1109/inis.2015.30

Chen, Y. A., Yang, Y. H., Wang, J. C., & Chen, H. (2015, April). The AMG1608 dataset for music emotion recognition. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 693-697). IEEE. https://doi.org/10.1109/icassp.2015.7178058

Christy, A., Vaithyasubramanian, S., Jesudoss, A., & Praveena, M. D. A. (2020). Multimodal speech emotion recognition and classification using convolutional neural network techniques. International Journal of Speech Technology, 23(2), 381–388. https://doi.org/10.1007/s10772-020-09713-y

De Benito-Gorron, D., Lozano-Diez, A., Toledano, D. T., & Gonzalez-Rodriguez, J. (2019). Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset. EURASIP Journal on Audio, Speech and Music Processing, 2019(1), 1-18. https://doi.org/10.1186/s13636-019-0152-1

Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier, J., & Moussallam, M. (2018). Music mood detection based on audio and lyrics with deep neural net. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1809.07276

Deng, J. J., & Leung, C. H. C. (2012). Music emotion retrieval based on acoustic features. In Lecture notes in electrical engineering (pp. 169–177). https://doi.org/10.1007/978-3-642-28744-2_22

Domínguez-Jiménez, J., Campo-Landines, K., Martínez-Santos, J., Delahoz, E., & Contreras-Ortiz, S. (2020). A machine learning model for emotion recognition from physiological signals. Biomedical Signal Processing and Control, 55, 101646. https://doi.org/10.1016/j.bspc.2019.101646

Ellis, D. P., & Poliner, G. E. (2007, April). Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07) (Vol. 4, pp. IV-1429). IEEE. https://doi.org/10.1109/icassp.2007.367348

Er, M. B., & Esin, E. M. (2021). Music emotion recognition with machine learning based on audio features. Computer Science, 6(3), 133-144. https://doi.org/10.53070/bbd.945894

Farooq, M., Hussain, F., Baloch, N. K., Raja, F. R., Yu, H., & Zikria, Y. B. (2020). Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors, 20(21), 6008. https://doi.org/10.3390/s20216008

Han, X., Chen, F., & Ban, J. (2023). Music Emotion Recognition Based on a Neural Network with an Inception-GRU Residual Structure. Electronics, 12(4), 978. https://doi.org/10.3390/electronics12040978

He, N., & Ferguson, S. (2022). Music emotion recognition based on segment-level two-stage learning. International Journal of Multimedia Information Retrieval, 11(3), 383–394. https://doi.org/10.1007/s13735-022-00230-z

Hizlisoy, S., Yildirim, S., & Tufekci, Z. (2021). Music emotion recognition using convolutional long short term memory deep neural networks. Engineering Science and Technology, an International Journal, 24(3), 760–767. https://doi.org/10.1016/j.jestch.2020.10.009

Iversen, A., Taylor, N., & Brown, K. (2006). Classification and verification through the combination of the multi-layer perceptron and auto-association neural networks. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks. IEEE. https://doi.org/10.1109/ijcnn.2005.1556018

Jamdar, A., Abraham, J., Khanna, K., & Dubey, R. (2015). Emotion analysis of songs based on lyrical and audio features. International Journal of Artificial Intelligence and Applications, 6(3), 35–50. https://doi.org/10.5121/ijaia.2015.6304

Jitendra, M., & Radhika, Y. (2021). An automated music recommendation system based on listener preferences. In Recent Trends in Intensive Computing (pp. 80-87). IOS Press. https://doi.org/10.3233/apc210182

Kaya, E. M., Huang, N., & Elhilali, M. (2020). Pitch, timbre and intensity interdependently modulate neural responses to salient sounds. Neuroscience, 440, 1–14. https://doi.org/10.1016/j.neuroscience.2020.05.018

Kim, Y., Lee, H., & Provost, E. M. (2013, May). Deep learning for robust feature generation in audiovisual emotion recognition. In 2013 IEEE international conference on acoustics, speech and signal processing (pp. 3687-3691). IEEE. https://doi.org/10.1109/icassp.2013.6638346

Liu, H., Mi, X. W., & Li, Y. F. (2018). Wind speed forecasting method based on deep learning strategy using empirical wavelet transform, long short term memory neural network and Elman neural network. Energy Conversion and Management, 156, 498–514. https://doi.org/10.1016/j.enconman.2017.11.053

Liu, X., Chen, Q., Wu, X., Liu, Y., & Yang, L. (2017). CNN based music emotion classification. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1704.05665

Livingstone, S. R., & Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PloS One, 13(5), e0196391. https://doi.org/10.1371/journal.pone.0196391

Masood, S., Nayal, J. S., & Jain, R. K. (2016, July). Singer identification in Indian Hindi songs using MFCC and spectral features. In 2016 IEEE 1st International Conference on Power Electronics, Intelligent Control and Energy Systems (ICPEICES) (pp. 1-5). IEEE. https://doi.org/10.1109/icpeices.2016.7853641

Modran, H. A., Chamunorwa, T., Ursuțiu, D., Samoilă, C., & Hedeșiu, H. (2023). Using deep learning to recognize therapeutic effects of music based on emotions. Sensors, 23(2), 986. https://doi.org/10.3390/s23020986

Murthy, Y. V., Jeshventh, T. K. R., Zoeb, M., Saumyadip, M., & Shashidhar, G. K. (2018, August). Singer identification from smaller snippets of audio clips using acoustic features and DNNs. In 2018 eleventh international conference on contemporary computing (IC3) (pp. 1-6). IEEE. https://doi.org/10.1109/ic3.2018.8530602

Mustaqeem, N., & Kwon, S. (2019). A CNN-Assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1), 183. https://doi.org/10.3390/s20010183

Panda, R., Malheiro, R., & Paiva, R. P. (2020). Novel audio features for music emotion recognition. IEEE Transactions on Affective Computing, 11(4), 614–626. https://doi.org/10.1109/taffc.2018.2820691

Panda, R., Rocha, B., & Paiva, R. P. (2015). Music Emotion Recognition with Standard and Melodic Audio Features. Applied Artificial Intelligence, 29(4), 313–334. https://doi.org/10.1080/08839514.2015.1016389

Patra, B. G., Das, D., & Bandyopadhyay, S. (2016). Labeling data and developing supervised framework for Hindi music mood analysis. Journal of Intelligent Information Systems, 48(3), 633–651. https://doi.org/10.1007/s10844-016-0436-1

Patra, B. G., Das, D., & Bandyopadhyay, S. (2018). Multimodal mood classification of Hindi and Western songs. Journal of Intelligent Information Systems, 51(3), 579–596. https://doi.org/10.1007/s10844-018-0497-4

Qing, X., & Niu, Y. (2018). Hourly day-ahead solar irradiance prediction using weather forecasts by LSTM. Energy, 148, 461–468. https://doi.org/10.1016/j.energy.2018.01.177

Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161–1178. https://doi.org/10.1037/h0077714

Schmidt, E. M., Turnbull, D., & Kim, Y. E. (2010, March). Feature selection for content-based, time-varying musical emotion regression. In Proceedings of the international conference on Multimedia information retrieval (pp. 267-274). https://doi.org/10.1145/1743384.1743431

Soleymani, M., Caro, M. N., Schmidt, E. M., Sha, C. Y., & Yang, Y. H. (2013, October). 1000 songs for emotional analysis of music. In Proceedings of the 2nd ACM international workshop on Crowdsourcing for multimedia (pp. 1-6). https://doi.org/10.1145/2506364.2506365

Tasdelen, A., & Sen, B. (2021). A hybrid CNN-LSTM model for pre-miRNA classification. Scientific Reports, 11(1), 14125. https://doi.org/10.1038/s41598-021-93656-0

Weninger, F., Eyben, F., & Schuller, B. (2014, May). On-line continuous-time music mood regression with deep recurrent neural networks. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5412-5416). IEEE. https://doi.org/10.1109/icassp.2014.6854637

Yang, J. (2021). A novel music emotion recognition model using neural network technology. Frontiers in Psychology, 12, 760060. https://doi.org/10.3389/fpsyg.2021.760060

Zhang, F., Meng, H., & Li, M. (2016, August). Emotion extraction and recognition from music. In 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD) (pp. 1728-1733). IEEE. https://doi.org/10.1109/fskd.2016.7603438

Zhou, Q., Shan, J., Ding, W., Wang, C., Yuan, S., Sun, F., ... & Fang, B. (2021). Cough recognition based on mel-spectrogram and convolutional neural network. Frontiers in Robotics and AI, 8, 580080. https://doi.org/10.3389/frobt.2021.580080

Zuber, S., & Vidhya, K. (2022, July). Detection and analysis of emotion recognition from speech signals using Decision Tree and comparing with Support Vector Machine. In 2022 International Conference on Innovative Computing, Intelligent Communication and Smart Electrical Systems (ICSES) (pp. 1-5). IEEE. https://doi.org/10.1109/icses55317.2022.9914046

Published
2024-08-03
How to Cite
Dutta, J., & Chanda, D. (2024). MUSIC EMOTION RECOGNITION AND CLASSIFICATION USING HYBRID CNN-LSTM DEEP NEURAL NETWORK. Bangladesh Journal of Multidisciplinary Scientific Research, 9(3), 21-32. https://doi.org/10.46281/bjmsr.v9i3.2230
Section
Research Paper/Theoretical Paper/Review Paper/Short Communication Paper