MUSIC EMOTION RECOGNITION AND CLASSIFICATION USING HYBRID CNN-LSTM DEEP NEURAL NETWORK
Abstract
In music information retrieval (MIR), emotion-based classification is a complex and challenging task. In modern information technology, understanding musical emotion through human-computer interaction plays a vital role and has captured the attention of both researchers and the music industry. This paper presents a comparative study of decision tree, random forest, k-nearest neighbors, multi-layer perceptron (MLP), long short-term memory (LSTM), convolutional neural network (CNN), and hybrid CNN-LSTM deep learning approaches, combined with relevant feature extraction techniques. Emotion recognition in music has been studied for languages such as English, Chinese, Spanish, Turkish, and Hindi; however, languages like Assamese have drawn very little attention in music emotion recognition (MER) research. This work proposes a novel approach to emotion recognition in Assamese songs. Two datasets are used: a newly created Assamese dataset of 200 song samples covering four emotions, and the RAVDESS emotional song database, which consists of 1012 song samples covering six emotions. Relevant features, namely mel-frequency cepstral coefficients (MFCCs), mel spectrograms, and chroma features, are extracted from the song samples to evaluate the proposed method. A comparative analysis across the classifiers shows that the CNN-LSTM model achieves the highest accuracy on both datasets: 85.00% on the Assamese dataset and 89.66% on the RAVDESS dataset.
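The abstract names three feature families (MFCCs, mel spectrograms, chroma) extracted per song. The following is a minimal sketch of that step using the librosa library; the parameter choices (n_mfcc=40, default hop length, time-averaging into one vector per song) are illustrative assumptions, not the authors' reported settings.

```python
# Illustrative feature extraction for MER, assuming librosa.
# Settings here are assumptions, not the paper's exact configuration.
import numpy as np
import librosa

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    """Return one fixed-length vector of MFCC, mel-spectrogram,
    and chroma features, each averaged over time."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)    # shape (40, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)      # shape (128, T)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # shape (12, T)
    # Collapse the time axis so every song yields one feature vector.
    return np.concatenate([mfcc.mean(axis=1),
                           mel.mean(axis=1),
                           chroma.mean(axis=1)])          # shape (180,)
```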
JEL Classification Codes: C880, Y100, Y800.
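For concreteness, below is a minimal sketch, assuming TensorFlow/Keras, of a hybrid CNN-LSTM classifier of the general kind the abstract describes: convolutional layers learn local patterns in the input features, an LSTM layer models their sequential structure, and a softmax layer predicts one of the emotion classes (four for the Assamese dataset). The layer sizes, input length (matching the 180-dimensional vector sketched above), and hyperparameters are illustrative assumptions, not the authors' published architecture.

```python
# Illustrative hybrid CNN-LSTM emotion classifier, assuming TensorFlow/Keras.
# Architecture details are assumptions; see the paper for the actual model.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_len: int = 180, n_classes: int = 4) -> tf.keras.Model:
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),  # feature vector as a 1-D sequence
        layers.Conv1D(64, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, activation="relu", padding="same"),
        layers.MaxPooling1D(2),
        layers.LSTM(64),                     # summarize the CNN feature maps
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A standard train-test split with per-feature normalization would be the typical way to evaluate such a model; the abstract does not specify the authors' exact training protocol.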
Copyright (c) 2024 Jumpi Dutta, Dipankar Chanda
This work is licensed under a Creative Commons Attribution 4.0 International License.