Deep Learning Multimodal Features Enhanced with Cross-Modal Attention for Emotion Recognition
Keywords:
Multimodal Emotion Recognition, Deep Learning Multimodal Feature Extraction, Cross-Modal Attention, Multivariate LSTM

Abstract
Human emotion recognition is becoming increasingly important in human-computer interaction and stress-management applications. In these applications, multimodal features extracted from modalities such as visual, speech, text, and biomedical sensor readings are used to classify emotions. Effective emotion recognition has motivated extensive research into which modalities, and which of their features, should be considered for emotion classification. This work explores multiple modalities and proposes novel deep learning based multimodal features that are enhanced with cross-modal attention for the recognition of basic human emotions. The proposed solution provides higher discriminative ability in three classification settings: basic emotions, emotion activation (positive, negative), and emotion arousal (high, low).
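The abstract does not give implementation details, but the following minimal PyTorch sketch illustrates one plausible way to enhance per-modality deep features with cross-modal attention before sequence modeling with an LSTM, matching the keywords above. All module names, dimensions, modality pairing, and the overall layout are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch: cross-modal attention feature enhancement followed by
# a multivariate LSTM classifier. Dimensions, module names, and the two-modality
# setup (audio/visual) are assumptions for illustration only.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim=128, num_heads=4, num_classes=6):
        super().__init__()
        # Each modality's feature sequence attends over the other modality.
        self.audio_to_visual = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # LSTM over the fused (concatenated) multivariate feature sequence.
        self.lstm = nn.LSTM(input_size=2 * dim, hidden_size=dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats, visual_feats: (batch, time, dim) deep features assumed to
        # come from modality-specific encoders (e.g., CNN/transformer backbones).
        a_enh, _ = self.audio_to_visual(audio_feats, visual_feats, visual_feats)
        v_enh, _ = self.visual_to_audio(visual_feats, audio_feats, audio_feats)
        fused = torch.cat([a_enh, v_enh], dim=-1)   # (batch, time, 2*dim)
        _, (h_n, _) = self.lstm(fused)              # final hidden state
        return self.classifier(h_n[-1])             # emotion logits


# Usage with random tensors standing in for real per-modality features.
model = CrossModalAttentionFusion()
audio = torch.randn(8, 50, 128)
visual = torch.randn(8, 50, 128)
logits = model(audio, visual)                       # shape: (8, num_classes)
```

The same pattern extends to the three settings mentioned in the abstract by swapping the classification head: num_classes=6 for basic emotions, or 2 for the activation (positive/negative) and arousal (high/low) tasks.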
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
You are free to:
- Share — copy and redistribute the material in any medium or format
- Adapt — remix, transform, and build upon the material for any purpose, even commercially.
Terms:
- Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.