A method for emotion transition recognition using cross-modal feature fusion and global perception.
10.7507/1001-5515.202504040
- Author: Lilin JIE¹; Yangmeng ZOU¹; Zhengxiu LI¹; Baoliang LYU²; Weilong ZHENG²; Ming LI¹
- Author Information:
1. Key Laboratory of Jiangxi Province for Image Processing and Pattern Recognition, Nanchang Hangkong University, Nanchang 330063, P. R. China.
2. The Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, P. R. China.
- Publication Type:Journal Article
- Keywords:
Cross-modal feature fusion;
Deep canonical correlation analysis;
Deep learning model;
Emotion recognition;
Emotion transition
- MeSH:
Humans;
Emotions/physiology*;
Electroencephalography;
Neural Networks, Computer;
Eye Movements;
Perception
- From:
Journal of Biomedical Engineering
2025;42(5):977-986
- Country:China
- Language:Chinese
- Abstract:
Current studies on electroencephalogram (EEG) emotion recognition concentrate primarily on discrete stimulus paradigms under controlled laboratory settings, which cannot adequately represent the dynamic transition characteristics of emotional states during multi-context interactions. To address this issue, this paper proposes a novel method for emotion transition recognition based on a cross-modal feature fusion and global perception network (CFGPN). First, an experimental paradigm encompassing six types of emotion transition scenarios was designed, and EEG and eye movement data were simultaneously collected from 20 participants and annotated with dynamic continuous emotion labels. Subsequently, deep canonical correlation analysis integrated with a cross-modal attention mechanism was employed to fuse features from the EEG and eye movement signals, yielding multimodal feature vectors enriched with highly discriminative emotional information. These vectors were then fed into a parallel hybrid architecture combining convolutional neural networks (CNNs) and Transformers: the CNN captures local time-series features, whereas the Transformer leverages its strong global perception capability to model long-range temporal dependencies, enabling accurate dynamic emotion transition recognition. The results demonstrate that the proposed method achieves the lowest mean squared error in both valence and arousal recognition tasks on the dynamic emotion transition dataset and a classic multimodal emotion dataset, and exhibits superior recognition accuracy and stability compared with five existing unimodal and six multimodal deep learning models. The approach enhances both adaptability and robustness in recognizing emotional state transitions in real-world scenarios, showing promising potential for applications in biomedical engineering.
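To make the pipeline described in the abstract concrete, the sketch below shows one possible realization in Python/PyTorch: cross-modal attention fusion of EEG and eye-movement feature sequences followed by a parallel CNN + Transformer branch and a continuous valence/arousal regression head. This is a minimal illustrative sketch, not the authors' implementation; the module names, layer sizes, mean pooling, and the use of nn.MultiheadAttention are assumptions, and the deep canonical correlation analysis alignment is assumed to have been applied upstream to produce the per-modality feature sequences.

```python
# Illustrative sketch only (assumed dimensions and layers); not the published CFGPN code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse EEG and eye-movement feature sequences with bidirectional cross-modal attention."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.eeg_to_eye = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.eye_to_eeg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, eeg, eye):                      # each: (batch, time, dim)
        a, _ = self.eeg_to_eye(eeg, eye, eye)         # EEG queries attend to eye features
        b, _ = self.eye_to_eeg(eye, eeg, eeg)         # eye queries attend to EEG features
        return self.proj(torch.cat([a, b], dim=-1))   # fused multimodal sequence

class CFGPNSketch(nn.Module):
    """Parallel CNN (local temporal patterns) + Transformer (long-range dependencies),
    followed by regression of continuous valence and arousal."""
    def __init__(self, dim=64, heads=4, layers=2):
        super().__init__()
        self.fusion = CrossModalFusion(dim, heads)
        self.cnn = nn.Sequential(                     # local time-series branch
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.head = nn.Linear(2 * dim, 2)             # continuous valence and arousal

    def forward(self, eeg, eye):
        fused = self.fusion(eeg, eye)                               # (B, T, dim)
        local = self.cnn(fused.transpose(1, 2)).transpose(1, 2)    # CNN branch
        global_ctx = self.transformer(fused)                        # Transformer branch
        feats = torch.cat([local, global_ctx], dim=-1).mean(dim=1)  # pool over time
        return self.head(feats)                                     # (B, 2)

# Usage with dummy batches: 8 trials, 10 time steps, 64-d features per modality.
eeg = torch.randn(8, 10, 64)
eye = torch.randn(8, 10, 64)
print(CFGPNSketch()(eeg, eye).shape)   # torch.Size([8, 2])
```

The two branches are computed in parallel over the same fused sequence and concatenated before pooling, which mirrors the abstract's description of combining local temporal modeling with global perception; training against continuous labels would use a mean squared error loss.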