Learning modality-fused representation based on transformer for emotion analysis
25 November 2022
Piao Shi, Min Hu, Fuji Ren, Xuefeng Shi, Liangfeng Xu
Abstract

Learning a modality-fused representation is an essential and challenging task in multimodal emotion analysis. Previous studies have made considerable progress, but two problems remain: insufficient feature interaction and coarse data fusion. To address them, we first propose a hybrid architecture that combines convolution and a transformer to extract local and global features. Second, to extract richer mutual features from multimodal data, our model comprises three parts: (1) the interior transformer encoder (TE), which extracts intramodality characteristics from a single modality; (2) the between TE, which extracts intermodality features between two different modalities; and (3) the enhance TE, which extracts enhanced features for the target modality from the multimodal input. Finally, instead of fusing features directly with a linear function, we employ the widely used multimodal factorized high-order pooling mechanism to obtain a more discriminative feature representation. Extensive experiments on three multimodal sentiment datasets (CMU-MOSEI, CMU-MOSI, and IEMOCAP) demonstrate that our approach reaches the state of the art in the unaligned setting and outperforms mainstream methods in both the word-aligned and unaligned settings.
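The fusion step named in the abstract, multimodal factorized high-order pooling, can be illustrated with a small sketch. The abstract does not give the paper's configuration, so everything below is an assumption based on the general form of the technique: each block projects two feature vectors, multiplies them elementwise, multiplies by the previous block's expanded feature, sum-pools, and normalizes; the block outputs are then concatenated. The names `mfh_fuse`, `sum_pool`, `normalize`, and all dimensions are hypothetical, the weights are random rather than learned, and dropout is omitted.

```python
import math
import random


def matvec_t(weights, vec):
    """Compute W^T v for W stored as a list of rows (len(vec) x out_dim)."""
    out_dim = len(weights[0])
    return [sum(weights[i][j] * vec[i] for i in range(len(vec)))
            for j in range(out_dim)]


def sum_pool(vec, k):
    """Sum non-overlapping windows of size k."""
    return [sum(vec[i:i + k]) for i in range(0, len(vec), k)]


def normalize(vec, eps=1e-12):
    """Signed square-root (power) normalization, then L2 normalization."""
    powered = [math.copysign(math.sqrt(abs(v)), v) for v in vec]
    norm = math.sqrt(sum(v * v for v in powered))
    return [v / (norm + eps) for v in powered]


def mfh_fuse(x, y, blocks, k):
    """Fuse x and y with cascaded factorized bilinear blocks (assumed form).

    blocks is a list of (U, V) pairs; U is len(x) x (o*k), V is len(y) x (o*k).
    Returns the concatenation of one o-dimensional output per block.
    """
    prev = None
    outputs = []
    for U, V in blocks:
        # Elementwise product of the two projected modalities.
        expanded = [a * b for a, b in zip(matvec_t(U, x), matvec_t(V, y))]
        # Higher-order interaction: reuse the previous block's expanded feature.
        if prev is not None:
            expanded = [e * p for e, p in zip(expanded, prev)]
        prev = expanded
        outputs.extend(normalize(sum_pool(expanded, k)))
    return outputs


def random_matrix(rows, cols, rng):
    """Random stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: two modality vectors, 2 cascaded blocks, pooling factor 2.
rng = random.Random(0)
dim_x, dim_y, out_dim, k, order = 6, 4, 3, 2, 2
blocks = [
    (random_matrix(dim_x, out_dim * k, rng), random_matrix(dim_y, out_dim * k, rng))
    for _ in range(order)
]
x = [rng.uniform(-1.0, 1.0) for _ in range(dim_x)]
y = [rng.uniform(-1.0, 1.0) for _ in range(dim_y)]
fused = mfh_fuse(x, y, blocks, k)
```

Here `fused` has `out_dim * order` entries, one normalized segment per block. Compared with a single linear layer over concatenated features, the elementwise products let every output dimension depend on pairwise (and, in later blocks, higher-order) interactions between the two inputs.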

© 2022 SPIE and IS&T
Piao Shi, Min Hu, Fuji Ren, Xuefeng Shi, and Liangfeng Xu "Learning modality-fused representation based on transformer for emotion analysis," Journal of Electronic Imaging 31(6), 063032 (25 November 2022). https://doi.org/10.1117/1.JEI.31.6.063032
Received: 18 July 2022; Accepted: 3 November 2022; Published: 25 November 2022
CITATIONS
Cited by 2 scholarly publications.
KEYWORDS
Transformers

Emotion

Data modeling

Video

Feature extraction

Convolution

Visualization
