Robust Audio-Visual Fusion for Emotion Recognition Based on Cross-Modal Learning under Noisy Conditions

Moon, A-Seong; Jeong, Seungyeon; Kim, Donghee; Zulkifley, Mohd Asyraf; Sohn, Bong-Soo; Lee, Jaesung

doi:10.32604/cmc.2025.067103

Citations (Web of Science): 0

Citations (Scopus): 0

Abstract

Emotion recognition under uncontrolled and noisy environments presents persistent challenges in the design of emotionally responsive systems. The current study introduces an audio-visual recognition framework designed to address performance degradation caused by environmental interference, such as background noise, overlapping speech, and visual obstructions. The proposed framework employs a structured fusion approach, combining early-stage feature-level integration with decision-level coordination guided by temporal attention mechanisms. Audio data are transformed into mel-spectrogram representations, and visual data are represented as raw frame sequences. Spatial and temporal features are extracted through convolutional and transformer-based encoders, allowing the framework to capture complementary and hierarchical information from both sources. A cross-modal attention module enables selective emphasis on relevant signals while suppressing modality-specific noise. Performance is validated on a modified version of the AFEW dataset, in which controlled noise is introduced to emulate realistic conditions. The framework achieves higher classification accuracy than comparative baselines, confirming increased robustness under conditions of cross-modal disruption. This result demonstrates the suitability of the proposed method for deployment in practical emotion-aware technologies operating outside controlled environments. The study also contributes a systematic approach to fusion design and supports further exploration in the direction of resilient multimodal emotion analysis frameworks. The source code is publicly available at https://github.com/asmoon002/AVER (accessed on 18 August 2025).
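The cross-modal attention step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (that is in the linked repository); it assumes plain scaled dot-product attention with no learned projections, and the function names and toy inputs are hypothetical. Queries come from one modality (here audio) and keys and values from the other (here video), so each audio step is re-expressed as a weighted mix of the video steps it matches best.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: list of T_q vectors from one modality (e.g. audio), length d
    keys:    list of T_k vectors from the other modality, length d
    values:  list of T_k vectors from the other modality, length d_v
    Returns (attended, weights): T_q vectors of length d_v and the
    T_q x T_k attention matrix.
    """
    scale = math.sqrt(len(queries[0]))
    attended, weights = [], []
    for q in queries:
        # Similarity of this query step to every key step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors, one output channel at a time.
        attended.append([
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ])
    return attended, weights


# Toy inputs (hypothetical, not from the paper): 2 audio steps, 3 video steps.
audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = cross_modal_attention(audio, video, video)
```

In this toy case the first audio step puts most of its weight on the first video step and the second audio step on the second, while the all-zero video step receives little weight. A trained model would additionally apply learned query, key, and value projections, typically with multiple heads.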

Keywords

cross-modal attention; emotion recognition; multimodal learning; robust representation learning

Title: Robust Audio-Visual Fusion for Emotion Recognition Based on Cross-Modal Learning under Noisy Conditions

Authors: Moon, A-Seong; Jeong, Seungyeon; Kim, Donghee; Zulkifley, Mohd Asyraf; Sohn, Bong-Soo; Lee, Jaesung

DOI: 10.32604/cmc.2025.067103

Publication year: 2025

Type: Article

Journal: Computers, Materials and Continua

Volume: 85

Issue: 2

Pages: 2851-2872
