Effective audio-visual event localization using CLIP-based global context regulation for mitigating event overconfidence

Abstract

Audio-visual event localization identifies events that are both visible and audible in a video by jointly modeling the auditory and visual modalities across temporal segments. A challenge arises when the audio and visual contexts are inconsistent even though each modality carries a clear signal (e.g., the on-screen visuals show a baby crying while an off-screen woman is speaking), so the two modalities convey conflicting information. In such cases, both modalities receive high significance scores, causing the model to misclassify background segments as events. To address this, we propose a CLIP-based global context regulation method that leverages a pre-trained AudioCLIP encoder. The approach regulates event-relevance scores through post-processing and performs well even with limited training data containing such inconsistencies. We also introduce a benchmark dataset annotated for inconsistent cases to enable robust evaluation. Experimental results show that our model outperforms existing methods and achieves state-of-the-art performance in event localization. These findings highlight the importance of regulating event overconfidence under multimodal inconsistency, contributing to more accurate event localization in real-world applications. Our code and dataset are available at: https://github.com/PangRAK/GCRN
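The abstract describes the regulation mechanism only at a high level. The Python sketch below illustrates the general idea of post-processing per-segment event scores with a consistency gate computed in a shared audio-visual embedding space; it is a minimal illustration under stated assumptions, and the gating function, the alpha blend, and all names in it are hypothetical, not the authors' GCRN formulation (see the linked repository for the actual implementation).

import numpy as np

def _cos(a, b):
    # Cosine similarity along the last axis; supports broadcasting.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return (a * b).sum(axis=-1)

def regulate_event_scores(scores, audio_emb, visual_emb, alpha=0.5):
    # Hypothetical sketch, not the paper's exact method.
    # scores: (T,) per-segment event-relevance scores from a localizer.
    # audio_emb, visual_emb: (T, d) segment embeddings assumed to live in a
    # shared CLIP-style space (e.g., from an AudioCLIP-like encoder).
    # Global context: mean of all segment embeddings from both modalities.
    global_ctx = np.concatenate([audio_emb, visual_emb], axis=0).mean(axis=0)
    # Per-segment cross-modal consistency and agreement with global context.
    av_consistency = _cos(audio_emb, visual_emb)
    ctx_agreement = 0.5 * (_cos(audio_emb, global_ctx) + _cos(visual_emb, global_ctx))
    # Gate in [0, 1]: small when the modalities conflict, so conflicting
    # segments cannot keep uniformly high event scores.
    gate = np.clip(0.5 * (av_consistency + ctx_agreement), 0.0, 1.0)
    return scores * (alpha + (1.0 - alpha) * gate)

# Toy usage: the first half of the segments has consistent audio/visual
# embeddings; the conflicting second half is down-weighted.
rng = np.random.default_rng(0)
T, d = 10, 32
scores = np.full(T, 0.9)
audio = rng.normal(size=(T, d))
visual = rng.normal(size=(T, d))
visual[:5] = audio[:5] + 0.1 * rng.normal(size=(5, d))
print(regulate_event_scores(scores, audio, visual).round(3))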

Keywords

Audio-visual event localization; Context regulation; Cross-modality attention; Multimodal learning
Authors
Lee, Sang-Rak; Moon, A-Seong; Sohn, Bong-Soo; Lee, Jaesung
DOI
10.1016/j.patcog.2026.113312
Publication date
2026-09
Type
Article
Journal
Pattern Recognition
Volume
177