Detail View
- Lee, Sang-Rak;
- Moon, A-Seong;
- Sohn, Bong-Soo;
- Lee, Jaesung
Abstract
Audio-visual event localization refers to identifying events that are both visible and audible in a video by jointly modeling the auditory and visual modalities to detect these events in temporal video segments. A challenge arises when the audio and visual contexts are inconsistent even though both carry clearly present information (e.g., the on-screen visual shows a baby crying while an off-screen female voice is speaking, so the two modalities convey conflicting information). In such cases, both modalities exhibit high significance values, causing the model to misclassify background segments as an event. To address this, we propose a CLIP-based global context regulation method that leverages a pre-trained AudioCLIP encoder. This approach effectively regulates event-relevant scores through post-processing and performs well even with limited training data containing inconsistencies. We also introduce a benchmark dataset annotated for inconsistent cases to facilitate robust evaluation. Experimental results demonstrate that our model outperforms existing methods and achieves state-of-the-art performance in event localization. These findings highlight the importance of regulating event overconfidence under multimodal inconsistency, contributing to more accurate event localization in real-world applications. Our code and dataset are available at: https://github.com/PangRAK/GCRN
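The regulation idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the cosine-similarity consistency measure, the 0.5 threshold, and the linear rescaling are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def regulate_event_scores(event_scores, audio_emb, visual_emb, threshold=0.5):
    """Down-weight per-segment event scores when the global audio and visual
    embeddings disagree (a hypothetical stand-in for the paper's CLIP-based
    global context regulation applied as post-processing)."""
    consistency = cosine_similarity(audio_emb, visual_emb)
    if consistency < threshold:
        # Inconsistent global context: suppress overconfident event scores
        # so background segments are less likely to be labeled as events.
        scale = max(consistency, 0.0)
        return [s * scale for s in event_scores]
    return list(event_scores)
```

Under this sketch, consistent audio/visual embeddings leave segment scores untouched, while inconsistent ones (e.g., orthogonal embeddings) suppress them toward background.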
Keywords
- Title
- Effective audio-visual event localization using CLIP-based global context regulation for mitigating event overconfidence
- Authors
- Lee, Sang-Rak; Moon, A-Seong; Sohn, Bong-Soo; Lee, Jaesung
- Publication Date
- 2026-09
- Type
- Article
- Volume
- 177