- Shin, Jungkyoo;
- Kang, Sungmin;
- Cho, Yoonsik;
- Kim, Eunwoo
Abstract
In this paper, we introduce a novel approach to capture temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), a novel approach that enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of the temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing semantic content comprehension. Based on this, we present an efficient multi-scale temporal encoder designed to capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments across four datasets (MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions) and two distinct tasks (text-video retrieval and video captioning) show consistent performance improvements, highlighting the significance of the presented multi-scale approach. Copyright © 2025 Elsevier Ltd. All rights reserved.
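The abstract does not spell out how the temporal position embedding is adjusted, so the following is only an illustrative sketch of one way a scale-dependent position index could work, not the authors' formulation. It assumes (hypothetically) that at temporal scale `s`, every run of `s` consecutive clips shares a single position index, so the same encoder input keeps one feature per clip while seeing coarser temporal positions. The function names, the floor-division rule, and the random position table are all assumptions made for illustration.

```python
import numpy as np


def scaled_position_ids(num_clips: int, scale: int) -> np.ndarray:
    """Hypothetical rule: clips in the same window of `scale` clips share one index."""
    return np.arange(num_clips) // scale


def add_scaled_positions(clip_feats: np.ndarray,
                         pos_table: np.ndarray,
                         scale: int) -> np.ndarray:
    """Add position embeddings selected by scale-dependent indices.

    The clip features are not pooled or merged: the output still has one
    row per clip, only the position embedding added to each row changes.
    """
    ids = scaled_position_ids(clip_feats.shape[0], scale)
    return clip_feats + pos_table[ids]


# Toy data: 8 clips with 4-dimensional features (random stand-ins, not real video features).
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(8, 4))
pos_table = rng.normal(size=(8, 4))  # stand-in for a learned position table

# The same clips prepared at three assumed temporal scales for one shared encoder.
multi_scale_inputs = {
    s: add_scaled_positions(clip_feats, pos_table, s) for s in (1, 2, 4)
}
```

Under this assumed rule, scale 1 reproduces ordinary per-clip positions, while larger scales make neighbouring clips share a position without averaging their features, which is one reading of the abstract's contrast with methods that aggregate clips.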
Keywords
- Title: Dynamic scale position embedding for cross-modal representation learning
- Authors: Shin, Jungkyoo; Kang, Sungmin; Cho, Yoonsik; Kim, Eunwoo
- Publication date: 2026-01
- Type: Article
- Journal: Neural Networks
- Volume: 193