Dynamic scale position embedding for cross-modal representation learning

Abstract

In this paper, we introduce a novel approach for capturing temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), which enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of the temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing comprehension of semantic content. Based on this, we present an efficient multi-scale temporal encoder designed to capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments across four datasets (MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions) and two distinct tasks (text-video retrieval and video captioning) show consistent performance improvements, highlighting the significance of the presented multi-scale approach. Copyright © 2025 Elsevier Ltd. All rights reserved.

Keywords

Multi-modal learning; Position embedding; Representation learning
Title
Dynamic scale position embedding for cross-modal representation learning
Authors
Shin, Jungkyoo; Kang, Sungmin; Cho, Yoonsik; Kim, Eunwoo
DOI
10.1016/j.neunet.2025.108087
Publication date
2026-01
Type
Article
Journal
Neural Networks
193