Dynamic scale position embedding for cross-modal representation learning

Shin, Jungkyoo; Kang, Sungmin; Cho, Yoonsik; Kim, Eunwoo

doi:10.1016/j.neunet.2025.108087

Citations (Web of Science): 0

Abstract

In this paper, we introduce a novel approach to capturing temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), which enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of the temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing semantic content comprehension. Based on this, we present an efficient multi-scale temporal encoder designed to capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments on four datasets (MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions) and two distinct tasks (text-video retrieval and video captioning) show consistent performance improvements, highlighting the significance of the presented multi-scale approach. Copyright © 2025 Elsevier Ltd. All rights reserved.
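
The abstract does not say how the temporal position embedding is adjusted, so the following is only an illustrative sketch of one possible reading, not the authors' implementation. It assumes that a temporal scale can be expressed by letting `scale` consecutive clips share one position index, while the clip features themselves are never pooled or merged. The function names, the integer-division rule, and the random stand-in tensors are all hypothetical.

```python
import numpy as np


def scaled_position_ids(num_clips: int, scale: int) -> np.ndarray:
    """Assumed rule: clip i receives position index i // scale,
    so `scale` consecutive clips share the same temporal position."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    return np.arange(num_clips) // scale


def add_scaled_position_embedding(
    clip_features: np.ndarray, position_table: np.ndarray, scale: int
) -> np.ndarray:
    """Add position embeddings looked up with scale-adjusted indices.

    clip_features:  (T, D) array, one feature vector per clip.
    position_table: (P, D) array of position embeddings, with P >= T.
    The clip features are not aggregated; only the added positions change.
    """
    ids = scaled_position_ids(clip_features.shape[0], scale)
    return clip_features + position_table[ids]


# Random stand-ins for clip features and a learned position table.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))
table = rng.normal(size=(8, 4))

# The same features viewed at three hypothetical temporal scales.
multi_scale_inputs = {
    s: add_scaled_position_embedding(features, table, s) for s in (1, 2, 4)
}
```

Under this assumption, each entry of `multi_scale_inputs` keeps all eight clip vectors and differs only in the position information added to them, so the same transformer could in principle process every scale.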

Keywords

Multi-modal learning; Position embedding; Representation learning

Title: Dynamic scale position embedding for cross-modal representation learning

Authors: Shin, Jungkyoo; Kang, Sungmin; Cho, Yoonsik; Kim, Eunwoo

DOI: 10.1016/j.neunet.2025.108087

Publication date: 2026-01

Type: Article

Journal: Neural Networks

Volume: 193