Dual-branch scale disentanglement for text–video retrieval

Koo, Hyunjoon; Shin, Jungkyoo; Kim, Eunwoo

doi:10.1016/j.patrec.2025.06.014



Citations (Web of Science): 0

Citations (Scopus): 0

Abstract

In multi-modal understanding, the text–video retrieval task, which aims to align videos with their corresponding texts, has gained increasing attention. Previous studies aligned fine-grained and coarse-grained features of videos and texts within a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach that disentangles distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs its own neural network and loss function, so that each feature type is learned independently and both detailed and comprehensive features are captured effectively. We demonstrate the effectiveness of our method on the text–video retrieval task across three benchmarks, showing improvements over existing methods. It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC, and MSVD, respectively. © 2025 Elsevier B.V.
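The R@1 figures quoted in the abstract refer to Recall@1, the fraction of queries whose correct match is ranked first. As a rough illustration of how such a score is typically computed, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name `recall_at_k`, the layout of the similarity matrix (the correct video for text i is assumed to be video i), and the toy scores are all assumptions made for this example.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity between text i and video j. The correct
    video for text i is ASSUMED to be video i (illustrative layout only).
    """
    if not sim:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank of the correct video: the number of videos
        # that score strictly higher than it for this text query.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix with made-up scores (not data from the paper).
toy_sim = [
    [0.9, 0.1, 0.3],  # text 0: correct video is ranked first
    [0.2, 0.4, 0.8],  # text 1: correct video is ranked second
    [0.1, 0.2, 0.7],  # text 2: correct video is ranked first
]
r1 = recall_at_k(toy_sim, k=1)  # two of three queries hit at rank 1
r2 = recall_at_k(toy_sim, k=2)  # all three queries hit within the top 2
```

Under these assumptions, an improvement of +1.0% in R@1 would mean that one additional query per hundred retrieves its correct video at the top position.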

Keywords

Contrastive learning; Dual-path learning; Multi-modal learning; Text–video retrieval

Title: Dual-branch scale disentanglement for text–video retrieval

Authors: Koo, Hyunjoon; Shin, Jungkyoo; Kim, Eunwoo

DOI: 10.1016/j.patrec.2025.06.014

Publication date: 2025-10

Type: Article

Journal: Pattern Recognition Letters

Volume: 196

Pages: 296–302