Dual-branch scale disentanglement for text–video retrieval

Abstract

In multi-modal understanding, the text–video retrieval task, which aims to align videos with their corresponding texts, has gained increasing attention. Previous studies aligned fine-grained and coarse-grained features of videos and texts within a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach to disentangle distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs a different neural network and loss function, facilitating independent learning of each feature and effectively capturing detailed and comprehensive features. We demonstrate the effectiveness of our method on the text–video retrieval task across three benchmarks, showing improvements over existing methods. It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC, and MSVD, respectively. © 2025 Elsevier B.V.
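
The abstract reports gains in R@1 (Recall at rank 1), the share of text queries whose correct video is ranked first. The following is a minimal sketch of how such a metric is commonly computed from a text-to-video similarity matrix. It is not taken from the paper: the function name `recall_at_k`, the toy matrix, and the assumption that text i matches video i are illustrative only.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    sim[i][j] is the similarity between text i and video j. The true match
    for text i is assumed to be video i (an illustrative simplification).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of videos scoring strictly higher than the true match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix with made-up values (not results from the paper).
toy_sim = [
    [0.9, 0.1, 0.2],  # text 0: true video 0 ranks first
    [0.8, 0.3, 0.1],  # text 1: video 0 outranks the true video 1
    [0.2, 0.1, 0.7],  # text 2: true video 2 ranks first
]

r_at_1 = recall_at_k(toy_sim, k=1)  # two of three queries hit at rank 1
r_at_2 = recall_at_k(toy_sim, k=2)  # all three queries hit within the top 2
```

In this toy case two of the three queries retrieve their video at rank 1, so R@1 is 2/3, while widening the cutoff to k=2 recovers all three.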

Keywords

Contrastive learning; Dual-path learning; Multi-modal learning; Text–video retrieval
Title
Dual-branch scale disentanglement for text–video retrieval
Authors
Koo, Hyunjoon; Shin, Jungkyoo; Kim, Eunwoo
DOI
10.1016/j.patrec.2025.06.014
Publication date
2025-10
Type
Article
Journal
Pattern Recognition Letters
196
Pages
296–302