- Koo, Hyunjoon;
- Shin, Jungkyoo;
- Kim, Eunwoo
Abstract
In multi-modal understanding, the text–video retrieval task, which aims to align videos with their corresponding texts, has gained increasing attention. Previous studies aligned fine-grained and coarse-grained features of videos and texts within a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach that disentangles the distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs its own neural network and loss function, which lets each feature be learned independently and captures both detailed and comprehensive features. We demonstrate the effectiveness of our method on the text–video retrieval task across three benchmarks, showing improvements over existing methods. It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC, and MSVD, respectively. © 2025 Elsevier B.V.
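The R@1 figures quoted in the abstract are recall at rank 1: the fraction of text queries whose matching video is ranked first. The following is a minimal sketch of that metric in plain Python, not the authors' evaluation code. It assumes a similarity matrix in which `sim[i][j]` scores text `i` against video `j` and the correct video for text `i` sits at index `i`; the function name `recall_at_k` and the toy numbers are hypothetical and chosen only for illustration.

```python
def recall_at_k(sim, k=1):
    """Fraction of text queries whose matching video ranks in the top k.

    Assumption: sim[i][j] scores text i against video j, and the correct
    video for text i is at index i (an illustrative layout, not taken
    from the paper).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly
        # higher, so ties are resolved in favour of the correct video.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(sim)


# Toy similarity matrix: 3 text queries x 3 videos (made-up numbers).
sim = [
    [0.9, 0.1, 0.3],  # correct video ranked first
    [0.2, 0.4, 0.8],  # correct video ranked second
    [0.1, 0.2, 0.7],  # correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

Under these assumptions, two of the three toy queries retrieve their video at rank 1, and all three do so within the top 2.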
Keywords
- Title
- Dual-branch scale disentanglement for text–video retrieval
- Authors
- Koo, Hyunjoon; Shin, Jungkyoo; Kim, Eunwoo
- Publication date
- 2025-10
- Type
- Article
- Volume
- 196
- Pages
- 296 ~ 302