- Lee, Na-Hyun;
- Kang, Seong-Min;
- Cho, Yoon-Sik
SCOPUS
Abstract
With the popularity of video sharing applications and streaming platforms, video retrieval has become an active research topic. The core technique behind video retrieval is aligning and matching embeddings from two different modalities: text and video. However, conventional approaches face the challenge of the modality information gap, which is primarily due to the information disparity between vision and language. Typically, text contains general information with a relatively limited amount of detail, making it difficult to match precisely with video, which generally has specific details and extensive content. To address this problem, we propose a novel text augmentation method to reduce the information gap between modalities in the text-to-video retrieval task. First, the Latent Space Transformation Module (LAST module) generates new augmented sentences in the latent space while preserving the semantic information of the original sentences. These augmented sentences increase the amount of information in the text, helping to bridge the information gap between text and video. Second, we propose a gradual learning approach through the Gradual Impact Weighting Module (GIW module), which allows the augmented sentences to incrementally influence the model, thereby enhancing model performance. We conduct various experiments on three benchmark datasets, MSR-VTT, LSMDC, and MSVD, and our model achieves competitive performance. In particular, our model is designed in a plug-and-play manner, so it can be easily applied to various existing models without the need for additional parameters, resulting in significant performance improvements. © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.
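The two ideas named in the abstract, augmenting text embeddings in latent space and letting the augmented sentences influence training gradually, can be illustrated with a rough sketch. The abstract does not describe how the LAST or GIW modules are actually built, so everything below is an assumption made for illustration: Gaussian perturbation stands in for the latent space transformation, a linear ramp stands in for the gradual weighting, and the function names (`augment_in_latent_space`, `gradual_weight`, `combined_similarity`) are hypothetical.

```python
import math
import random


def augment_in_latent_space(text_emb, num_aug=4, noise_scale=0.05, seed=0):
    """Return `num_aug` perturbed, unit-norm copies of a text embedding.

    ASSUMPTION: small Gaussian noise stands in for the LAST module, whose
    actual transformation is not described in the abstract.
    """
    rng = random.Random(seed)
    augmented = []
    for _ in range(num_aug):
        noisy = [x + rng.gauss(0.0, noise_scale) for x in text_emb]
        norm = math.sqrt(sum(v * v for v in noisy)) or 1.0
        augmented.append([v / norm for v in noisy])
    return augmented


def gradual_weight(step, total_steps, max_weight=1.0):
    """Ramp the influence of augmented sentences linearly from 0 to max_weight.

    ASSUMPTION: a linear ramp stands in for the GIW module's schedule.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return max_weight * progress


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def combined_similarity(text_emb, video_emb, step, total_steps):
    """Blend the original similarity with the mean augmented similarity.

    The augmented term is scaled by the gradual weight, so it contributes
    nothing at step 0 and contributes fully once the ramp completes.
    """
    base = cosine(text_emb, video_emb)
    augmented = augment_in_latent_space(text_emb)
    aug_mean = sum(cosine(a, video_emb) for a in augmented) / len(augmented)
    weight = gradual_weight(step, total_steps)
    return (base + weight * aug_mean) / (1.0 + weight)
```

Neither helper introduces trainable parameters, which loosely mirrors the plug-and-play claim in the abstract, but the sketch should not be read as the authors' implementation.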
- Title: Efficient text augmentation in latent space for video retrieval
- Authors: Lee, Na-Hyun; Kang, Seong-Min; Cho, Yoon-Sik
- Publication date: 2025-07
- Type: Article
- Volume: 84
- Issue: 25
- Pages: 30135-30153