- Kang, Seong-Min;
- Lee, Na-Hyun;
- Park, Ji-Ho;
- Cho, Yoon-Sik
Web of Science: 0
Scopus: 0
Abstract
Vision-language models pretrained on image-text pairs have demonstrated strong performance in text-to-video retrieval through contrastive learning. However, videos contain much richer temporal and spatial information than their paired captions. Due to this discrepancy, each caption in the training set only corresponds to a subset of frames in its video, which poses a challenge. This challenge is amplified in negative pairs, where negative captions can be partially relevant to the video despite being mismatched. These hard negatives deviate from the conventional unimodal contrastive setting, which requires further attention. To this end, we propose Disentangled Representation Graph Infomax (DRGI), a model-agnostic framework that better exploits hard negatives. DRGI constructs fully connected graphs from disentangled video and text representations, where graph attention captures inter-node dependencies within each modality. We optimize an InfoMax objective between node-level and graph-level representations using Deep Graph Infomax. Hard negatives are treated as semantically corrupted graphs, encouraging the model to separate misleading patterns from true alignments. Extensive experiments on MSR-VTT, LSMDC, MSVD, and ActivityNet demonstrate that DRGI consistently outperforms base models, achieving state-of-the-art performance with up to 2.3% improvement in R@1 on MSR-VTT. More encouragingly, our plug-and-play framework can be seamlessly integrated into existing CLIP-based retrieval models, adding only 0.05% of parameters during training with no additional inference cost. Our code is available at https://github.com/kang7734/_DRGI_
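The Deep Graph Infomax objective mentioned in the abstract can be illustrated with a small sketch. It assumes the standard DGI formulation (a mean readout passed through a sigmoid, a bilinear discriminator, and a binary cross-entropy loss) and treats a hard negative's node embeddings as the corrupted graph. Graph attention and the disentanglement step are omitted, and all names (`readout`, `discriminator`, `dgi_loss`, `W`) are hypothetical. This is not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def readout(nodes):
    """Graph-level summary: sigmoid of the mean node embedding."""
    dim = len(nodes[0])
    return [sigmoid(sum(n[d] for n in nodes) / len(nodes)) for d in range(dim)]

def discriminator(node, summary, W):
    """Bilinear score sigmoid(node^T W summary), a value in (0, 1)."""
    Ws = [sum(W[i][j] * summary[j] for j in range(len(summary)))
          for i in range(len(node))]
    return sigmoid(sum(node[i] * Ws[i] for i in range(len(node))))

def dgi_loss(pos_nodes, neg_nodes, W, eps=1e-12):
    """Binary cross-entropy between (node, summary) pairs.

    pos_nodes: node embeddings of the true graph (label 1).
    neg_nodes: node embeddings of the corrupted graph (label 0);
               here assumed to come from a hard negative.
    """
    s = readout(pos_nodes)
    pos = [math.log(discriminator(h, s, W) + eps) for h in pos_nodes]
    neg = [math.log(1.0 - discriminator(h, s, W) + eps) for h in neg_nodes]
    return -(sum(pos) + sum(neg)) / (len(pos) + len(neg))

# Toy 2-dimensional embeddings; W is fixed to the identity for illustration
# (in training it would be a learned parameter).
W = [[1.0, 0.0], [0.0, 1.0]]
pos_nodes = [[2.0, 0.0], [1.5, 0.5]]

# A negative whose nodes point away from the positive summary is easy to reject.
loss_separated = dgi_loss(pos_nodes, [[-2.0, 0.0], [-1.5, -0.5]], W)

# A negative that looks identical to the positive is maximally confusable.
loss_confusable = dgi_loss(pos_nodes, pos_nodes, W)

print(loss_separated, loss_confusable)
```

In this toy setting the loss is lower when the negative's nodes are well separated from the positive graph summary, which is the behavior the objective is meant to encourage.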
- Title: DRGI: Disentangled Representation Graph Infomax for Video Retrieval
- Authors: Kang, Seong-Min; Lee, Na-Hyun; Park, Ji-Ho; Cho, Yoon-Sik
- Publication date: 2026
- Type: Article
- Journal: IEEE Access
- Volume: 14
- Pages: 26504-26515