DRGI: Disentangled Representation Graph Infomax for Video Retrieval

Kang, Seong-Min; Lee, Na-Hyun; Park, Ji-Ho; Cho, Yoon-Sik

doi:10.1109/ACCESS.2026.3662719

Abstract

Vision-language models pretrained on image-text pairs have demonstrated strong performance in text-to-video retrieval through contrastive learning. However, videos contain much richer temporal and spatial information than their paired captions. Due to this discrepancy, each caption in the training set only corresponds to a subset of frames in its video, which poses a challenge. This challenge is amplified in negative pairs, where negative captions can be partially relevant to the video despite being mismatched. These hard negatives deviate from the conventional unimodal contrastive setting, which requires further attention. To this end, we propose Disentangled Representation Graph Infomax (DRGI), a model-agnostic framework that better exploits hard negatives. DRGI constructs fully connected graphs from disentangled video and text representations, where graph attention captures inter-node dependencies within each modality. We optimize an InfoMax objective between node-level and graph-level representations using Deep Graph Infomax. Hard negatives are treated as semantically corrupted graphs, encouraging the model to separate misleading patterns from true alignments. Extensive experiments on MSR-VTT, LSMDC, MSVD, and ActivityNet demonstrate that DRGI consistently outperforms base models, achieving state-of-the-art performance with up to 2.3% improvement in R@1 on MSR-VTT. More encouragingly, our plug-and-play framework can be seamlessly integrated into existing CLIP-based retrieval models, adding only 0.05% of parameters during training with no additional inference cost. Our code is available at https://github.com/kang7734/_DRGI_
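The InfoMax objective between node-level and graph-level representations mentioned in the abstract can be illustrated with a minimal sketch of the generic Deep Graph Infomax loss. This is not the authors' implementation: the function name `dgi_loss`, the mean-pooling readout, and the bilinear discriminator `weight` are assumptions taken from the standard Deep Graph Infomax formulation, and the graph attention step is omitted. Following the abstract, the corrupted graph is assumed to be built from the node features of a hard negative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dgi_loss(pos_nodes, neg_nodes, weight):
    """Binary cross-entropy form of a Deep Graph Infomax style objective.

    pos_nodes: (N, d) node features of the true graph (hypothetical input)
    neg_nodes: (M, d) node features of the corrupted graph, assumed here
               to come from a hard negative
    weight:    (d, d) bilinear discriminator matrix (assumed learnable)
    """
    # Readout: the graph-level summary is the sigmoid of the mean node feature.
    summary = sigmoid(pos_nodes.mean(axis=0))
    # The bilinear discriminator scores every node against the summary.
    pos_scores = sigmoid(pos_nodes @ weight @ summary)
    neg_scores = sigmoid(neg_nodes @ weight @ summary)
    eps = 1e-12  # guards against log(0)
    # True nodes should score high, corrupted nodes should score low.
    total = np.log(pos_scores + eps).sum() + np.log(1.0 - neg_scores + eps).sum()
    return -total / (len(pos_nodes) + len(neg_nodes))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(4, 8))   # e.g. disentangled factors of a matched pair
    neg = rng.normal(size=(4, 8))   # e.g. factors of a partially relevant caption
    w = rng.normal(size=(8, 8)) * 0.1
    print(float(dgi_loss(pos, neg, w)))
```

Minimizing this loss pushes the discriminator to agree with nodes of the true graph and disagree with nodes of the corrupted one, which is the sense in which hard negatives would be separated from true alignments under these assumptions.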

Keywords

disentangled representation; graph attention network; hard negative sample; text video retrieval

Publication date: 2026

Type: Article

Journal: IEEE Access

Volume: 14

Pages: 26504-26515
