Text-Centric Multimodal Alignment via Dual-Level Optimization
- Hong, Jin;
- Park, JuHyeon;
- Kwon, Junseok
Abstract
Multimodal learning seeks to align representations across modalities. While symmetric contrastive strategies maximize all pairwise similarities equally, we argue that language should serve as the semantic hub in audio-visual-text learning. We propose a text-centric framework that combines instance-level contrastive alignment (InfoNCE) with distribution-level matching (MMD), enforcing precise modality-text correspondence while maintaining flexible audio-visual compatibility. This dual-level optimization leverages the semantic richness of language without collapsing modality-specific structure. Experiments on VGGSound show significant gains over Wav2CLIP in retrieval accuracy and compositional generalization. Our results highlight that integrating instance-level precision with distribution-level coherence overcomes key limitations of purely contrastive methods.
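The abstract does not give the loss formulation, so the following is only a minimal sketch of how an instance-level InfoNCE term and a distribution-level MMD term might be combined. It assumes that InfoNCE is applied to the audio-text and video-text pairs and that MMD (here a biased estimate with an RBF kernel) is applied between the audio and video embeddings. The function names, the temperature, the kernel bandwidth, and `mmd_weight` are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def info_nce(anchors, texts, temperature=0.07):
    """Instance level: anchor i should match text i against all other texts."""
    anchors = [normalize(a) for a in anchors]
    texts = [normalize(t) for t in texts]
    loss = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in texts]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(anchors)


def rbf(u, v, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, bandwidth=1.0):
    """Distribution level: biased estimate of squared MMD with an RBF kernel."""
    def mean_kernel(ps, qs):
        total = sum(rbf(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def text_centric_loss(audio, video, text, mmd_weight=1.0):
    # Assumption: text is the hub for the contrastive terms, while audio and
    # video are only matched as distributions, not instance by instance.
    instance_term = info_nce(audio, text) + info_nce(video, text)
    distribution_term = mmd2(audio, video)
    return instance_term + mmd_weight * distribution_term
```

Under these assumptions, the contrastive terms pull each audio or video embedding toward its own caption, while the MMD term only asks the audio and video embedding clouds to overlap, which is one way to read "flexible audio-visual compatibility".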
- Title: Text-Centric Multimodal Alignment via Dual-Level Optimization
- Authors: Hong, Jin; Park, JuHyeon; Kwon, Junseok
- Publication date: 2026
- Type: Conference Paper
- Journal name: International Conference on Information Networking
- Pages: 971-974