Text-Centric Multimodal Alignment via Dual-Level Optimization

Abstract

Multimodal learning seeks to align representations across modalities. Symmetric contrastive strategies maximize all pairwise similarities equally; we argue instead that language should serve as the semantic hub in audio-visual-text learning. We propose a text-centric framework that combines instance-level contrastive alignment (InfoNCE) with distribution-level matching (MMD), enforcing precise modality-text correspondence while maintaining flexible audio-visual compatibility. This dual-level optimization leverages the semantic richness of language without collapsing modality-specific structure. Experiments on VGGSound show significant gains over Wav2CLIP in retrieval accuracy and compositional generalization. Our results highlight that integrating instance-level precision with distribution-level coherence overcomes key limitations of purely contrastive methods.
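
The abstract names the two loss terms but gives no formulas, so the following is only a minimal NumPy sketch of how an InfoNCE term and an MMD term might be combined. It assumes (this is a reading of the abstract, not the authors' stated formulation) that InfoNCE is applied to the audio-text and video-text pairs, and that MMD with a Gaussian kernel is applied between the audio and video embedding distributions. The function names, the temperature, the kernel bandwidth `sigma`, and the weight `lam` are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project each row onto the unit sphere.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE for row-aligned batches a, b of shape [n, d]."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # cosine similarities, scaled

    def cross_entropy_diag(l):
        # Row i treats column i as the positive, all other columns as negatives.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def rbf_mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD with a Gaussian (RBF) kernel."""
    def kernel(p, q):
        d2 = np.sum(p**2, axis=1)[:, None] + np.sum(q**2, axis=1)[None, :] - 2 * p @ q.T
        return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def dual_level_loss(text, audio, video, lam=1.0):
    # Assumption: text is the hub, so only modality-text pairs get InfoNCE.
    instance = info_nce(audio, text) + info_nce(video, text)
    # Assumption: audio and video are matched only at the distribution level.
    distribution = rbf_mmd2(l2_normalize(audio), l2_normalize(video))
    return instance + lam * distribution
```

In this sketch the audio and video embeddings are never contrasted sample by sample; they are only pulled toward similar overall distributions, which is one way to read "flexible audio-visual compatibility".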

Keywords

audio-visual; language models; contrastive learning; distribution matching; multimodal alignment; text-centric learning
Title
Text-Centric Multimodal Alignment via Dual-Level Optimization
Authors
Hong, Jin; Park, JuHyeon; Kwon, Junseok
DOI
10.1109/ICOIN68469.2026.11480517
Publication date
2026
Type
Conference Paper
Journal
International Conference on Information Networking
Pages
971-974