AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models
Citations: Web of Science 0 · Scopus 0
Abstract

We propose a modular framework for medical image captioning that integrates domain-adapted visual encoders, token-efficient representation via query-based compression, and post-hoc refinement. The architecture employs an ensemble of general-purpose and domain-specific vision encoders (SigLIP2 and BioMedCLIP), a Q-Former for dense concept-aware tokenization, and a LoRA-tuned Bio-Medical LLaMA-3 decoder. Auxiliary objectives guide the model to jointly predict UMLS concepts and semantic types, improving semantic grounding. At inference, captions from six independently trained variants are reranked using three complementary strategies—BioMedCLIP similarity, BLEURT scoring, and BioBERT-based centroid alignment. Evaluations on the ImageCLEF2025 Caption Prediction Task demonstrate consistent gains in semantic relevance and clinical factuality over single-encoder and non-multitask baselines. Our approach (team: AI Stat Lab, ID #1900) achieved third place with an overall score of 0.3229, corresponding to relevance and factuality scores of 0.5089 and 0.1369, respectively.
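The BioBERT-based centroid-alignment reranking described above can be illustrated with a minimal sketch: candidate captions from the model variants are embedded, and the caption closest to the centroid of all candidate embeddings is selected. The function name and toy 3-D vectors below are hypothetical stand-ins; actual use would embed each caption with a BioBERT sentence encoder.

```python
import numpy as np

def rerank_by_centroid(embeddings, captions):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Centroid of all candidate embeddings (the "consensus" direction).
    centroid = emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    # Score each caption by cosine similarity to the centroid; keep the best.
    scores = emb @ centroid
    best = int(np.argmax(scores))
    return captions[best], scores

captions = ["caption A", "caption B", "caption C"]
# Toy vectors standing in for BioBERT sentence embeddings of six-variant outputs.
embs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.0]]
best, scores = rerank_by_centroid(embs, captions)
```

Because the first two toy vectors point in a similar direction, the centroid favors them and the outlier ("caption C") scores lowest, which is the intended effect: the consensus caption among the ensemble's outputs wins.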

Keywords

Caption reranking; Dual Encoder; GPT summarization; Medical image captioning; UMLS concepts; Vision-language model
Title
AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models
Authors
Lee, Yunseo; Kim, Hyun Jun; Shin, Heeseung; Lim, Changwon
Publication Date
2025
Type
Conference Paper
Journal
CEUR Workshop Proceedings, Vol. 4038
Pages
2511–2523