AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models
Citations: Web of Science 0 · Scopus 0
Abstract

We propose a modular framework for medical image captioning that integrates domain-adapted visual encoders, token-efficient representation via query-based compression, and post-hoc refinement. The architecture employs an ensemble of general-purpose and domain-specific vision encoders (SigLIP2 and BioMedCLIP), a Q-Former for dense concept-aware tokenization, and a LoRA-tuned Bio-Medical LLaMA-3 decoder. Auxiliary objectives guide the model to jointly predict UMLS concepts and semantic types, improving semantic grounding. At inference, captions from six independently trained variants are reranked using three complementary strategies—BioMedCLIP similarity, BLEURT scoring, and BioBERT-based centroid alignment. Evaluations on the ImageCLEF2025 Caption Prediction Task demonstrate consistent gains in semantic relevance and clinical factuality over single-encoder and non-multitask baselines. Our approach (team: AI Stat Lab, ID #1900) achieved third place with an overall score of 0.3229, corresponding to relevance and factuality scores of 0.5089 and 0.1369, respectively.
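The BioBERT-based centroid-alignment reranking described above can be illustrated with a minimal sketch: candidate captions from the model variants are embedded, and the caption closest to the centroid of all candidate embeddings is selected. The function name and toy 3-D vectors below are hypothetical stand-ins; actual use would embed each caption with a BioBERT sentence encoder.

```python
import numpy as np

def rerank_by_centroid(embeddings, captions):
    # Normalize embeddings to unit length so dot products are cosine similarities.
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Centroid of all candidate embeddings (the "consensus" direction).
    centroid = emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    # Score each caption by cosine similarity to the centroid; keep the best.
    scores = emb @ centroid
    best = int(np.argmax(scores))
    return captions[best], scores

captions = ["caption A", "caption B", "caption C"]
# Toy vectors standing in for BioBERT sentence embeddings of six-variant outputs.
embs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.0]]
best, scores = rerank_by_centroid(embs, captions)
```

Because the first two toy vectors point in a similar direction, the centroid favors them and the outlier ("caption C") scores lowest, which is the intended effect: the consensus caption among the ensemble's outputs wins.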

Keywords

Caption reranking; Dual Encoder; GPT summarization; Medical image captioning; UMLS concepts; Vision-language model
Title
AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models
Authors
Lee, Yunseo; Kim, Hyun Jun; Shin, Heeseung; Lim, Changwon
Publication Date
2025
Type
Conference Paper
Journal
CEUR Workshop Proceedings, Vol. 4038
Pages
2511–2523