Detailed View
- Lee, Yunseo;
- Kim, Hyun Jun;
- Shin, Heeseung;
- Lim, Changwon
Abstract
We propose a modular framework for medical image captioning that integrates domain-adapted visual encoders, token-efficient representation via query-based compression, and post-hoc refinement. The architecture employs an ensemble of general-purpose and domain-specific vision encoders (SigLIP2 and BioMedCLIP), a Q-Former for dense concept-aware tokenization, and a LoRA-tuned Bio-Medical LLaMA-3 decoder. Auxiliary objectives guide the model to jointly predict UMLS concepts and semantic types, improving semantic grounding. At inference, captions from six independently trained variants are reranked using three complementary strategies—BioMedCLIP similarity, BLEURT scoring, and BioBERT-based centroid alignment. Evaluations on the ImageCLEF2025 Caption Prediction Task demonstrate consistent gains in semantic relevance and clinical factuality over single-encoder and non-multitask baselines. Our approach (team: AI Stat Lab, ID #1900) achieved third place with an overall score of 0.3229, corresponding to relevance and factuality scores of 0.5089 and 0.1369, respectively.
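The BioBERT-based centroid alignment mentioned in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: `centroid_rerank` and the toy 2-D vectors are assumptions, standing in for BioBERT sentence embeddings of the six candidate captions; the candidate whose embedding lies closest to the consensus centroid is ranked first.

```python
import numpy as np

def centroid_rerank(embeddings):
    """Rank candidate caption embeddings by cosine similarity
    to the centroid of the candidate set (consensus reranking)."""
    emb = np.asarray(embeddings, dtype=float)
    # Normalize each candidate embedding to unit length.
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    # Centroid of the normalized candidates, itself normalized.
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Cosine similarity of each candidate to the centroid.
    scores = unit @ centroid
    # Indices ordered from most to least centroid-aligned.
    order = np.argsort(-scores)
    return order, scores
```

With embeddings for two near-agreeing captions and one outlier, the outlier is ranked last; in the paper's setting the inputs would be the six variants' captions embedded with BioBERT.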
Keywords
- Title
- AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models
- Authors
- Lee, Yunseo; Kim, Hyun Jun; Shin, Heeseung; Lim, Changwon
- Publication Date
- 2025
- Type
- Conference Paper
- Journal
- CEUR Workshop Proceedings
- Volume
- 4038
- Pages
- 2511–2523