KRongBERT: Enhanced factorization-based morphological approach for the Korean pretrained language model

Abstract

The bidirectional encoder representations from transformers (BERT) model has achieved remarkable success in various natural language processing tasks for Latin-based languages. Korean, however, presents distinct challenges: limited data resources and complex linguistic structures. In this paper, we present KRongBERT, a language model designed through a morphological approach to address the linguistic complexities of Korean. KRongBERT mitigates the out-of-vocabulary issues that arise with byte-pair-encoding tokenizers in Korean and incorporates language-specific embedding layers to enhance understanding. Our model demonstrates up to a 1.56% improvement in performance on specific natural language understanding tasks compared with traditional BERT implementations. Notably, KRongBERT outperforms existing state-of-the-art Korean BERT models while using only 11.42% of the data those models require. Our experiments demonstrate that KRongBERT handles the complexities of the Korean language efficiently. The code is publicly available at https://github.com/Splo2t/KRongBERT. © 2025 The Authors

Keywords

BERT; Korean language; Korean pretrained language model; Natural language processing; Tokenization
Title
KRongBERT: Enhanced factorization-based morphological approach for the Korean pretrained language model
Authors
Yu, Hyunwook; Cho, Yejin; Park, Geunchul; Kim, Mucheol
DOI
10.1016/j.ipm.2025.104072
Publication date
2025-05
Type
Article
Journal
Information Processing and Management, vol. 62, no. 3
