KRongBERT: Enhanced factorization-based morphological approach for the Korean pretrained language model

Yu, Hyunwook; Cho, Yejin; Park, Geunchul; Kim, Mucheol

doi:10.1016/j.ipm.2025.104072

Abstract

The bidirectional encoder representations from transformers (BERT) model has achieved remarkable success in various natural language processing tasks for Latin-based languages. However, the Korean language presents unique challenges, with limited data resources and complex linguistic structures. In this paper, we present KRongBERT, a language model designed through a morphological approach to address the linguistic complexities of Korean. KRongBERT mitigates the out-of-vocabulary issues that arise with byte-pair-encoding tokenizers in Korean and incorporates language-specific embedding layers to enhance understanding. Our model demonstrates an improvement of up to 1.56% on specific natural language understanding tasks compared with traditional BERT implementations. Notably, KRongBERT achieves superior performance compared with existing state-of-the-art Korean BERT models while using only 11.42% of the data required by other models. Our experiments demonstrate that KRongBERT efficiently handles the complexities of the Korean language, outperforming current state-of-the-art approaches. The code is publicly available at https://github.com/Splo2t/KRongBERT. © 2025 The Authors

Keywords

BERT; Korean language; Korean pretrained language model; Natural language processing; Tokenization

Publication date: 2025-05

Type: Article

Journal: Information Processing and Management

Volume: 62

Issue: 3
