- Yu, Hyunwook;
- Cho, Yejin;
- Park, Geunchul;
- Kim, Mucheol
Web of Science citations: 0
Scopus citations: 0
Abstract
The bidirectional encoder representations from transformers (BERT) model has achieved remarkable success in various natural language processing tasks for Latin-based languages. However, the Korean language presents unique challenges, with limited data resources and complex linguistic structures. In this paper, we present KRongBERT, a language model designed through a morphological approach to address the linguistic complexities of Korean. KRongBERT mitigates the out-of-vocabulary issues that arise with byte-pair-encoding tokenizers in Korean and incorporates language-specific embedding layers to enhance understanding. Our model demonstrates up to a 1.56% improvement in performance on specific natural language understanding tasks compared with traditional BERT implementations. Notably, KRongBERT achieves superior performance compared with existing state-of-the-art Korean BERT models while using only 11.42% of the data required by other models. Our experiments demonstrate that KRongBERT efficiently handles the complexities of the Korean language, outperforming current state-of-the-art approaches. The code is publicly available at https://github.com/Splo2t/KRongBERT. © 2025 The Authors
- Title
- KRongBERT: Enhanced factorization-based morphological approach for the Korean pretrained language model
- Authors
- Yu, Hyunwook; Cho, Yejin; Park, Geunchul; Kim, Mucheol
- Publication date
- 2025-05
- Type
- Article
- Volume
- 62
- Issue
- 3