Effective Token Masking Augmentation Using Term-Document Frequency for Language Model-Based Legal Case Classification

Park, Ye-Chan; Zulkifley, Mohd Asyraf; Sohn, Bong-Soo; Lee, Jaesung

doi:10.32604/cmc.2025.074141

Abstract

Legal case classification involves the categorization of legal documents into predefined categories, which facilitates legal information retrieval and case management. However, real-world legal datasets often suffer from class imbalance due to the uneven distribution of case types across legal domains. This leads to biased model performance: high accuracy for overrepresented categories and underperformance for minority classes. To address this issue, we propose a data augmentation method that selectively masks unimportant terms within a document while preserving terms that are key from the perspective of the legal domain. This approach enhances data diversity and improves the generalization capability of conventional models. Our experiments demonstrate that the proposed augmentation strategy consistently improves accuracy and F1 score across all models, validating its effectiveness in legal case classification.
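
The masking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give their scoring formula, so the example assumes a smoothed inverse document frequency as the importance score, a `[MASK]` placeholder token, whitespace tokenization, and a hypothetical `mask_ratio` parameter. The function names and the toy corpus are likewise invented for illustration.

```python
import math
from collections import Counter

MASK = "[MASK]"  # assumed placeholder; the paper's actual mask token is not stated in this record


def document_frequency(corpus):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    return df


def mask_low_importance(tokens, df, n_docs, mask_ratio=0.2):
    """Mask the least important tokens, scored by smoothed IDF (an assumed stand-in)."""
    # Terms that occur in many documents receive a low score and are masked first.
    score = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in set(tokens)}
    n_mask = int(len(tokens) * mask_ratio)
    # Rank token positions by ascending score; ties are broken by position.
    order = sorted(range(len(tokens)), key=lambda i: (score[tokens[i]], i))
    to_mask = set(order[:n_mask])
    return [MASK if i in to_mask else tok for i, tok in enumerate(tokens)]


# Toy corpus, invented for illustration only.
corpus = [
    "the court finds the defendant liable for breach of contract".split(),
    "the court dismisses the appeal for lack of jurisdiction".split(),
    "the tenant claims the landlord violated the lease".split(),
]
df = document_frequency(corpus)
augmented = mask_low_importance(corpus[0], df, n_docs=len(corpus))
print(" ".join(augmented))
```

In this toy corpus the word "the" appears in every document, so it receives the lowest score and is masked first, while rarer terms such as "defendant" and "contract" are preserved. For class imbalance, such masked copies would typically be generated only for minority-class documents and added to the training set with the original label; that usage is an assumption here, not a detail taken from the record.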

Keywords

Legal case classification; class imbalance; data augmentation; token masking; legal NLP

Publication date: 2026

Type: Article

Journal: Computers, Materials and Continua

Volume: 87

Issue: 1
