Performance of an Artificial Intelligence–Based Software for Automated Kellgren-Lawrence Grading of Knee Osteoarthritis: A Multicenter Cohort Study

Choi, Byung Sun; Hong, Sung Hwan; Lee, Han-Jun; Kim, Seong Hwan

doi:10.1016/j.arth.2026.01.078

Abstract

Background: The Kellgren-Lawrence (KL) grading system is the standard for assessing knee osteoarthritis (OA) severity, but it is limited by substantial observer variability. Artificial intelligence (AI) may standardize grading, yet external validation is limited. This study evaluated the diagnostic performance of AI-based software on a large, independent, multicenter clinical dataset.

Methods: This multicenter, retrospective, pivotal study included 2,546 knee radiographs from 1,273 patients across two tertiary hospitals in Korea. A reference standard was established by an expert consensus panel, with KL grades 0 and 1 consolidated into a single KL ≤ 1 category. The AI software was trained exclusively on public United States datasets (the Osteoarthritis Initiative and the Multicenter Osteoarthritis Study) and validated on this separate Korean dataset. The primary outcomes were grade-specific sensitivity and specificity for four categories (KL ≤ 1, 2, 3, and 4). The secondary outcomes were accuracy, the area under the receiver operating characteristic curve (AUC), and binary diagnostic performance for radiographic OA (KL ≥ 2).

Results: The AI met all prespecified noninferiority endpoints. For KL ≤ 1, sensitivity was 90.5% (95% confidence interval (CI), 87.9 to 92.8) and specificity was 96.6% (95% CI, 95.0 to 97.9). For KL grade 4, sensitivity was 97.7% (95% CI, 97.5 to 99.1) and specificity was 98.4% (95% CI, 97.5 to 99.1). For KL grade 2, sensitivity was 77.2% and specificity was 95.3%. In the binary classification of radiographic OA, the AI achieved an AUC of 0.94 (95% CI, 0.92 to 0.96), sensitivity of 96.6%, specificity of 90.5%, and accuracy of 94.2%.

Conclusions: In a large-scale, multicenter external validation using a dataset entirely independent of its training data, the AI-based software demonstrated high and robust diagnostic performance for KL grading. These findings support the software's potential for clinical integration to improve the consistency, objectivity, and efficiency of knee OA severity assessment.

Level of Evidence: Level III.
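The grade-specific outcomes described in the abstract can be illustrated with a minimal sketch. It assumes that each category is scored one-vs-rest against the reference standard after KL grades 0 and 1 are merged; the abstract does not state the exact computation, so this is an illustration and not the study's analysis code. The function names and the ten toy grades below are hypothetical and are not study data.

```python
from typing import List, Sequence, Tuple

CATEGORIES = ["KL<=1", "KL2", "KL3", "KL4"]


def merge_grade(kl: int) -> str:
    """Map a raw KL grade (0-4) to the four-category scheme, merging 0 and 1."""
    return "KL<=1" if kl <= 1 else f"KL{kl}"


def one_vs_rest(reference: Sequence, predicted: Sequence, target) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for `target` versus all other labels."""
    tp = fn = fp = tn = 0
    for ref, pred in zip(reference, predicted):
        if ref == target:
            if pred == target:
                tp += 1
            else:
                fn += 1
        elif pred == target:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def binary_oa(reference_raw: List[int], predicted_raw: List[int]) -> Tuple[float, float]:
    """Sensitivity and specificity for radiographic OA, defined as KL >= 2."""
    ref = [kl >= 2 for kl in reference_raw]
    pred = [kl >= 2 for kl in predicted_raw]
    return one_vs_rest(ref, pred, True)


# Hypothetical toy grades (NOT study data): expert consensus vs. software output.
reference_raw = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
predicted_raw = [1, 0, 2, 2, 1, 2, 3, 4, 4, 4]

reference = [merge_grade(kl) for kl in reference_raw]
predicted = [merge_grade(kl) for kl in predicted_raw]

per_grade = {c: one_vs_rest(reference, predicted, c) for c in CATEGORIES}
oa_sens, oa_spec = binary_oa(reference_raw, predicted_raw)

for category, (sens, spec) in per_grade.items():
    print(f"{category}: sensitivity={sens:.3f}, specificity={spec:.3f}")
print(f"KL>=2: sensitivity={oa_sens:.3f}, specificity={oa_spec:.3f}")
```

Under this one-vs-rest reading, the KL ≤ 1 category and the binary KL ≥ 2 classification are complements of each other, so their sensitivity and specificity swap. The reported figures show the same pattern: 90.5% and 96.6% for KL ≤ 1 versus 96.6% and 90.5% for radiographic OA.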

Keywords

Artificial Intelligence; Diagnosis; Kellgren-Lawrence; Knee Osteoarthritis; Software

Publication date: 2026-02

Type: Journal Article

Journal: Journal of Arthroplasty