PURPOSE: Predicting local failure (LF), distant intracranial failure (DIF), and radiation necrosis (RN) after stereotactic radiosurgery (SRS) requires consideration of numerous variables. The study objective was to evaluate how accurately machine-learning (ML) algorithms predict LF, DIF, or RN using relevant patient, disease-specific, and treatment-related factors.
METHODS: Patients with small brain metastases (≤2 cm max dimension) treated with single-fraction SRS (20-24 Gy) between 2017 and 2021 were included. Key variables used were age, gender, race, Karnofsky Performance Status (KPS), pathology, # of lesions, Paddick conformity index (PCI), target max dose, max dimension (cm), and post-SRS systemic therapy (immunotherapy, targeted therapy, or chemotherapy). The Python ML library scikit-learn was used to evaluate ML algorithms including logistic regression (LR), support vector machines (SVM), and random forest (RF) to predict the risk of LF, DIF, and RN (each independently). Data were split 80%/20% into training and test sets, and stratified 5-fold internal cross-validation with grid search was used to identify optimal hyperparameters for each metric, with models evaluated at the lesion level. Jaccard index (JI), F1-score (F1), and accuracy (acc) were measured, averaged across the folds for the training set and computed once for the test set.
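The workflow described in METHODS can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the study's code: the data are synthetic (generated with `make_classification`), the hyperparameter grid is hypothetical, only the RF model is shown, each row is treated as an independent lesion, and refitting on F1 is just one of the per-metric choices the abstract mentions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, jaccard_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for lesion-level data (NOT the study data).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified 80%/20% training/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 5-fold cross-validation on the training set only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hypothetical grid; the abstract does not report the values searched.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
scoring = {"acc": "accuracy", "f1": "f1", "jaccard": "jaccard"}

# Multi-metric grid search; refit="f1" selects and refits the best-F1 model.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=scoring,
    refit="f1",
    cv=cv,
)
search.fit(X_train, y_train)

# Fold-averaged validation scores for the selected hyperparameters.
best = search.best_index_
cv_scores = {name: search.cv_results_[f"mean_test_{name}"][best] for name in scoring}

# Single evaluation on the held-out test set.
y_pred = search.predict(X_test)
test_scores = {
    "acc": accuracy_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "jaccard": jaccard_score(y_test, y_pred),
}

print("CV (mean over folds):", cv_scores)
print("Test set:", test_scores)
```

Swapping `RandomForestClassifier` for `LogisticRegression` or `SVC` (with an appropriate grid) and changing `refit` to another key of `scoring` would cover the other model and metric combinations in the same way.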
RESULTS: Data from 1566 brain metastases in 235 patients were included. Median age was 64 years (range: 18-90) and 64% were female. Median KPS was 90 (range: 50-100) and the median # of lesions per patient was 9 (range: 1-28). The RF model achieved the highest performance for DIF [acc=0.91, F1=0.94, JI=0.86] and LF [acc=0.94, F1=0.61, JI=0.43]. For RN, all models achieved similar acc (0.96) but low F1 and JI (F1=0.13-0.24; JI=0.07-0.13), indicating that the models struggled to identify the minority class (RN+) despite high acc (training: 45 RN+/1207 RN-; testing: 11 RN+/303 RN-). The top 5 features by importance from the RF model optimized for F1 were # of lesions, age, target max dose, PCI, and KPS for DIF; for LF, the top features were # of lesions, target max dose, prescription isodose, and GTV size.
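The gap between acc and F1/JI for RN follows from the class counts reported above. As a worked illustration only (the `binary_metrics` helper and the all-negative classifier are hypothetical and not part of the study), a model that labels every test lesion RN- already reaches an acc of about 0.96 on 11 RN+/303 RN- while scoring zero on F1 and JI:

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, F1, Jaccard index) for the positive class."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    ji_den = tp + fp + fn
    ji = tp / ji_den if ji_den else 0.0
    return acc, f1, ji


# Test-set class counts reported in RESULTS: 11 RN+ and 303 RN-.
n_pos, n_neg = 11, 303

# Hypothetical baseline that predicts RN- for every lesion:
# no true or false positives, all RN+ lesions are missed.
acc, f1, ji = binary_metrics(tp=0, fp=0, fn=n_pos, tn=n_neg)
print(f"acc={acc:.3f} F1={f1:.3f} JI={ji:.3f}")
# prints "acc=0.965 F1=0.000 JI=0.000"
```

Because acc counts the 303 correctly labeled RN- lesions, it stays high regardless of how many RN+ lesions are found, whereas F1 and JI depend only on true positives, false positives, and false negatives.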
CONCLUSION: ML models classified DIF best among the three outcomes in lesions treated with SRS and contemporary systemic therapies, largely due to the greater number of DIF events. The limited number of LF and RN events led to high acc but lower F1, underscoring the need for larger, more balanced datasets to improve predictive performance in future clinical applications.