Prediction of suicidal behaviors in the middle-aged population: Machine learning analyses of UK Biobank
Wang, J., Qiu, J., Zhu, T., Zeng, Y., Yang, H., Shang, Y., ... & Song, H.
Background: Suicidal behaviors, including suicide deaths and attempts, are major public health concerns. However, previous suicide prediction models required a large number of input features, which limited their applicability in clinical practice.
Objective: We aimed to construct applicable models (ie, with limited features) for short- and long-term suicidal behavior prediction. We further validated these models among individuals with different genetic risks of suicide.
Methods: Based on the prospective cohort of UK Biobank, we included 223 (0.06%) eligible cases of suicide attempt or death within 1 year from baseline, identified from hospital inpatient or death register data, and randomly selected 4460 (1.18%) controls (1:20) without such records. We similarly identified 833 (0.22%) cases of suicidal behaviors 1 to 6 years from baseline and 16,660 (4.42%) corresponding controls. Using 143 input features, mainly comprising sociodemographic, environmental, and psychosocial factors; medical history; and polygenic risk scores (PRS) for suicidality, we applied a bagged balanced light gradient-boosting machine (LightGBM) with stratified 10-fold cross-validation and grid search to construct the full prediction models for suicide attempts or deaths within 1 year or between 1 and 6 years. The Shapley Additive Explanations (SHAP) approach was used to quantify the importance of input features, and the 20 features with the highest SHAP values were selected to train the applicable models. The external validity of the established models was assessed among 50,310 individuals who participated in UK Biobank repeated assessments, both overall and by level of PRS for suicidality.
Results: Individuals with suicidal behaviors were on average 56 years old, with an equal sex distribution. Applied to the external validation data set, the full models performed well, with areas under the receiver operating characteristic curve (AUROCs) of 0.919 and 0.892 for prediction within 1 year and between 1 and 6 years, respectively. Importantly, the applicable models with the top 20 most important features showed externally validated performance comparable to that of the full models (AUROCs of 0.901 and 0.885). Using the applicable models, we found that individuals in the top quintile of predicted risk accounted for 91.7% (n=11) and 80.7% (n=25) of all suicidality cases within 1 year and during 1 to 6 years, respectively. We further obtained comparable prediction accuracy when applying these models to subpopulations with different genetic susceptibilities to suicidality. For example, for the 1-year risk prediction, the AUROCs were 0.907 and 0.885 for the high (>2nd tertile of PRS) and low (<1st tertile of PRS) genetic susceptibility groups, respectively.
Conclusions: We established applicable machine learning–based models for predicting both the short- and long-term risk of suicidality with high accuracy across populations of varying genetic risk for suicide, highlighting a cost-effective method of identifying individuals with a high risk of suicidality.
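The two headline metrics reported in the Results, the AUROC and the share of cases falling in the top quintile of predicted risk, can be illustrated with a minimal sketch. It uses only the Python standard library and toy data. The function names `auroc` and `top_fraction_capture` and the example labels and scores are hypothetical, not the study's code or data, and the AUROC is computed here through the rank-sum (Mann-Whitney) identity, which is an assumption about implementation rather than the authors' stated method.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the rank-sum (Mann-Whitney U) identity, with tie handling."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one case and one control")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def top_fraction_capture(
    labels: Sequence[int], scores: Sequence[float], fraction: float = 0.2
) -> float:
    """Share of all cases found in the top `fraction` of predicted risk."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("No cases to capture")
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:n_top]) / n_pos


if __name__ == "__main__":
    # Toy example: 10 individuals, 2 cases (label 1), 8 controls (label 0).
    toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
    toy_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.90, 0.80]
    print(auroc(toy_labels, toy_scores))                 # 0.9375
    print(top_fraction_capture(toy_labels, toy_scores))  # 0.5
```

In the toy data, 15 of the 16 case-control pairs are ranked correctly, which gives an AUROC of 0.9375, and one of the two cases falls in the top quintile of predicted risk, which gives a capture of 0.5.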