Abstract
Evaluation of alignment between large language models and expert clinicians in suicide risk assessment.
McBain, R., Cantor, J., Zhang, L., Baker, O., Zhang, F., Burnett, A., Kofner, A., Breslau, J., Stein, B., Mehrotra, A. & Yu, H.
Objective: This study aimed to evaluate whether three popular
chatbots powered by large language models (LLMs)—
ChatGPT, Claude, and Gemini—provided direct responses
to suicide-related queries and how these responses aligned
with clinician-determined risk levels for each question.
Methods: Thirteen clinical experts categorized 30 hypothetical
suicide-related queries into five levels of self-harm
risk: very high, high, medium, low, and very low. Each
LLM-based chatbot responded to each query 100 times
(N=9,000 total responses). Responses were coded as “direct”
(answering the query) or “indirect” (e.g., declining to
answer or referring to a hotline). Mixed-effects logistic regression
was used to assess the relationship between question
risk level and the likelihood of a direct response.
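For illustration only (not from the article): the analysis described above can be approximated as a mixed-effects logistic regression of the binary direct/indirect outcome on question risk level, with a random effect grouping the repeated responses to each question. The sketch below assumes a hypothetical long-format file responses.csv with columns direct (0/1), risk_level, model, and question_id, and uses statsmodels' Bayesian binomial mixed GLM as a stand-in, since the abstract does not specify the software or exact model specification used.

```python
# Illustrative sketch only; not the authors' analysis code.
# Assumes a hypothetical long-format responses.csv with one row per chatbot
# response and columns: direct (0/1), risk_level, model, question_id.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

responses = pd.read_csv("responses.csv")

# Order risk levels so very-low risk is the reference category; fixed-effect
# coefficients are then log-odds of a direct response relative to
# very-low-risk queries.
responses["risk_level"] = pd.Categorical(
    responses["risk_level"],
    categories=["very_low", "low", "medium", "high", "very_high"],
)

# Fixed effects: question risk level and chatbot (ChatGPT is the reference
# level under default alphabetical ordering); a random-intercept variance
# component captures clustering of the 100 responses per question.
glmm = BinomialBayesMixedGLM.from_formula(
    "direct ~ risk_level + model",
    vc_formulas={"question": "0 + C(question_id)"},
    data=responses,
)
result = glmm.fit_vb()  # variational Bayes fit; fit_map() is an alternative
print(result.summary())
```

Exponentiating the fixed-effect coefficients for the chatbot and risk-level terms gives adjusted odds ratios comparable in form to those reported in the Results.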
Results: ChatGPT and Claude provided direct responses to
very-low-risk queries 100% of the time, and none of the three
chatbots provided a direct response to any very-high-risk
query. The LLM-based chatbots did not meaningfully distinguish
intermediate risk levels: compared with very-low-risk
queries, the odds of a direct response did not differ significantly
for low-risk, medium-risk, or high-risk queries.
Across models, Claude was more likely (adjusted odds
ratio [AOR]=2.01, 95% CI=1.71–2.37, p<0.001) and Gemini
less likely (AOR=0.09, 95% CI=0.08–0.11, p<0.001) than
ChatGPT to provide direct responses.
Conclusions: The LLM-based chatbots' responses aligned with
expert judgment about whether to respond directly to queries
at the extremes of suicide risk (very low and very high), but
the chatbots were inconsistent in addressing intermediate-risk
queries, underscoring the need to further refine LLMs.