Search Articles

Search Results (1 to 10 of 2506 Results)

Assessing the Accuracy and Reliability of Large Language Models in Psychiatry Using Standardized Multiple-Choice Questions: Cross-Sectional Study

For example, a recent study found that GPT-4’s clinical scenario responses are influenced by societal biases, causing it to recommend erroneous diagnoses and management plans based on factors such as race and gender [15]. Other studies have consistently shown that LLMs may misinterpret specialized terminology (eg, “egosyntonic”) within domain-specific text [16,17].

Kaitlin Hanss, Karthik V Sarma, Anne L Glowinski, Andrew Krystal, Ramotse Saunders, Andrew Halls, Sasha Gorrell, Erin Reilly

J Med Internet Res 2025;27:e69910

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Our study corroborates GPT-4’s strong performance, particularly in psychiatry, where GPT-4o achieved 84.4% accuracy. However, our findings suggest that more cautious interpretation is needed, given the high confidence levels the models reported for incorrect answers. Xiong et al’s [17] work on LLM confidence elicitation aligns with our observations of overconfidence.

Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang

JMIR Med Inform 2025;13:e66917