Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

Our study corroborates GPT-4’s strong performance, particularly in psychiatry, where GPT-4o achieved 84.4% accuracy. However, our findings suggest that more cautious interpretation is needed, given the high confidence levels observed for incorrect answers. Xiong et al’s [17] work on LLM confidence elicitation aligns with our observations of overconfidence.

Mahmud Omar, Reem Agbareia, Benjamin S Glicksberg, Girish N Nadkarni, Eyal Klang

JMIR Med Inform 2025;13:e66917