Evaluating the Precision and Dependability of Medical Answers Generated by ChatGPT

*Zain Abidin
Cooper Medical School, Rowan University, USA

*Corresponding address: Cooper Medical School, Rowan University, USA
Email: zainwildelake@gmail.com

Keywords: Artificial Intelligence; Assessment; Decision Making; Healthcare



This study assesses the accuracy and completeness of ChatGPT’s responses to medical questions posed by physicians, providing preliminary evidence of its reliability as a source of precise and comprehensive information. It also examines the limitations inherent in AI-generated medical advice.


Of 35 physicians invited, 10 (approximately 29%) participated, each formulating eight questions for ChatGPT without patient-specific data, for a total of 80 questions. The questions spanned easy, medium, and hard difficulty levels and called for either yes/no or descriptive answers. Physicians rated ChatGPT’s responses for accuracy and completeness using established Likert scales. For internal validation, questions whose answers received low accuracy scores were re-submitted, and statistical measures were used to analyze response consistency and variation over time.


The analysis of 80 ChatGPT-generated answers revealed a median accuracy score of 4 (mean 4.7, SD 2.6) and a median completeness score of 2 (mean 1.8, SD 1.5). Notably, 30% of responses achieved the highest accuracy score (6), 38.7% were rated nearly all correct (5), and 8% were deemed completely incorrect (1). Inaccurate answers were more common for questions that physicians rated as hard. Completeness varied: 45% of answers were considered comprehensive, 37.5% adequate, and 17.5% incomplete. A modest correlation (Spearman’s r = 0.3) existed between accuracy and completeness across all questions.
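As an illustration of the kind of summary statistics reported above, the following minimal sketch computes the median, mean, and sample standard deviation of two sets of Likert ratings, together with Spearman's rank correlation between them. The ratings are hypothetical placeholders, not the study's data, and the function names (`average_ranks`, `pearson`, `spearman`) are assumptions introduced for this example; the study's actual analysis software is not stated.

```python
from statistics import mean, median, stdev


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for illustration only (NOT the study's data).
accuracy = [6, 5, 5, 4, 6, 1, 5, 3]      # assumed scale: 1 (completely incorrect) to 6
completeness = [3, 2, 3, 2, 3, 1, 2, 1]  # assumed scale: 1 (incomplete) to 3

for name, scores in (("accuracy", accuracy), ("completeness", completeness)):
    print(f"{name}: median={median(scores)}, "
          f"mean={mean(scores):.2f}, SD={stdev(scores):.2f}")
print(f"Spearman correlation: {spearman(accuracy, completeness):.2f}")
```

Because Likert ratings are ordinal and heavily tied, the rank-based Spearman coefficient with tie-averaged ranks is the natural choice here; a Pearson correlation on the raw scores would treat the gaps between rating levels as equal.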


Integrating language models such as ChatGPT into medical practice shows promise, but caution is essential for safe use. While AI-generated responses display commendable accuracy and completeness, ongoing refinement is needed to make them reliable. This research lays a foundation for AI integration in healthcare and underscores the importance of continuous evaluation and regulatory measures to ensure safe and effective implementation.

How to Cite this: Abidin Z. Evaluating the Precision and Dependability of Medical Answers Generated by ChatGPT. J Sci Technol Educ Art Med. 2024;1(1):13-20

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.