ISSN: 2630-5720 | E-ISSN: 2687-346X
Evaluating AI in Psychiatry Board Exams: A Comparative Study of ChatGPT-4 and Google Gemini [Haydarpasa Numune Med J]
Haydarpasa Numune Med J. 2025; 65(2): 165-173 | DOI: 10.14744/hnhj.2025.48154

Ipek Özönder Ünal1, Hafize Miray Aytaç2
1Department of Psychiatry, Tuzla State Hospital, Istanbul, Türkiye
2Department of Psychiatry, Sancaktepe Şehit Prof.Dr. İlhan Varank Training and Research Hospital, Istanbul, Türkiye

INTRODUCTION: Artificial intelligence (AI) is reshaping medical education, and large language models (LLMs) such as ChatGPT-4 (OpenAI) and Google Gemini (Google AI) are increasingly used as learning tools. This study examines the accuracy of ChatGPT-4 and Google Gemini in answering board-level psychiatry examination questions and in classifying question difficulty.
METHODS: This cross-sectional study evaluated ChatGPT-4 and Google Gemini on 993 validated board-style psychiatry questions from BoardVitals. Both models were tested with standardized prompts, and their responses were analyzed for answer accuracy and for agreement with BoardVitals' difficulty categorizations.
RESULTS: Both ChatGPT-4 and Google Gemini achieved high accuracy, significantly surpassing the peer benchmark of 75.95% (p<0.001). Overall accuracy did not differ significantly between the models (ChatGPT-4: 90.4%, Google Gemini: 90.8%; p=0.658). Both models showed only fair agreement with BoardVitals' difficulty categorizations (ChatGPT-4: κw=0.373; Gemini: κw=0.30), and both frequently underestimated the difficulty of hard questions.
DISCUSSION AND CONCLUSION: ChatGPT-4 and Google Gemini answer psychiatry board-style questions with high accuracy, which highlights their potential as adjunctive tools in medical education. However, their limitations in higher-order reasoning and difficulty classification underscore the need for further refinement. Future research should explore AI integration into real-world clinical decision-making, with human oversight to maintain reliability and uphold ethical standards.

Keywords: Academic performance, artificial intelligence, psychiatry.
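
The statistics reported in the abstract can be illustrated with a short Python sketch. The abstract does not name the tests that were used, so everything here is an assumption: a one-sample z-test for a proportion stands in for the benchmark comparison, and a weighted Cohen's kappa with linear weights stands in for the difficulty-agreement measure (the weighting scheme is not stated in the abstract). The function names are hypothetical. The between-model comparison (p=0.658) needs per-question data and is not reproduced.

```python
import math


def one_sample_proportion_z(p_hat, p0, n):
    """Z-test of an observed proportion against a fixed benchmark.

    Assumption: the abstract does not state which test produced p<0.001;
    this normal-approximation test is only one plausible choice.
    """
    se = math.sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided


def weighted_kappa(confusion, quadratic=False):
    """Weighted Cohen's kappa from a square confusion matrix of counts.

    Rows are one rater's ordinal categories, columns the other's.
    Assumption: linear disagreement weights by default.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_marg = [sum(row) / total for row in confusion]
    col_marg = [sum(confusion[i][j] for i in range(k)) / total for j in range(k)]

    observed = 0.0   # weighted observed disagreement
    expected = 0.0   # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            d = abs(i - j) / (k - 1)
            w = d * d if quadratic else d
            observed += w * confusion[i][j] / total
            expected += w * row_marg[i] * col_marg[j]
    return 1.0 - observed / expected


# Accuracies, benchmark, and question count are taken from the abstract.
N_QUESTIONS = 993
BENCHMARK = 0.7595
for name, acc in [("ChatGPT-4", 0.904), ("Google Gemini", 0.908)]:
    z, p = one_sample_proportion_z(acc, BENCHMARK, N_QUESTIONS)
    print(f"{name}: z = {z:.2f}, two-sided p = {p:.3g}")

# Hypothetical 3x3 difficulty table (easy / medium / hard), NOT study data.
example = [[30, 10, 2], [12, 25, 8], [5, 15, 20]]
print(f"weighted kappa (illustrative data) = {weighted_kappa(example):.3f}")
```

Under these assumptions, both accuracies lie roughly ten standard errors above the benchmark, which is consistent with the reported p<0.001. The kappa value printed for the illustrative table has no bearing on the study's reported coefficients.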

Corresponding Author: Ipek Özönder Ünal, Türkiye
Manuscript Language: English