INTRODUCTION: Artificial intelligence (AI) is revolutionizing medical education, with large language models (LLMs) such as ChatGPT-4 (OpenAI) and Google Gemini (Google AI) increasingly used as learning tools. This study examines the accuracy of ChatGPT-4 and Google Gemini in answering board-level psychiatry examination questions and in classifying question difficulty.
METHODS: This cross-sectional study evaluated ChatGPT-4 and Google Gemini on 993 validated board-style psychiatry questions from BoardVitals. Each model was queried with standardized prompts; responses were scored for answer accuracy, and each model's difficulty ratings were compared against BoardVitals' difficulty classifications.
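For illustration, a minimal sketch of such a scoring loop is shown below. The `ask_model` wrapper, prompt wording, and answer-extraction step are hypothetical stand-ins, not the study's actual pipeline.

```python
# Minimal sketch of a multiple-choice scoring loop. `ask_model` is a
# hypothetical wrapper around whichever LLM API is queried; the prompt
# wording and answer-extraction regex are illustrative only.
import re
from typing import Callable, Optional

PROMPT_TEMPLATE = (
    "Answer the following board-style psychiatry question. "
    "Reply with the single letter of the best answer.\n\n{stem}\n{options}"
)

def extract_choice(reply: str) -> Optional[str]:
    """Pull the first standalone answer letter (A-E) from a model reply."""
    match = re.search(r"\b([A-E])\b", reply.upper())
    return match.group(1) if match else None

def score_model(questions: list[dict], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of questions answered correctly."""
    correct = 0
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
        reply = ask_model(PROMPT_TEMPLATE.format(stem=q["stem"], options=options))
        correct += extract_choice(reply) == q["answer"]
    return correct / len(questions)
```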
RESULTS: Both ChatGPT-4 and Google Gemini demonstrated high accuracy, significantly surpassing the peer benchmark of 75.95% (p<0.001). No statistically significant difference was found between the models in overall accuracy (ChatGPT-4: 90.4%; Google Gemini: 90.8%; p=0.658). Both models showed only fair agreement with BoardVitals' difficulty categorizations (ChatGPT-4: κw=0.373; Gemini: κw=0.30), frequently underestimating the difficulty of harder questions.
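The κw statistic reported above is weighted Cohen's kappa, a chance-corrected measure of agreement on ordinal labels. The brief sketch below shows how it can be computed with scikit-learn; the toy data and linear weighting scheme are assumptions for demonstration, as this excerpt does not state which weighting the study used.

```python
# Illustrative computation of weighted Cohen's kappa between a model's
# difficulty ratings and the question bank's labels. The linear weighting
# and the toy data below are assumptions for demonstration only.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ordinal difficulty labels: 0 = easy, 1 = moderate, 2 = hard
bank_labels  = [0, 1, 2, 2, 1, 0, 2, 1]
model_labels = [0, 1, 1, 2, 1, 0, 1, 1]  # model under-rates two hard items

kappa_w = cohen_kappa_score(bank_labels, model_labels, weights="linear")
# By the Landis-Koch convention, values of 0.21-0.40 are read as "fair" agreement.
print(f"weighted kappa = {kappa_w:.3f}")
```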
DISCUSSION AND CONCLUSION: ChatGPT-4 and Google Gemini answer psychiatry board-style questions with high accuracy, highlighting their potential as adjunctive tools in medical education. However, their limitations in higher-order reasoning and in difficulty classification underscore the need for further refinement. Future research should explore AI integration into real-world clinical decision-making, with human oversight retained to maintain reliability and address ethical concerns.