- Jin, Qiao;
- Chen, Fangyuan;
- Zhou, Yiliang;
- Xu, Ziyang;
- Cheung, Justin;
- Chen, Robert;
- Summers, Ronald;
- Rousseau, Justin;
- Ni, Peiyun;
- Landsman, Marc;
- Baxter, Sally;
- AlAref, Subhi;
- Li, Yijia;
- Chen, Alexander;
- Brejt, Josef;
- Chiang, Michael;
- Peng, Yifan;
- Lu, Zhiyong
Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice answers alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V's rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V performs comparably to human physicians regarding multi-choice accuracy (81.6% vs. 77.8%). GPT-4V also performs well in cases that physicians answer incorrectly, with over 78% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales even in cases where it makes the correct final choice (35.5%), most prominently in image comprehension (27.2%). Despite GPT-4V's high accuracy on multi-choice questions, our findings underscore the need for further in-depth evaluation of its rationales before integrating such multimodal AI models into clinical workflows.
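
For readers interested in how headline metrics like these are typically aggregated, the sketch below shows one way to compute them from per-case annotations. It is a minimal illustration, not the study's actual evaluation pipeline: the record fields, the three rationale-flaw flags, and the annotation scheme are all assumptions made for the example.

```python
# Illustrative sketch only. Aggregates accuracy and rationale-flaw rates from
# hypothetical per-case annotations; the field names below are assumptions,
# not the authors' schema.
from dataclasses import dataclass


@dataclass
class CaseAnnotation:
    gpt4v_correct: bool      # GPT-4V selected the correct answer choice
    physician_correct: bool  # physicians selected the correct answer choice
    image_flaw: bool         # rationale misinterpreted the image
    knowledge_flaw: bool     # rationale recalled medical knowledge incorrectly
    reasoning_flaw: bool     # rationale's step-by-step logic was unsound


def summarize(cases: list[CaseAnnotation]) -> dict[str, float]:
    n = len(cases)
    # Overall multi-choice accuracy for model and physicians.
    gpt4v_acc = sum(c.gpt4v_correct for c in cases) / n
    physician_acc = sum(c.physician_correct for c in cases) / n

    # Model accuracy restricted to cases physicians answered incorrectly.
    phys_wrong = [c for c in cases if not c.physician_correct]
    rescue_acc = (
        sum(c.gpt4v_correct for c in phys_wrong) / len(phys_wrong)
        if phys_wrong else float("nan")
    )

    # Among cases the model answered correctly, how often any rationale
    # component (image, knowledge, reasoning) was judged flawed.
    correct = [c for c in cases if c.gpt4v_correct]
    any_flaw = sum(
        c.image_flaw or c.knowledge_flaw or c.reasoning_flaw for c in correct
    )
    flawed_rationale_rate = any_flaw / len(correct) if correct else float("nan")
    image_flaw_rate = (
        sum(c.image_flaw for c in correct) / len(correct)
        if correct else float("nan")
    )

    return {
        "gpt4v_accuracy": gpt4v_acc,
        "physician_accuracy": physician_acc,
        "gpt4v_accuracy_on_physician_misses": rescue_acc,
        "flawed_rationale_rate_given_correct": flawed_rationale_rate,
        "image_flaw_rate_given_correct": image_flaw_rate,
    }
```

Under this framing, the paper's 35.5% figure corresponds to `flawed_rationale_rate_given_correct` and the 27.2% figure to `image_flaw_rate_given_correct`, both conditioned on the subset of cases the model answered correctly.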