Evaluating the Performance of ChatGPT in Ophthalmology
Ophthalmology Science, ISSN: 2666-9145, Vol: 3, Issue: 4, Page: 100324
2023
- Citations: 227
- Captures: 416
- Mentions: 3
Metric Options: Counts. Selecting the 1-year or 3-year option will change the metrics count to percentiles, illustrating how an article or review compares to other articles or reviews within the selected time period in the same journal. Selecting the 1-year option compares the metrics against other articles/reviews that were also published in the same calendar year. Selecting the 3-year option compares the metrics against other articles/reviews that were also published in the same calendar year plus the two years prior.
Example: if you select the 1-year option for an article published in 2019 and a metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019. If you select the 3-year option for the same article published in 2019 and the metric category shows 90%, that means that the article or review is performing better than 90% of the other articles/reviews published in that journal in 2019, 2018 and 2017.
Citation Benchmarking is provided by Scopus and SciVal and is different from the metrics context provided by PlumX Metrics.
Metrics Details
- Citations: 227
  - Citation Indexes: 226
  - CrossRef: 63
  - Policy Citations: 1
- Captures: 416
  - Readers: 416
- Mentions: 3
  - News Mentions: 3
Most Recent News
ChatGPT may have a future use in glaucoma
Large language models (LLMs) show great promise in the realm of glaucoma, with additional capabilities of self-correction, a recent study found.1
Article Description
Foundation models are a novel type of artificial intelligence algorithm in which models are pretrained at scale on unannotated data and fine-tuned for a myriad of downstream tasks, such as generating text. This study assessed the accuracy of ChatGPT, a large language model (LLM), in the ophthalmology question-answering space. The study design was an evaluation of a diagnostic test or technology.

ChatGPT is a publicly available LLM. We tested 2 versions of ChatGPT (January 9 “legacy” and ChatGPT Plus) on 2 popular multiple-choice question banks commonly used to prepare for the high-stakes Ophthalmic Knowledge Assessment Program (OKAP) examination. We generated two 260-question simulated exams from the Basic and Clinical Science Course (BCSC) Self-Assessment Program and the OphthoQuestions online question bank. We carried out logistic regression to determine the effect of the examination section, cognitive level, and difficulty index on answer accuracy. We also performed a post hoc analysis using Tukey’s test to determine whether there were significant differences between the tested subspecialties.

We reported the accuracy of ChatGPT for each examination section as the percentage correct, comparing ChatGPT’s outputs with the answer key provided by the question banks. We presented logistic regression results with a likelihood ratio (LR) chi-square. We considered differences between examination sections statistically significant at a P value of < 0.05.

The legacy model achieved 55.8% accuracy on the BCSC set and 42.7% on the OphthoQuestions set. With ChatGPT Plus, accuracy increased to 59.4% ± 0.6% and 49.2% ± 1.0%, respectively. Accuracy improved with easier questions when controlling for the examination section and cognitive level. Logistic regression analysis of the legacy model showed that the examination section (LR, 27.57; P = 0.006), followed by question difficulty (LR, 24.05; P < 0.001), were most predictive of ChatGPT’s answer accuracy. Although the legacy model performed best in general medicine and worst in neuro-ophthalmology (P < 0.001) and ocular pathology (P = 0.029), similar post hoc findings were not seen with ChatGPT Plus, suggesting more consistent results across examination sections.

ChatGPT has encouraging performance on a simulated OKAP examination. Specializing LLMs through domain-specific pretraining may be necessary to improve their performance in ophthalmic subspecialties. Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
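The study’s main outcome measure — percentage correct per examination section, obtained by comparing the model’s chosen options against the question banks’ answer keys — can be sketched in a few lines of Python. The records, section names, and difficulty values below are illustrative placeholders, not data from the paper:

```python
from collections import defaultdict

# Hypothetical records: each simulated-exam question has a section,
# a difficulty index, the model's chosen option, and the key's answer.
questions = [
    {"section": "General Medicine",    "difficulty": 0.85, "model": "B", "key": "B"},
    {"section": "General Medicine",    "difficulty": 0.60, "model": "A", "key": "C"},
    {"section": "Neuro-Ophthalmology", "difficulty": 0.40, "model": "D", "key": "A"},
    {"section": "Neuro-Ophthalmology", "difficulty": 0.75, "model": "C", "key": "C"},
    {"section": "Ocular Pathology",    "difficulty": 0.55, "model": "B", "key": "B"},
]

def section_accuracy(items):
    """Percent correct per examination section (the paper's outcome measure)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in items:
        total[q["section"]] += 1
        correct[q["section"]] += (q["model"] == q["key"])
    return {s: 100.0 * correct[s] / total[s] for s in total}

print(section_accuracy(questions))
```

From the same records, a binary correct/incorrect outcome with section, cognitive level, and difficulty as covariates could then be fit with a logistic regression (e.g., via `statsmodels`), mirroring the analysis the abstract describes; the exact model specification used by the authors is in the paper itself.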
Bibliographic Details
- http://www.sciencedirect.com/science/article/pii/S2666914523000568
- http://dx.doi.org/10.1016/j.xops.2023.100324
- http://www.scopus.com/inward/record.url?partnerID=HzOxMe3b&scp=85163557911&origin=inward
- http://www.ncbi.nlm.nih.gov/pubmed/37334036
- https://linkinghub.elsevier.com/retrieve/pii/S2666914523000568
Elsevier BV