Introduction
Machine learning applications have brought about significant advancements in medicine, including dentistry (Ducret et al. 2022, Schwendicke et al. 2022, Haug & Drazen 2023). Among these advancements, a notable development has been the emergence of large language models (LLMs) with a conversational interface, such as ChatGPT, Bard, Baidu’s Ernie Bot, Claude 2, Llama 2, and the chatbot function of the revamped Bing search engine.
These LLMs, underpinned by deep learning transformer architectures, are trained on vast amounts of tokenized text data (Vaswani et al. 2017). This training allows them to generate fluent, contextually pertinent responses based on the input they receive. Their capabilities span a wide range of tasks, from answering questions and summarizing texts to translating languages and writing computer code.
Priming, the practice of providing LLMs with initial, contextually relevant information, is a useful approach to enhance their output quality (Raffel et al. 2020). By initiating a conversation with strategically chosen words, phrases, or longer text excerpts, users can guide LLMs to produce more accurate and contextually congruous responses.
LLMs have many potential use cases in healthcare, including dentistry (Eggmann et al. 2023). For instance, healthcare professionals could soon leverage LLMs to streamline routine administrative tasks and improve patient education. However, LLMs come with a set of significant risks and some inherent limitations (Mello & Guha 2023). Many LLMs operate with knowledge cutoffs, which means they lack up-to-date information (Dashti et al. 2023). Determining the reliability of their response sources can be difficult, if not impossible (Walker et al. 2023). Moreover, LLMs sometimes produce answers that seem plausible but are incorrect, underscoring the importance of human oversight (Dashti et al. 2023). Given these limitations, there are serious concerns regarding the utility and safety of LLMs, especially in high-stakes fields of application such as healthcare (Beam et al. 2023).
In light of the potential implications of LLMs for healthcare, rigorous evaluation of their outputs is paramount. By assessing LLMs’ performance against external benchmarks—including reasoning, coding, and knowledge tests—one can discern their strengths and weaknesses (Kung et al. 2023). Such evaluations can then inform strategies to enhance LLMs’ performance and guard against incautious use.
A prime resource for such evaluations is the University of Bern’s Institute for Medical Education (IML). The IML hosts a digital platform offering a vast array of self-assessment questions tailored for dental and medical students and healthcare professionals (https://self-assessment.measured.iml.unibe.ch/). Among its offerings are multiple-choice questions designed for dental students preparing for the Swiss Federal Licensing Examination in Dental Medicine (SFLEDM) and allergists and immunologists preparing for the European Examination in Allergy and Clinical Immunology (EEAACI). These curated question banks present an ideal tool for assessing and comparing the performance of ChatGPT across distinct medical fields.
Considering the importance of examining the output accuracy of LLMs, this study pursues two objectives. First, it aims to compare the performance of ChatGPT 3 and ChatGPT 4 in responding to the SFLEDM and EEAACI self-assessment questions. Second, it seeks to evaluate the impact of priming on ChatGPT's performance in these assessments.
Materials and Methods
Input sources
The SFLEDM and EEAACI self-assessment questions were obtained from the IML platform on February 13, 2023. While SFLEDM questions were translated from German to English, EEAACI questions were already available in English. Any questions with images or illustrations were excluded. The questions were of two multiple-choice formats:
- A-type questions: These comprised a stem (either a question or a case scenario) followed by potential answers. The task was to identify the single most appropriate answer. Within the SFLEDM and EEAACI self-assessments, these questions had four and five options, respectively.
- Kprim-type questions: These also started with a stem, followed by four related statements or answers. The task was to determine the correctness of each of these statements or answers.
The study used 32 SFLEDM questions, comprising 22 A-type and 10 Kprim-type questions. In total, 28 EEAACI questions were used, comprising 19 A-type and 9 Kprim-type questions. The terms of service of the IML platform restrict the dissemination of these self-assessment questions, even though they are publicly accessible at https://self-assessment.measured.iml.unibe.ch/ (last accessed on October 3, 2023). They are therefore not featured in this report.
Priming
The primers, used to provide context for the questions, comprised details about the respective test, information on the main subject with relevant keywords, and instructions on the question format and response guidelines. They offered a thorough overview of the examination, including the organizing body, its purpose, and the topics covered, and instructed the model to apply scientific reasoning and to adhere to the general guidelines of the respective field when answering questions.
Designed to be analogous in length, structure, style, and content, the primers for the SFLEDM and EEAACI self-assessments underwent several optimization rounds using ChatGPT 3, adhering to principles of effective prompt design. Each trial for the primed groups consistently utilized the same primer.
Conversely, the non-primed groups received a prompt that only supplied basic information about the question format and response guidelines, deliberately omitting context about the examination or topics to maintain succinctness and avoid priming.
Supplementary Table S-I details the input texts used for both primed and non-primed groups prior to administering the multiple-choice questions.
Administering questions to ChatGPT
The tests involving ChatGPT 3 and ChatGPT 4 took place on February 19, 2023, and March 25, 2023, respectively. For each group, 20 trials were conducted. Before initiating each trial, the entire chat history was cleared. A new chat window was then opened to eliminate any potential context carryover. For the non-primed groups, the input prompt contained brief instructions on answering the questions, followed by either the A-type or Kprim-type questions. In contrast, for the primed groups, the primer was introduced before presenting the questions.
Performance assessment
An unblinded investigator recorded ChatGPT's responses in a pilot-tested spreadsheet. For A-type questions, a score of 1 point was given for correct answers and 0 points for incorrect ones. For Kprim-type questions, correctly answering all four related statements or answers earned 1 point. If three out of the four statements or answers were evaluated correctly, 0.5 points were given. A score of 0 points was assigned if fewer than three statements or answers were correctly evaluated.
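To make the scoring scheme concrete, the following minimal R sketch reproduces it for single items; the function and variable names are illustrative and were not part of the study's workflow.

```r
# Minimal sketch of the scoring scheme (illustrative names, hypothetical data).

score_a_type <- function(given, correct) {
  # A-type: 1 point for the single most appropriate answer, otherwise 0
  as.numeric(given == correct)
}

score_kprim <- function(given, correct) {
  # Kprim: 'given' and 'correct' are logical vectors of length 4,
  # one TRUE/FALSE judgment per statement
  n_correct <- sum(given == correct)
  if (n_correct == 4) 1 else if (n_correct == 3) 0.5 else 0
}

score_a_type("C", "C")                                   # returns 1
score_kprim(c(TRUE, FALSE, TRUE, TRUE),
            c(TRUE, FALSE, FALSE, TRUE))                 # returns 0.5
```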
Statistical analysis
For each trial, the attained points were presented as a percentage of the maximum possible points. This percentage was chosen as a performance metric since the maximum points varied between the SFLEDM and EEAACI self-assessments. Descriptive statistics, including mean, standard deviation, median, and interquartile range, were computed for each group. Analysis of the distribution within each group revealed a non-normal distribution, verified using a graphical method (normal probability plot). Performance was analyzed between primed and non-primed groups for both the SFLEDM and EEAACI self-assessments. This comparison was made within each ChatGPT version, as well as across the ChatGPT 3 and ChatGPT 4 subsets. The Wilcoxon rank sum test was used for this analysis.
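As an illustration, the comparison described above could be carried out in base R along the following lines; the data vectors below are hypothetical placeholders, not the study's results.

```r
# Hypothetical percentages of the maximum points attained in 20 non-primed
# and 20 primed trials (placeholder values, not the study's data).
set.seed(1)
pct_nonprimed <- round(runif(20, min = 50, max = 75), 1)
pct_primed    <- round(runif(20, min = 55, max = 80), 1)

# Descriptive statistics per group
summary(pct_nonprimed); sd(pct_nonprimed)
summary(pct_primed);    sd(pct_primed)

# Graphical normality check (normal probability plot)
qqnorm(pct_nonprimed); qqline(pct_nonprimed)

# Wilcoxon rank sum test comparing the primed and non-primed groups
wilcox.test(pct_primed, pct_nonprimed)
```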
To assess the impact of priming, the improvement due to priming was calculated for both the SFLEDM and EEAACI self-assessments within the ChatGPT 3 and ChatGPT 4 subsets. To quantify the improvement, trials—both without and with priming—were ranked within their respective groups based on the percentage of the maximum points attained. These ranks were paired before subtracting the values of the non-primed group from those of the primed group, producing 20 improvement values within each group. Analysis utilizing a normal probability plot confirmed a non-normal distribution of data. Consequently, the Wilcoxon rank sum test was used for group comparisons. The level of significance was set at alpha=0.05. The statistical analyses were performed by an unblinded investigator using R software (version 4.2.2, R Core Team, R Foundation for Statistical Computing, Vienna, Austria). The dataset generated and analyzed in this study is available in an open repository (Fuchs et al. 2023).
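A minimal sketch of this rank-pairing procedure, again using hypothetical placeholder data, might look as follows.

```r
# Hypothetical percentage scores for one ChatGPT version (placeholder values).
set.seed(2)
sfledm_nonprimed <- runif(20, 50, 70)
sfledm_primed    <- runif(20, 55, 75)
eeaaci_nonprimed <- runif(20, 70, 90)
eeaaci_primed    <- runif(20, 72, 92)

improvement <- function(nonprimed, primed) {
  # Rank the trials within each group, pair them by rank,
  # and subtract the non-primed values from the primed values
  sort(primed) - sort(nonprimed)
}

impr_sfledm <- improvement(sfledm_nonprimed, sfledm_primed)
impr_eeaaci <- improvement(eeaaci_nonprimed, eeaaci_primed)

# Compare the improvement due to priming between the two assessments
wilcox.test(impr_sfledm, impr_eeaaci)
```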
Results
Table I and Figure 1 present the detailed results. Both ChatGPT 3 and ChatGPT 4 exhibited superior performance in the EEAACI compared with the SFLEDM assessment (p<0.001). Overall, ChatGPT 4 scored higher than ChatGPT 3 across all groups (p<0.001). The performance gap between ChatGPT 4 and ChatGPT 3 was wider in the EEAACI assessment than in the SFLEDM assessment. In the SFLEDM assessment, without and with priming, the average percentage point increases for ChatGPT 4 over ChatGPT 3 were 5.1 and 4.1, respectively. In contrast, for the EEAACI assessment, these increases were 18.2 (without priming) and 15.0 (with priming).
Priming significantly enhanced the performance of ChatGPT 3 in both the SFLEDM (p=0.012) and EEAACI (p=0.001) assessments. For ChatGPT 4, while there was a significant performance increase in the SFLEDM assessment due to priming (p=0.03), priming had no significant effect on the performance in the EEAACI assessment (p=0.221).
As shown in Table II, with ChatGPT 3, priming enhanced performance more for the EEAACI than for the SFLEDM assessment (p=0.037). Conversely, with ChatGPT 4, priming improved performance more for the SFLEDM than for the EEAACI assessment (p=0.002).
Discussion
This study compared ChatGPT 3's and ChatGPT 4's performance on SFLEDM and EEAACI self-assessment questions. These multiple-choice questions served to benchmark and contrast the LLMs' proficiency in the fields of dentistry and of allergy and immunology. The results showed that both versions performed better on the EEAACI, with ChatGPT 4 surpassing ChatGPT 3 in all tests. Priming notably improved ChatGPT 3's performance in both tests, but only impacted ChatGPT 4 in the SFLEDM assessment.
The observed performance disparity between the EEAACI and SFLEDM assessments suggests that ChatGPT's proficiency may vary across medical specialties. One plausible explanation for this disparity may lie in the nature of the data the LLM has been trained on (Patcas et al. 2022, Bornstein 2023, Walker et al. 2023). Most of the medical literature, discussions, and queries available in open sources focus on broader medical fields, with allergy and immunology being more extensively represented than smaller branches of medicine such as dentistry. Furthermore, in dentistry, diagnoses and treatments frequently rely heavily on physical examinations and imaging, aspects that textual models such as ChatGPT are not adept at grasping. By contrast, allergy and immunology, being more systemic and often reliant on patient history and laboratory results, are better suited for textual analysis and understanding by LLMs.
This study, while specifically pertaining to the SFLEDM and EEAACI assessments, may offer broader implications for the application of LLMs in other medical domains. The observed performance disparities and the impact of priming across different assessments suggest that the effectiveness of LLMs can be significantly influenced by the subject matter. Extending this to other domains, it becomes pivotal to consider the availability and specificity of training data, as well as the inherent characteristics of the medical field in question. For instance, medical specialties that heavily rely on textual information and have abundant data available might observe better LLM performance, akin to the results seen in the EEAACI assessment. Conversely, fields that depend more on visual or practical elements may present additional challenges for LLMs, as seen in the SFLEDM assessment. Further research is warranted to explore these dynamics, identifying patterns and strategies to optimize LLMs’ performance across diverse medical specialties.
ChatGPT's performance has been studied across various medical knowledge examinations, with the accuracy rates demonstrating considerable variation among different tests and medical disciplines. A recent systematic review and meta-analysis revealed that the performance range for ChatGPT 3.5 in these evaluations spanned from 40% to 100%, with an average accuracy rate of 61.1% (Levin et al. 2023). The mean performance of 63.3% in the SFLEDM assessment aligns with this average across medical domains. In contrast, ChatGPT's performance in the EEAACI assessment yielded a higher average score of 79.3%, placing it at the top of the range reported in other studies (Levin et al. 2023). It is noteworthy that ChatGPT 4, when primed, exceeded the commonly used passing threshold of 60% in the SFLEDM assessment. This level of performance was observed in the EEAACI assessment across the two examined ChatGPT iterations, regardless of priming.
In dental and medical education, LLM chatbots hold potential for supplementing learning materials and providing interactive learning opportunities (Ali et al. 2023). However, it is important to emphasize that while ChatGPT 3 and ChatGPT 4 showed promise in answering self-assessment questions from the SFLEDM and EEAACI, they should not be relied on for exam preparation. Despite their capabilities, these LLMs frequently provide inaccurate or misleading information (Mello & Guha 2023). Relying on ChatGPT for exam preparation, especially in critical fields like healthcare, could lead to misconceptions and an incomplete understanding of the subject matter (Levin et al. 2023, Saad et al. 2023). Therefore, it is crucial to always approach their outputs with caution and cross-reference with trusted educational resources. Today, LLMs should serve merely as supplements to more traditional methods of information seeking (Mello & Guha 2023).
The noticeable performance improvement from ChatGPT 3 to ChatGPT 4, the latter available only to paying subscribers, underscores the rapid advancements in LLM development within a short period. Comparable enhancements in response quality from ChatGPT 4 have been noted for queries related to dermatology and myopia (Lewandowski et al. 2023, Lim et al. 2023). Moreover, while ChatGPT 3 operated solely with text, ChatGPT 4 is multimodal, allowing it to accept image as well as text inputs. This shift to multimodality represents a substantial enhancement in ChatGPT's functionality. The increasing adaptability of LLMs suggests that they might soon serve as additional tools in specific use cases in healthcare (Vaira et al. 2023).
However, on the road to artificial general intelligence, LLMs underpinned by the next-token-prediction paradigm are likely an off-ramp (Marcus 2022). Their capabilities, based on brute statistics, are impressive, but their genuine understanding remains shallow (Thirunavukarasu 2023). Medical professionals, including allergists, immunologists, and dentists, are therefore unlikely to face major changes from the widespread adoption of LLM applications (Thirunavukarasu 2023).
Priming and adept prompt design serve as strategic tools to guide LLMs towards generating more contextually congruous responses (Raffel et al. 2020). The results of this study are in line with this assertion, particularly with ChatGPT 3, where priming significantly enhanced its performance in both the SFLEDM and EEAACI assessments. However, whereas priming exhibited a significant impact on ChatGPT 4's performance in the SFLEDM assessment, its influence was negligible in the EEAACI assessment. This difference underscores the evolving nature of LLMs and suggests that as these models become more advanced, the relative impact of priming may vary depending on the complexity and specificity of the task at hand.
This study has several limitations that warrant careful consideration. First, the questions from the IML platform, specifically the SFLEDM and EEAACI self-assessments, represented only a narrow spectrum of knowledge within dentistry, allergy, and immunology. This limits the generalizability of the findings.
Second, tasks like answering board examination questions or retrieving information from medical records have only a tangential connection to real-world care decisions (Mello & Guha 2023). This means that assessments using such tasks as benchmarks offer limited insight into an LLM's usefulness for clinical decision support (Mello & Guha 2023).
Third, the translation of SFLEDM questions from German to English introduced potential biases, as nuances in language might affect the LLM's comprehension and response accuracy.
Fourth, the exclusion of questions with images or illustrations omits a significant aspect of medical assessments, which often rely on visual diagnostics and the interpretation of data charts and graphs.
Fifth, an unblinded evaluator recorded and graded ChatGPT’s responses to the multiple-choice questions. Since the answer key for these questions was objective and definitive, allowing no room for interpretation or discretion, calibration procedures, evaluator blinding, and employment of multiple evaluators were foregone. Nonetheless, to guard against potential biases inherent in unblinded assessments—even when utilizing unequivocal answer keys—future investigations should consider implementing evaluator calibration and blinding.
Sixth, by focusing solely on two versions of ChatGPT, the study did not capture the full range of LLM capabilities across various models or iterations. These limitations emphasize the critical need for additional research to thoroughly evaluate the performance and potential impact of LLMs in medical disciplines.
Conclusion
Within the constraints of this study, the following conclusions were drawn:
- ChatGPT 3 and ChatGPT 4 both demonstrated stronger performance on the EEAACI compared with the SFLEDM assessment. This performance disparity highlights ChatGPT's varying proficiency across different medical domains, likely influenced by the type and volume of training data available in each field.
- Priming improved ChatGPT 3's performance across both assessments. For ChatGPT 4, while priming influenced results in the SFLEDM assessment, its effect was negligible for the EEAACI. This underscores the nuanced influence of priming as LLMs become more advanced.
- The progress from ChatGPT 3 to ChatGPT 4 reveals rapid advancements in LLM development, including the shift to multimodality. Yet, their enhanced capabilities notwithstanding, LLMs have major inherent limitations and risks, emphasizing the need for cautious use in high-stakes fields such as healthcare.
Summary
Introduction
Artificial intelligence (AI) applications can offer various benefits to healthcare professionals, including dentists. Large language models (LLMs) are AI applications that are trained on large amounts of text data and can perform a variety of language-related tasks. ChatGPT, an LLM with a conversational interface, was launched in November 2022 and is available online. Despite its impressive capabilities, ChatGPT has considerable limitations and shortcomings. For example, it sometimes gives incorrect answers or presents misinformation as fact. Before LLMs are applied in medical disciplines, it is essential to understand their capabilities and limits. One interesting approach is priming, in which an LLM is given relevant information in advance to improve the quality of its responses. This study focuses on evaluating the performance of ChatGPT versions 3 and 4 in the medical fields of dentistry and of allergy and clinical immunology, with particular attention to the priming effect.
Materials and Methods
For a comprehensive evaluation of ChatGPT, multiple-choice self-assessment questions in dental medicine (Swiss Federal Licensing Examination in Dental Medicine [SFLEDM]) and in allergy and clinical immunology (European Examination in Allergy and Clinical Immunology [EEAACI]) were compiled from the Institute for Medical Education of the University of Bern. ChatGPT 3 and 4 were tested under two conditions: with and without priming. The primary performance metric was the accuracy rate, measured as the number of correctly answered questions. Statistical analyses were performed using Wilcoxon rank sum tests with a significance level of alpha = 0.05.
Results
In the SFLEDM assessment, the average accuracy rate was 63.3%. By contrast, ChatGPT performed better in the EEAACI assessment, with an average accuracy of 79.3%. Both ChatGPT models performed better in the EEAACI assessment than in the SFLEDM assessment. Notably, ChatGPT 4 consistently outperformed ChatGPT 3 in both assessments. Regarding priming, ChatGPT 3 showed a clear improvement with priming both for the EEAACI questions (p=0.001) and for the SFLEDM questions (p=0.012). In contrast, priming significantly improved ChatGPT 4's performance only in the SFLEDM assessment (p=0.03).
Discussion
ChatGPT's differing performance in answering multiple-choice questions from the SFLEDM and EEAACI assessments points to varying proficiency of LLMs across medical fields. This variation may be influenced by the type and volume of training data available for each field. Priming proves to be a useful method for improving LLM performance, particularly for older versions such as ChatGPT 3. The significant performance gain from ChatGPT 3 to 4 underscores the rapid developments in LLM technology. Nevertheless, the use of LLMs in healthcare, including dentistry, calls for the utmost care and caution, as LLMs still carry numerous limitations and risks.
Summary
Introduction
Artificial intelligence (AI) applications can offer various benefits to healthcare professionals, including dentists. Large language models (LLMs) are AI applications trained on large amounts of text data and capable of performing a variety of language-related tasks. ChatGPT, an LLM with a conversational interface, was launched in November 2022 and is available online. Despite its impressive capabilities, ChatGPT has considerable limitations and shortcomings. For example, it sometimes gives erroneous answers or presents misinformation as fact. Given the critical nature of medical disciplines, it is very important to understand the capabilities and limits of LLMs. One interesting approach is priming, which consists of giving an LLM relevant information in advance to improve the quality of its responses. This study focuses on evaluating the performance of ChatGPT versions 3 and 4 in the medical fields of dentistry and of allergy and clinical immunology, with particular attention to the priming effect.
Materials and Methods
For a comprehensive evaluation of ChatGPT, multiple-choice self-assessment questions in dental medicine (Swiss Federal Licensing Examination in Dental Medicine [SFLEDM]) and in allergy and clinical immunology (European Examination in Allergy and Clinical Immunology [EEAACI]) were compiled from the Institute for Medical Education of the University of Bern. ChatGPT 3 and 4 were tested under two conditions: with and without priming. The primary performance criterion was the accuracy rate, measured as the number of correctly answered questions. Statistical analyses were performed using Wilcoxon rank sum tests with a significance level of alpha = 0.05.
Results
In the SFLEDM assessment, the average accuracy rate was 63.3%. By contrast, ChatGPT performed better in the EEAACI assessment, with an average accuracy of 79.3%. Both ChatGPT models performed better in the EEAACI assessment than in the SFLEDM assessment. Notably, ChatGPT 4 consistently outperformed ChatGPT 3 in both assessments. Regarding priming, ChatGPT 3 showed a clear improvement with priming both for the EEAACI questions (p=0.001) and for the SFLEDM questions (p=0.012). In contrast, priming significantly improved ChatGPT 4's performance only in the SFLEDM assessment (p=0.03).
Discussion
ChatGPT's differing performance in answering multiple-choice questions from the SFLEDM and EEAACI assessments may indicate varying proficiency of LLMs across medical fields. This variation in proficiency may be influenced by the type and volume of training data available for each field. Priming proves to be a useful method for improving LLM performance, particularly for older versions such as ChatGPT 3. The significant performance gain from ChatGPT 3 to 4 underscores the rapid developments in LLM technology. Nevertheless, the use of LLMs in healthcare, including dentistry, requires the utmost prudence and care, as LLMs still present numerous limitations and risks.