Study compares accuracy of ChatGPT-3.5 and GPT-4 in diagnosing conditions in skin of color

New evidence suggests there is no significant difference between the artificial intelligence (AI) models GPT-3.5 and GPT-4 in diagnostic accuracy for patients with and without skin of color, whether based on examination findings or histopathology. Both models achieved high diagnostic accuracy, ranging from 72% to 100%.1

These data come from a new research letter, which opens by noting the growing literature on the use of ChatGPT in dermatology, driven by the application’s ability to generate human-like text responses to user input.

The researchers – led by Simal Qureshi of the Faculty of Medicine at Memorial University of Newfoundland in Canada – noted that the application offers both a standard model (GPT-3.5) and a premium version (GPT-4), the latter with greater processing capacity. Qureshi and colleagues also pointed to the widely held concern that AI training datasets often underrepresent skin of color.2

“However, there are no studies that have examined the accuracy of this model in providing clinical information on [skin of color], which could be a valuable tool for clinicians and medical students,” Qureshi and colleagues wrote. “We therefore wanted to understand the accuracy of ChatGPT in diagnosing dermatologic diseases in both [skin of color] and non-[skin of color] cases.”1

Study design

The research team evaluated a total of 29 cases: 14 taken from a general dermatology textbook, which formed the non-skin of color group, and 15 taken from a dermatology textbook dedicated to skin of color, which formed the skin of color group.

The cases spanned a variety of skin conditions across both cohorts. The investigators entered each case’s medical history and physical examination details into GPT-3.5 and GPT-4 using medical terminology, and asked each model to provide the three most likely differential diagnoses.

When additional diagnostic data, such as imaging or laboratory results, were available, the team entered those as well. The application was then asked to provide a final diagnosis.
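
As a rough illustration of this two-step workflow, the sketch below submits a case to both models through the OpenAI chat completions API. The model identifiers, prompt wording, and case text are placeholders for illustration, not the authors’ actual protocol.

```python
# Minimal sketch of the two-step prompting workflow described above.
# The prompt wording, model names, and case text are illustrative
# assumptions, not the authors' actual protocol.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    """Send a single-turn prompt to a chat model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical case text; the study used history and exam details from textbooks.
case = (
    "History: 34-year-old patient with a 2-week pruritic eruption. "
    "Exam: annular, scaly plaques on the trunk."
)

for model in ("gpt-3.5-turbo", "gpt-4"):
    # Step 1: ask for the three most likely differential diagnoses.
    differentials = ask(model, case + "\nList the three most likely differential diagnoses.")
    # Step 2: add further data (labs, imaging, histopathology) and request a final diagnosis.
    final = ask(model, case + "\nKOH preparation: positive for branching hyphae.\nWhat is the final diagnosis?")
    print(model, differentials, final, sep="\n---\n")
```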

Histopathology reports were available for 12 of the skin of color cases and were also entered into the models to arrive at a final diagnosis.

Chi-square tests were then performed to evaluate and compare the diagnostic accuracy of GPT-3.5 and GPT-4 within the skin of color and non-skin of color cohorts.
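
For readers curious about the mechanics, comparing diagnostic accuracy this way reduces each cohort to a 2x2 table of correct versus incorrect diagnoses per model. The counts below are invented placeholders, not the study’s data.

```python
# Sketch of a chi-square comparison of diagnostic accuracy.
# The counts are invented placeholders, not the study's data.
from scipy.stats import chi2_contingency

# Rows: GPT-3.5, GPT-4; columns: correct, incorrect diagnoses within one cohort.
table = [[10, 5],
         [12, 3]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, P = {p:.3f}")  # P > .05 would indicate no significant difference
```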

ChatGPT-3.5 compared to GPT-4

The age and gender distributions of the two study groups were similar. However, the case descriptions were longer for the non-skin of color cases than for the skin of color cases (251.4 vs 145.9 words; P = .01). This difference in length did not correlate with either model’s diagnostic accuracy (r = .11 and .26, respectively).
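
A check of this sort treats each case as a (word count, correct-or-not) pair and computes a point-biserial correlation, which is simply Pearson’s r with a binary variable. The data below are invented to illustrate the computation.

```python
# Sketch of correlating case word count with diagnostic correctness.
# The data points are invented for illustration only.
from scipy.stats import pearsonr

word_counts = [251, 146, 198, 310, 122, 175, 260, 140]  # words per case (hypothetical)
correct = [1, 1, 0, 1, 0, 1, 1, 0]                      # 1 = correct diagnosis

r, p = pearsonr(word_counts, correct)  # point-biserial correlation via Pearson's r
print(f"r = {r:.2f}, P = {p:.2f}")
```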

Overall, the researchers found no significant difference between GPT-3.5 and GPT-4 in the accuracy of either differential diagnoses or final diagnoses based on additional investigations or histopathology. Notably, GPT-3.5’s accuracy decreased when additional clinical data were added, while GPT-4’s accuracy improved.

Using GPT-4, a correct diagnosis was reached in 100% of the skin of color cases that underwent histopathology, compared with an accuracy rate of 66.7% for GPT-3.5. However, the research team did not find this difference statistically significant (P = .093).
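
The reported P value is consistent with an exact test on the 12 histopathology cases per model: 12/12 correct for GPT-4 versus 8/12 (66.7%) for GPT-3.5. The letter describes chi-square testing, so treating this as Fisher’s exact test is an assumption, but it reproduces the reported figure.

```python
# Reproducing the reported P value, assuming 12 histopathology cases per
# model: GPT-4 12/12 correct, GPT-3.5 8/12 correct (66.7%). Using Fisher's
# exact test here is an assumption; the letter describes chi-square testing.
from scipy.stats import fisher_exact

#         correct  incorrect
table = [[12, 0],   # GPT-4
         [8,  4]]   # GPT-3.5

odds_ratio, p = fisher_exact(table)  # two-sided by default
print(f"P = {p:.3f}")  # ~ .093, matching the reported value
```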

What these findings mean

This new research demonstrated that both AI models maintained high diagnostic accuracy (72% to 100%), and the team concluded that this performance was comparable across the two patient cohorts.

The researchers acknowledged the limitations of their study, pointing out that the sample was small and that the skin of color cohort consisted only of patients identified as Black or Hispanic. In addition, the cases were selected from just two textbooks, suggesting limited generalizability.

“As AI tools become more widely used in clinical practice, dermatologists need to understand the impact on different skin types,” they wrote. “To improve the effectiveness of ChatGPT in diagnosing conditions in [skin of color], the scientific literature used to train the model needs to include more studies with larger samples in patients with [skin of color].”

References

  1. Qureshi S, Alli SR, Ogunyemi B. Accuracy of ChatGPT-3.5 and GPT-4 in diagnosing clinical scenarios in skin of color dermatology. Int J Dermatol. 2024. https://doi.org/10.1111/ijd.17425
  2. Butt S, Butt H, Gnanappiragasam D. Unintended consequences of artificial intelligence in dermatology for patients with skin of color. Clin Exp Dermatol. 2021;46(7):1333-1334.

By Bronte
