Should You Use AI to Diagnose Patients? Foresight and ChatGPT
AI is everywhere these days, and it is only a matter of time before people start using it to diagnose themselves, or someone else.
I would not, at least not yet. There have been almost no tests of Large Language Models (LLMs) like ChatGPT in real-world healthcare or on real patient data. The fact that ChatGPT or MedPaLMv2 can pass a medical exam (example), or answer multiple-choice questions, does not really say much, as we will see from the tests below. However, it does demonstrate the potential for AI in healthcare.
Together with Joshua Au Yeung et al., I recently published a research paper that explores the performance of ChatGPT and Foresight in diagnosing patients. I'll share the key findings here, but if you are interested, please have a look at the full paper (frontiers). We focused on these two models because MedPaLMv2 (or even v1) was not made public (thanks google), and other models like BioGPT or T5 are not suitable for diagnosing patients.
Before discussing the outcomes, challenges, and prospects, here is a brief introduction to Foresight and ChatGPT: what these models are and, at a high level, how they work.
What are Foresight and ChatGPT
Both Foresight and ChatGPT are Generative Pretrained Transformers (GPTs), models built on the transformer architecture first introduced by Google in 2017. The transformer achieved state-of-the-art performance on a wide range of tasks (e.g. Named Entity Recognition, Question Answering, Text summarisation), surpassing the then-prevalent architecture in NLP, the recurrent neural network (RNN). Today’s transformers have billions of parameters and are trained on vast amounts of text data (i.e. billions of words), resulting in large language models (LLMs) that exhibit human-like proficiency in numerous NLP tasks. They also possess emergent properties, enabling them to tackle tasks they were never explicitly trained for, like recommending medications, diagnosing people, and other fun things.
Foresight (link to paper) is a specialized biomedical model, trained on real-world hospital data specifically for the use-case of predicting a patient’s future. As input, it takes a patient timeline, in other words a sequence of medically relevant past events, and uses that to predict what comes next for the patient (Fig. 1).
ChatGPT is a generalist: it is currently the most widely used, and possibly the most versatile, publicly available LLM. Although the specifics of its training data are not publicly available, it is believed to have been trained on a vast amount of internet text, likely including some biomedical and patient data. ChatGPT, as the name implies, is a chatbot: it takes text as input (e.g. a description of a patient’s biomedical history) and answers a given question (e.g. what is the most likely diagnosis, Fig. 2).
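To make the input format concrete, here is a minimal sketch of how a Foresight-style model could consume a patient timeline and propose the next events. The event strings, serialisation format, and the use of a generic GPT-2 checkpoint as a stand-in are all illustrative assumptions; the real Foresight model, vocabulary, and tokenizer are not publicly released.

```python
# Minimal sketch: feed a patient timeline to a causal LM and ask for the next events.
# GPT-2 is only a placeholder for a Foresight-style model trained on concept sequences.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# A patient timeline: chronologically ordered, medically relevant events (illustrative).
timeline = [
    "<AGE:54>", "<SEX:M>",
    "Type 2 diabetes mellitus",
    "Essential hypertension",
    "Chest pain",
    "Elevated troponin",
]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Serialise the timeline into a single prompt and score the next token.
prompt = " | ".join(timeline) + " | "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]      # scores for the next token
top5 = torch.topk(logits, k=5).indices
print([tokenizer.decode(int(t)) for t in top5])  # 5 most likely continuations
```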
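For comparison, querying ChatGPT for diagnoses looks roughly like the sketch below. The prompt wording, system message, and model choice are assumptions for demonstration only, not the exact prompts used in our paper.

```python
# Minimal sketch of asking ChatGPT for likely diagnoses via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "62-year-old male smoker with a history of hypertension presents with "
    "sudden-onset crushing central chest pain radiating to the left arm, "
    "sweating and nausea."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a clinical decision support assistant."},
        {"role": "user", "content": f"{vignette}\n\nList the 5 most likely diagnoses."},
    ],
)
print(response.choices[0].message.content)
```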
The experimental setup
To simulate a real-world scenario, we crafted synthetic clinical histories (in the style of a vignette) and tasked the models with predicting the 5 most likely diagnoses. The vignettes were provided as prompts to Foresight (in the form of a timeline, like in Fig. 1) and to ChatGPT (in the form of free text, like in Fig. 2). Five clinicians then scored the relevancy of each prediction and recorded whether any crucial diagnoses were missing. Relevancy was chosen over accuracy because clinicians frequently disagreed on which concepts were the most ‘correct’.
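As a rough sketch of the relevancy metric: every forecasted concept is judged relevant or not by clinicians, and we report the fraction judged relevant. The scoring below is a simplification of the paper's actual protocol (five clinicians, plus labels for missing crucial diagnoses), with made-up example data.

```python
# Simplified relevancy computation over clinician judgements (illustrative data).
from statistics import mean

# clinician_scores[vignette] -> relevance judgement for each of the top-5 predictions
clinician_scores = {
    "vignette_01": [True, True, False, True, True],
    "vignette_02": [True, False, True, True, True],
}

def relevancy(scores: dict[str, list[bool]]) -> float:
    """Fraction of forecasted concepts judged relevant, pooled across vignettes."""
    flat = [judged for per_vignette in scores.values() for judged in per_vignette]
    return mean(flat)

print(f"Top-5 relevancy: {relevancy(clinician_scores):.0%}")
```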
Results
Both models performed well quantitatively (Fig. 3), with Foresight slightly ahead of ChatGPT on relevancy (83% vs 78% for the top 5 forecasted concepts, Fig. 3). However, clinicians reported that 21 out of 35 (60%) vignette outputs from ChatGPT were missing one or more crucial diagnoses (example in Fig. 4), which is unsurprising given that ChatGPT is not domain-specific. Qualitatively, ChatGPT generates substantially more eloquent free text, but often suggests superficial, high-level disease categories (e.g. cardiac arrhythmia) rather than specific diseases, whereas Foresight outputs much more specific suggestions (e.g. right bundle branch block).
In some simple cases, ChatGPT and even GPT-4 completely miss the obvious diagnosis, as in Fig. 5, a common-sense test for symptoms indicating pregnancy.
The good and the bad
The scores both Foresight and ChatGPT achieve are outstanding; a relevancy of 93% for the top prediction on 35 fairly difficult and limited patient histories is no small feat. However, it also means that around 7% of top predictions are not relevant, and patients would likely be unhappy if their doctor told them there was close to a 1-in-10 chance their diagnosis was completely wrong. Moreover, the performance of these models may be limited for certain demographics, such as children, and they may exhibit biases towards specific ethnicities.
Additionally, LLMs like ChatGPT occasionally hallucinate; in other words, the output is completely made up. To make matters worse, even when LLMs hallucinate they write in a very confident style, so unless the reader is an expert in the field they cannot tell whether the output is a hallucination. This is a problem in healthcare, because not everyone is a healthcare professional, and a model that confidently provides an incorrect diagnosis or harmful advice could cause significant harm to patients.
How do we do this then? All of these problems and challenges simply mean we need to be careful when deploying AI models in healthcare. Firstly, we should validate every possible aspect of an AI model on real patient data together with healthcare professionals. Secondly, I do not think these models should be used by patients to diagnose themselves; there are many other useful applications:
Decision support tool - Clinicians could use it as a second pair of eyes, making sure a diagnosis is not missed or something overlooked.
Continual patient monitoring - a good thing about an AI is it does not get tired (probably), so it could monitor a patient (or millions of them) 24/7 for any kind of anomaly and do real-time risk prediction and alerting.
Digital twins - a digital twin of a patient can be used to simulate how different procedures and therapies will work.
Education - A model properly trained on medical guidelines, literature, and published papers, and fine-tuned by clinicians, could be a valuable source of information for patients and medical students. The model can still make mistakes, but these can be kept to a minimum with continual learning.
Clinical trials - Matching patients to clinical trials is a very slow and arduous process; an AI like Foresight could generate patient representations that can be matched to trials in seconds (see the sketch after this list).
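As a purely illustrative sketch of the clinical-trials idea (not how Foresight builds its representations internally), embedding-based matching could look something like this: rank trials for each patient by the similarity between a patient embedding and embeddings of each trial's eligibility criteria.

```python
# Illustrative patient-trial matching via cosine similarity of embeddings.
# Random vectors stand in for representations a model like Foresight would produce.
import numpy as np

rng = np.random.default_rng(0)

patient_vecs = {"patient_A": rng.normal(size=128), "patient_B": rng.normal(size=128)}
trial_vecs = {"TRIAL-001": rng.normal(size=128), "TRIAL-002": rng.normal(size=128)}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank trials for each patient by similarity of representations.
for patient, p_vec in patient_vecs.items():
    ranked = sorted(trial_vecs, key=lambda t: cosine(p_vec, trial_vecs[t]), reverse=True)
    print(patient, "->", ranked)
```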
There are many ways in which AI can improve healthcare, make it safer, more efficient, and in a way transition it from reactive to preventative (but this is a topic for another day).