I was writing the introduction for a new paper and thought it might be interesting to some of you. Also, by the time the paper gets published, this overview will be so outdated that no one will care about it, so better to make it available like this (which tells you something about how well academic publishing works these days). In any case, here is an overview of how LLMs are used in healthcare (or medicine; I sometimes use these interchangeably). If I missed something big, let me know.
Introduction
Language plays a central role in healthcare and medical practice, and free text is the most abundant resource in Electronic Health Records (EHR)1. Yet today, AI models have largely failed to utilize this resource and mostly ignore the free-text portion of the EHR. More recently, Large Language Models (LLMs) have shown the potential to understand human language, but even the most advanced LLMs (as well as specialised medical LLMs, e.g. MedPaLM-2) in general do not use the free-text portion of the EHR. What is more, most LLMs are not trained/tested/validated on real-world hospital data, but on medical quizzes and exams.
Today’s large language models have seen a remarkable evolution. Initial models like BERT, RoBERTa, T5, GPT-1 and GPT-2 set the stage. The BERT family notably changed natural language processing (NLP), largely replacing RNN-based models in tasks such as Named Entity Recognition (NER) and text classification. Meanwhile, the GPT series, focused solely on text generation, sought to predict the next word in a sequence. Despite initial limitations (prompts had to be very specific, and the models would often hallucinate), these models showed potential.
But the real NLP revolution started with highly capable general LLMs such as ChatGPT, GPT-4 and LLaMA-2. These models enabled use cases that were previously either extremely difficult or completely impossible. Tasks such as document summarization, text classification, code writing, and question answering were now reduced to simple prompting. Today, the state of the art for a wide range of NLP tasks is reset with almost every new LLM release.
How Is This Applied to the Medical Domain
The current LLM research in the medical domain can be split into three groups: 1) Directly using LLMs on medical tasks without any fine-tuning (mainly via prompt engineering); 2) Fine-tuning existing LLMs for the medical domain; and 3) Training LLMs from the ground up on medical data (this approach is very rare).
Group 1 | Evaluating Existing LLMs on Medical Tasks
A large amount of recent work has focused on approaches from this group. Khan et al.2 test GPT-4 on anesthesiology board-style examination questions: in total, they collected 884 questions and prompted GPT-4 for answers. They show promising results but note that GPT-4 is still lacking in this area and that more research is needed for both validation and training. Lonergan et al.3 show similar results for surgery: they collected 23,035 questions from MedMCQA and prompted GPT-4 for answers. They note that GPT-4 shows promising results, but still requires more training and testing. On a similar note, Thomas et al.4 explore how to construct prompts so that the reasoning style of GPT-4 matches that of clinicians; the dataset used is a modified MedQA USMLE.
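The setup in most of these papers boils down to prompt-and-parse: format each multiple-choice question as a prompt, send it to the model, and extract a single option letter from the reply. A minimal sketch of that loop (the helper names and example question are mine, not from any of the cited papers, and the model call itself is faked here):

```python
import re

def build_prompt(question, options):
    """Format a MedMCQA-style multiple-choice question as a prompt.

    `question` is the question stem, `options` a list of answer strings.
    """
    letters = "ABCD"
    numbered = "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
    return (
        "Answer the following medical exam question with a single letter.\n\n"
        f"{question}\n{numbered}\nAnswer:"
    )

def parse_answer(reply):
    """Pull the first standalone option letter (A-D) out of the model's reply."""
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None

prompt = build_prompt(
    "Which vitamin deficiency causes scurvy?",
    ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"],
)
# The prompt would then be sent to GPT-4 (or any other LLM);
# here we fake a free-text reply to show the parsing step.
print(parse_answer("The correct answer is C, vitamin C."))  # C
```

Accuracy on the collected question set is then just the fraction of parsed letters that match the gold answers; most of the papers in this group are variations on exactly this loop.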
Most of the research in this group can be summarised as: a research group had a dataset lying around and decided to write an easy paper (there are some really nice papers, but a lot of them do look like they are there to fill the quota).
Group 2 | Fine-tuning Existing LLMs on Medical Tasks
There is significantly less work in this group. MedPaLM 1&2 are closed-source (and closed-access) models from Google that build on the PaLM architecture5. The models are trained on QA-style datasets and show state-of-the-art results on the USMLE-style questions from MultiMedQA. MediTRON 70B builds on top of LLaMA-2 70B and is fine-tuned on medical papers and clinical guidelines; the model is primarily tested on QA-style datasets, where it outperforms general models like GPT-3.5 and comes close to closed-source medical models like MedPaLM 2.
Fine-tuning existing LLMs on medical data currently outperforms general LLMs (including GPT-4) on QA-style tasks, but it is important to note that the performance is not that much better. It is hard to say why this is; my assumption would be that many of these MedQA-style questions are already part of the training data for GPT-4, and that very little additional specialised medical data (e.g. from hospitals6) is used to further fine-tune the models.
Group 3 | Training Medical LLMs from the Ground Up
And lastly, there are only a few examples in this group. Yang et al.7 train a large language model with 8.9B parameters on a dataset of >90 billion words (including >82B words of de-identified clinical text) and evaluate it on clinical NLP tasks, including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference, and medical question answering. Most other examples in this group are not what we would today consider LLMs (and are not generative); those include models like BioBERT and ClinicalBERT.
This approach is mostly being abandoned, because LLMs need a wide variety of data to learn properly. Training models on one task improves their performance on others (this is especially evident with coding: teaching models how to code improves their overall performance).
What is Missing?
Given the examples above, the vast majority of training/validation was performed on medical quizzes and not on real-world hospital data, highlighting the disparity between real-world use cases and LLM research in the medical domain. The exception is the work from Yang et al., which was trained on hospital data but still tested mainly on public benchmarks for medical question answering, named entity recognition, and similar.
While the current work is great and showcases some of the possibilities of LLMs, it would be good to also start using hospital data (for both training and testing). Hospital data is messy, imperfect, and much more challenging: while medical quizzes are usually constructed so that they contain all the information necessary to answer, real data is not like that. Finally, some possible use cases to explore8 (given hospital data as input) and test LLMs on could be:
Diagnosis
Prediction of:
    Adverse reactions
    Complications
    Medications
    Procedures
Warning systems for medication interactions
Care pathways
Digital twins
Random notes
When searching for the academic papers above, every single LLM I tried (Perplexity, Bard, GPT-4) was not that useful. I tried a lot of prompts to get good recommendations and links for the papers I needed, but not much came of it. I did not fully explore the academic writing copilots (e.g. https://scite.ai/ and similar); while they were a bit better for this use case, they were still not great. In the end, I mostly used the standard approach: look for systematic reviews in the area of interest, or use Google Scholar to find a couple of relevant papers and look at every paper that cited them, or that they cite (i.e. walking the graph of citations/references once I had found a couple of relevant papers).
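That citation-graph walk is essentially a breadth-first search from a few seed papers. A toy sketch (the graph here is made up; in practice the neighbours would come from a citation database or API rather than a hard-coded dict):

```python
from collections import deque

def walk_citations(graph, seeds, max_hops=2):
    """Breadth-first walk over a citation graph.

    `graph` maps a paper ID to related papers (citations and references,
    treated as undirected neighbours here). Starting from a few known
    relevant `seeds`, collect every paper within `max_hops` links.
    """
    seen = set(seeds)
    frontier = deque((paper, 0) for paper in seeds)
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(paper, []):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

# Toy graph: each key links to the papers it cites / is cited by.
graph = {
    "review-A": ["paper-1", "paper-2"],
    "paper-1": ["paper-3"],
    "paper-3": ["paper-4"],
}
print(sorted(walk_citations(graph, ["review-A"], max_hops=2)))
# ['paper-1', 'paper-2', 'paper-3', 'review-A']
```

With `max_hops=2`, "paper-4" stays out of the result (it is three links away from the seed); raising the hop limit trades precision for recall, which matches the manual experience of the search getting noisier the further you walk from the seed papers.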
Papers that are not referenced but helped in writing:
A Bibliometric Analysis of the Rise of ChatGPT in Medical Research — https://www.mdpi.com/2076-3271/11/3/61
Thank you for reading,
Zeljko
https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-018-0623-9
https://www.sciencedirect.com/science/article/pii/S1053077024000909
https://pubmed.ncbi.nlm.nih.gov/38098921/
https://www.nature.com/articles/s41746-024-01010-1
Given that Gemini came out recently, I assume we can expect a MedGemini soon, and it will work even better for MedQA stuff.
Directly training on hospital data is hard; it is very messy and full of redundancies. Have a look at the Foresight paper if you want to see all the things I had to do to train an LLM on hospital data (https://arxiv.org/abs/2212.08072).
https://arxiv.org/abs/2203.03540
"Explore" in this case means to research these use cases, not to deploy LLMs in hospitals and use them on real patients.
Nice text! I would consider referring to foundation models that are being built from EHRs, even if they don’t use text (yet). Check https://www.nature.com/articles/s41746-023-00879-8