Authors: Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun
Published at: NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)
python 3.9
torch 2.3.0
transformers 4.44.0
peft 0.10.0
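To confirm your environment matches these pins, a quick sanity check (assuming the packages are already installed):

```python
import peft
import torch
import transformers

# Print installed versions to compare against the pins above
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```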
The MIMIC-Instr dataset will be hosted on PhysioNet once the preparation and review process is complete.
A sample dataset generated from the MIMIC-IV Demo database is available in the sample_data directory.
For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.
The pre-trained model checkpoints can be found on the Hugging Face model hub: zzachw12/llemr-v1.
You can load the model using the following code snippet:
```python
from peft import PeftModel

from src.model.init_llemr import init_llemr

# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"

# Initialize the base model and tokenizer
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)
```
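Before running inference, you will typically move the model to a GPU and switch it to evaluation mode; the device and dtype below are common defaults, not requirements of this repository:

```python
import torch

# Assumed setup: fp16 on GPU if available, otherwise fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
model = model.to(device=device, dtype=dtype)
model.eval()
```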
Note: This model requires pre-computed event embeddings generated by BiomedBERT. Follow the steps in Evaluate to preprocess the data, generate the responses, and evaluate the model.
Download the MIMIC-Instr dataset from PhysioNet
Run steps 1, 4, 7, 8 in Data Generation to prepare the event sequence data and pre-compute the event embeddings
Generate model responses with query_llemr.ipynb
Compare the model responses against the GPT-4 reference answers with eval.ipynb (requires the Azure OpenAI service; see the sketch after this list)
Summarize the results with summary_eval.ipynb
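eval.ipynb scores model responses against the GPT-4 reference answers through Azure OpenAI. A minimal sketch of that call pattern (the endpoint, API version, deployment name, and prompts are placeholders, not the notebook's actual values):

```python
import os

from openai import AzureOpenAI

# Placeholder configuration; eval.ipynb defines the actual settings
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

response = client.chat.completions.create(
    model="gpt-4",  # Azure deployment name (placeholder)
    messages=[
        {"role": "system", "content": "You are a strict grader of clinical QA answers."},
        {"role": "user", "content": "Score the candidate answer against the reference answer."},
    ],
)
print(response.choices[0].message.content)
```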
Download the MIMIC-Instr dataset from PhysioNet
Run steps 1, 4, 7, 8 in Data Generation to prepare the event sequence data and pre-compute the event embeddings
Run the training script train.py via its shell launcher (see the LoRA sketch after this list):

```bash
sh src/train/train.sh
```
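Fine-tuning attaches LoRA adapters to the base model via peft. The sketch below shows the general shape of that setup; the rank, alpha, dropout, and target modules are illustrative placeholders, and the actual hyperparameters are set in src/train/train.sh:

```python
from peft import LoraConfig, get_peft_model

from src.model.init_llemr import init_llemr

# Initialize the base model as in the checkpoint-loading snippet above
model, tokenizer = init_llemr("lmsys/vicuna-7b-v1.5", hidden_size=1027)

# Illustrative LoRA hyperparameters; the real values live in src/train/train.sh
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```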
Download the MIMIC-IV dataset into the raw_data directory
Download the MIMIC-IV-Note dataset into the raw_data directory
Run the following Jupyter notebook to select the patient cohort: 01_cohort_selection.ipynb
Run the following Jupyter notebooks to prepare the event sequence data:
Run the following Jupyter notebooks to generate the instruction tuning data:
Split the data into train, validation, and test sets:
Pre-compute the event embeddings with 06_precompute_event_embeddings.py (see the embedding sketch after this list):

```bash
sh src/preprocess/precompute_event_embeddings.sh
```
Generate the GPT-4 reference answers with query_gpt4.ipynb
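For reference, the pre-computed event embeddings come from BiomedBERT, with event timestamps and numeric values concatenated to the text embedding (as described in the improvements list below). The following is a hedged sketch of that idea; the exact feature layout, which produces the 1027-dimensional inputs expected by init_llemr, is defined in 06_precompute_event_embeddings.py:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Event encoder used by this repo (per the improvements noted below)
encoder_name = "microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)

event_text = "labevents creatinine 1.4 mg/dL"  # hypothetical event string
inputs = tokenizer(event_text, return_tensors="pt")
with torch.no_grad():
    text_embedding = encoder(**inputs).last_hidden_state[:, 0]  # (1, 1024) CLS vector

# Assumed extra features (e.g., event time and numeric value); the actual
# layout filling out the 1027 dimensions follows 06_precompute_event_embeddings.py
extra_features = torch.tensor([[12.5, 1.4, 1.0]])  # placeholder scalars
event_embedding = torch.cat([text_embedding, extra_features], dim=-1)  # (1, 1027)
```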
This repository incorporates several minor improvements over the original implementation described in the paper:
Upgraded Event Encoder:
Replaced ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) with BiomedBERT-large (microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract), improving the quality of event embeddings
Improved Event Embedding:
Concatenated event timestamps and numeric values (where available) to the final event embeddings, resulting in better representation of time-sensitive and quantitative data
Expanded Dataset:
Increased the size of the clinical reasoning subset to 100K examples, doubling the data from the original 50K subset for more comprehensive coverage.
Unified Training Approach:
Merged the original two-stage training procedure into a single stage, simplifying the training pipeline.
These changes collectively improve the model's ability to interpret and reason over EHR data, yielding better performance than the original implementation.
If you find this work useful, please cite:
@inproceedings{
wu2024instruction,
title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=Dgy5WVgPd2}
}
* Note: The teaser image above the title was generated by ChatGPT.