Authors: Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun
Published at: NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)
python 3.9
torch 2.3.0
transformers 4.44.0
peft 0.10.0
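To confirm your environment matches these pins, a quick sanity check (assuming the packages are already installed):

```python
import peft
import torch
import transformers

# Print installed versions to compare against the pins above
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
```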
The MIMIC-Instr dataset will be hosted on PhysioNet once the preparation and review process is complete.
A sample dataset generated from the MIMIC-IV Demo database is available in the sample_data directory.
For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.
The pre-trained model checkpoints can be found on the Hugging Face model hub: zzachw12/llemr-v1.
You can load the model using the following code snippet:
```python
from peft import PeftModel

from src.model.init_llemr import init_llemr

# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"

# Initialize the base model and tokenizer
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)
```
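Before running inference, you will typically move the model to a GPU and switch it to evaluation mode; the device and dtype below are common defaults, not requirements of this repository:

```python
import torch

# Assumed setup: fp16 on GPU if available, otherwise fp32 on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
model = model.to(device=device, dtype=dtype)
model.eval()
```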
Note: This model requires pre-computed event embeddings generated by BiomedBERT. Follow the steps in Evaluate to preprocess the data, generate the responses, and evaluate the model.
Download the MIMIC-Instr dataset from PhysioNet
Run steps 1, 4, 7, 8 in Data Generation to prepare the event sequence data and pre-compute the event embeddings
Generate model responses with query_llemr.ipynb
Compare the model responses against the GPT-4 reference answers with eval.ipynb (requires the Azure OpenAI service; see the sketch after this list)
Summarize the results with summary_eval.ipynb
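eval.ipynb scores model responses against the GPT-4 reference answers through Azure OpenAI. A minimal sketch of that call pattern (the endpoint, API version, deployment name, and prompts are placeholders, not the notebook's actual values):

```python
import os

from openai import AzureOpenAI

# Placeholder configuration; eval.ipynb defines the actual settings
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
    azure_endpoint="https://<your-resource>.openai.azure.com",
)

response = client.chat.completions.create(
    model="gpt-4",  # Azure deployment name (placeholder)
    messages=[
        {"role": "system", "content": "You are a strict grader of clinical QA answers."},
        {"role": "user", "content": "Score the candidate answer against the reference answer."},
    ],
)
print(response.choices[0].message.content)
```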
Download the MIMIC-Instr dataset from PhysioNet
Run steps 1, 4, 7, 8 in Data Generation to prepare the event sequence data and pre-compute the event embeddings
Run the training script train.py via its shell launcher (see the LoRA sketch after this list):

```bash
sh src/train/train.sh
```
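Fine-tuning attaches LoRA adapters to the base model via peft. The sketch below shows the general shape of that setup; the rank, alpha, dropout, and target modules are illustrative placeholders, and the actual hyperparameters are set in src/train/train.sh:

```python
from peft import LoraConfig, get_peft_model

from src.model.init_llemr import init_llemr

# Initialize the base model as in the checkpoint-loading snippet above
model, tokenizer = init_llemr("lmsys/vicuna-7b-v1.5", hidden_size=1027)

# Illustrative LoRA hyperparameters; the real values live in src/train/train.sh
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # placeholder target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```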
Download the MIMIC-IV dataset into the raw_data directory
Download the MIMIC-IV-Note dataset into the raw_data directory
Run the following Jupyter notebook to select the patient cohort: 01_cohort_selection.ipynb
Run the following Jupyter notebooks to prepare the event sequence data:
Run the following Jupyter notebooks to generate the instruction tuning data:
Split the data into train, validation, and test sets:
Pre-compute the event embeddings with 06_precompute_event_embeddings.py (see the embedding sketch after this list):

```bash
sh src/preprocess/precompute_event_embeddings.sh
```
Generate the GPT-4 reference answers with query_gpt4.ipynb
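For reference, the pre-computed event embeddings come from BiomedBERT, with event timestamps and numeric values concatenated to the text embedding (as described in the improvements list below). The following is a hedged sketch of that idea; the exact feature layout, which produces the 1027-dimensional inputs expected by init_llemr, is defined in 06_precompute_event_embeddings.py:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Event encoder used by this repo (per the improvements noted below)
encoder_name = "microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)

event_text = "labevents creatinine 1.4 mg/dL"  # hypothetical event string
inputs = tokenizer(event_text, return_tensors="pt")
with torch.no_grad():
    text_embedding = encoder(**inputs).last_hidden_state[:, 0]  # (1, 1024) CLS vector

# Assumed extra features (e.g., event time and numeric value); the actual
# layout filling out the 1027 dimensions follows 06_precompute_event_embeddings.py
extra_features = torch.tensor([[12.5, 1.4, 1.0]])  # placeholder scalars
event_embedding = torch.cat([text_embedding, extra_features], dim=-1)  # (1, 1027)
```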
This repository incorporates several minor improvements over the original implementation described in the paper:
Upgraded Event Encoder:
Replaced ClinicalBERT (emilyalsentzer/Bio_ClinicalBERT) with BiomedBERT-large (microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract), improving the quality of event embeddings
Improved Event Embedding:
Concatenated event timestamps and numeric values (where available) to the final event embeddings, resulting in better representation of time-sensitive and quantitative data
Expanded Dataset:
Increased the size of the clinical reasoning subset to 100K examples, doubling the data from the original 50K subset for more comprehensive coverage.
Unified Training Approach:
Merged the original two-stage training procedure into a single stage, simplifying the training pipeline.
These changes collectively improve the model's ability to interpret and reason over EHR data, yielding better performance than the original implementation.
If you find this work useful, please cite:
@inproceedings{
wu2024instruction,
title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=Dgy5WVgPd2}
}
* Note: The teaser image above the title was generated by ChatGPT.