<p align="center">
<img src="images/logo.png" width="40%">
</p>

# Instruction Tuning Large Language Models to Understand Electronic Health Records

**Authors:** Zhenbang Wu, Anant Dadu, Michael Nalls, Faraz Faghri, Jimeng Sun

**Published at:** NeurIPS 2024 Datasets and Benchmarks Track (Spotlight)

[[📑Paper](https://openreview.net/pdf?id=Dgy5WVgPd2)] [[🪧Poster](./poster.pdf)] [[📽️Slides](./slides.pdf)]


## Release
- [19 Dec 2024] The trained model weights are released
- [11 Dec 2024] A sample dataset with ~100 patients is added
- [11 Dec 2024] Code for dataset creation, model training, and response evaluation is released


## Contents
- [Core Dependencies](#core-dependencies)
- [Data Download](#data-download)
- [Model Download](#model-download)
- [Evaluate](#evaluate)
- [Train](#train)
- [Dataset Creation](#dataset-creation)
- [Notes on Model Enhancements](#notes-on-model-enhancements)
- [Citation](#citation)


## Core Dependencies
```
python 3.9
torch 2.3.0
transformers 4.44.0
peft 0.10.0
```
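
Assuming these pins correspond to the PyPI packages of the same name (for `torch` you may need a build matching your CUDA version), they can be captured in a `requirements.txt`:

```
torch==2.3.0
transformers==4.44.0
peft==0.10.0
```

Python 3.9 itself is provided by your environment manager (e.g. conda), not pip.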


## Data Download

<p align="center">
<img src="images/dataset.png" width="100%">
</p>

The **MIMIC-Instr** dataset will be hosted on [PhysioNet](https://physionet.org/) once the preparation and review process is complete.

A sample dataset generated from the [MIMIC-IV Demo](https://physionet.org/content/mimic-iv-demo/2.2/) database is available in the `sample_data` directory.

For early access to the full dataset, please reach out to Zhenbang Wu (zw12@illinois.edu) with your CITI training report.


## Model Download

<p align="center">
<img src="images/model.png" width="100%">
</p>

The pre-trained model checkpoints can be found on the Hugging Face model hub: [zzachw12/llemr-v1](https://huggingface.co/zzachw12/llemr-v1).

You can load the model using the following code snippet:

```python
from peft import PeftModel

from src.model.init_llemr import init_llemr  # run from the repository root

# Define paths for the base model and LoRA weights
llm_pretrained_model_name_or_path = "lmsys/vicuna-7b-v1.5"
lora_name_or_path = "zzachw12/llemr-v1"

# Initialize the base model and tokenizer
# (hidden_size must match the dimensionality of the pre-computed event embeddings)
model, tokenizer = init_llemr(llm_pretrained_model_name_or_path, hidden_size=1027)

# Integrate the LoRA weights into the model
model = PeftModel.from_pretrained(model, lora_name_or_path)
```

**Note:** This model requires pre-computed event embeddings generated by BiomedBERT. Follow [Evaluate](#evaluate) to preprocess the data, generate the response, and evaluate the model.


## Evaluate

1. Download the MIMIC-Instr dataset from PhysioNet
2. Run steps 1, 4, 7, and 8 in [Dataset Creation](#dataset-creation) to prepare the event sequence data and pre-compute the event embeddings
3. Generate the model response with [query_llemr.ipynb](src/eval/query_llemr.ipynb)
4. Compare the model response against the GPT-4 reference answer with [eval.ipynb](src/eval/eval.ipynb) (requires the Azure OpenAI service)
5. Summarize the results with [summary_eval.ipynb](src/eval/summary_eval.ipynb)


## Train

1. Download the MIMIC-Instr dataset from PhysioNet
2. Run steps 1, 4, 7, and 8 in [Dataset Creation](#dataset-creation) to prepare the event sequence data and pre-compute the event embeddings
3. Run the training script [train.py](src/train/train.py):
   - CMD: `sh src/train/train.sh`


## Dataset Creation

1. Download the [MIMIC-IV](https://physionet.org/content/mimiciv/2.2/) dataset into the `raw_data` directory
2. Download the [MIMIC-IV-Note](https://physionet.org/content/mimic-iv-note/2.2/) dataset into the `raw_data` directory
3. Run the following Jupyter notebook to select the patient cohort: [01_cohort_selection.ipynb](src/preprocess/01_cohort_selection.ipynb)
4. Run the following Jupyter notebooks to prepare the event sequence data:
   1. Extract events:
      - [02_event_static.ipynb](src/preprocess/02_event_static.ipynb)
      - [02_event_hosp_diagnoses_icd.ipynb](src/preprocess/02_event_hosp_diagnoses_icd.ipynb)
      - [02_event_hosp_labevents.ipynb](src/preprocess/02_event_hosp_labevents.ipynb)
      - [02_event_hosp_microbiologyevents.ipynb](src/preprocess/02_event_hosp_microbiologyevents.ipynb)
      - [02_event_hosp_prescriptions.ipynb](src/preprocess/02_event_hosp_prescriptions.ipynb)
      - [02_event_hosp_transfers.ipynb](src/preprocess/02_event_hosp_transfers.ipynb)
      - [02_event_icu_chartevents.ipynb](src/preprocess/02_event_icu_chartevents.ipynb)
      - [02_event_icu_inputevents.ipynb](src/preprocess/02_event_icu_inputevents.ipynb)
      - [02_event_icu_outputevents.ipynb](src/preprocess/02_event_icu_outputevents.ipynb)
      - [02_event_icu_procedureevents.ipynb](src/preprocess/02_event_icu_procedureevents.ipynb)
   2. Merge events: [03_merge_events.ipynb](src/preprocess/03_merge_events.ipynb)
5. Run the following Jupyter notebooks to generate the instruction tuning data (only needed if you want to regenerate the data yourself):
   1. Generate the schema alignment subset:
      - [04_template_qa_event.ipynb](src/preprocess/04_template_qa_event.ipynb)
      - [04_paraphrase_qa_event.ipynb](src/preprocess/04_paraphrase_qa_event.ipynb) (requires the Azure OpenAI service)
   2. Generate the instruction following subset:
      - [04_generate_qa_note.ipynb](src/preprocess/04_generate_qa_note.ipynb) (requires the Azure OpenAI service)
6. Split the data into train, validation, and test sets with [05_data_split.ipynb](src/preprocess/05_data_split.ipynb)
7. Pre-compute the event embeddings with [06_precompute_event_embeddings.py](src/preprocess/06_precompute_event_embeddings.py):
   - CMD: `sh src/preprocess/precompute_event_embeddings.sh`
8. Generate the GPT-4 reference answer with [query_gpt4.ipynb](src/eval/query_gpt4.ipynb)
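
As an illustration of step 6, a patient-level split might look like the sketch below; the 8:1:1 ratios, the grouping by patient ID, and the function name are assumptions for illustration only — the actual logic lives in `05_data_split.ipynb`:

```python
import random

def split_patients(patient_ids, seed=42, train_frac=0.8, val_frac=0.1):
    """Illustrative patient-level train/val/test split (assumed 8:1:1 ratios).

    Splitting by patient (rather than by example) keeps all of one
    patient's data in a single partition, avoiding leakage.
    """
    ids = sorted(patient_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_patients(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```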


## Notes on Model Enhancements

This repository incorporates several minor improvements over the original implementation described in the paper:

1. **Enhanced Event Encoder:**
   - Replaced ClinicalBERT (`emilyalsentzer/Bio_ClinicalBERT`) with BiomedBERT-large (`microsoft/BiomedNLP-BiomedBERT-large-uncased-abstract`), improving the quality of the event embeddings
2. **Improved Event Embedding:**
   - Concatenated event timestamps and numeric values (where available) to the final event embeddings, yielding a better representation of time-sensitive and quantitative data
3. **Expanded Dataset:**
   - Increased the clinical reasoning subset from 50K to 100K examples for more comprehensive coverage
4. **Unified Training Approach:**
   - Adopted a single-step training process that trains on the schema alignment and clinical reasoning subsets jointly, streamlining the training pipeline
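
The embedding change in item 2 amounts to appending scalar features to each event's text embedding. A minimal sketch, assuming a 1024-dimensional BiomedBERT-large embedding and a zero placeholder for missing values (the exact feature layout in the repository may differ):

```python
import numpy as np

def augment_event_embedding(text_emb, timestamp, value):
    """Append scalar features (timestamp, numeric value) to a text embedding.

    `value` may be None for events without a numeric measurement; a zero
    placeholder is used here (the real feature layout is an assumption).
    """
    extra = np.array([timestamp, value if value is not None else 0.0])
    return np.concatenate([text_emb, extra])

# A 1024-dim embedding gains two scalar features
emb = augment_event_embedding(np.zeros(1024), timestamp=3.5, value=7.2)
print(emb.shape)  # (1026,)
```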

Together, these changes improve the model's ability to interpret and reason over EHR data compared with the original implementation.


## Citation

If you find this work useful, please cite:
```
@inproceedings{wu2024instruction,
  title={Instruction Tuning Large Language Models to Understand Electronic Health Records},
  author={Zhenbang Wu and Anant Dadu and Michael Nalls and Faraz Faghri and Jimeng Sun},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=Dgy5WVgPd2}
}
```

\* Note: The teaser image above the title was generated by ChatGPT.