# EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images

*A multi-modal question answering dataset that combines structured Electronic Health Records (EHRs) and chest X-ray images, designed to facilitate joint reasoning across imaging and table modalities in EHR Question Answering (QA) systems.*

## Overview

Electronic Health Records (EHRs) contain patients' medical histories in various multi-modal formats, yet the potential for joint reasoning across imaging and table modalities remains underexplored in current EHR Question Answering (QA) systems. We introduce EHRXQA, a novel multi-modal question answering dataset that combines structured EHRs and chest X-ray images. To develop the dataset, we first construct two uni-modal resources: 1) MIMIC-CXR-VQA, our newly created medical visual question answering (VQA) benchmark, specifically designed to augment the imaging modality in EHR QA, and 2) EHRSQL (MIMIC-IV), a refashioned version of a previously established table-based EHR QA dataset. By integrating these two uni-modal resources, we construct a multi-modal EHR QA dataset that requires both uni-modal and cross-modal reasoning. To address the unique challenges of multi-modal questions over EHRs, we propose a NeuralSQL-based strategy equipped with an external VQA API. We believe our dataset can catalyze advances in real-world medical scenarios such as clinical decision-making and research.
## Updates

- [07/24/2024] We released the [EHRXQA dataset](https://physionet.org/content/ehrxqa/1.0.0/) on PhysioNet.
- [12/12/2023] We presented our work at the NeurIPS 2023 Datasets and Benchmarks Track as a [poster](https://neurips.cc/virtual/2023/poster/73600).
- [10/28/2023] We released our paper on [arXiv](https://arxiv.org/abs/2310.18652).
## Features

- [x] Provide a script to download the source datasets (MIMIC-CXR-JPG, Chest ImaGenome, and MIMIC-IV) from PhysioNet.
- [x] Provide a script to preprocess the source datasets.
- [x] Provide a script to construct an integrated database (MIMIC-IV and MIMIC-CXR).
- [x] Provide a script to generate the EHRXQA dataset (with answer information).
## Installation

### For Linux

Ensure that you have Python 3.8.5 or higher installed on your machine. Set up the environment and install the required packages using the commands below:

```
# Set up the environment
conda create --name ehrxqa python=3.8.5

# Activate the environment
conda activate ehrxqa

# Install required packages
pip install pandas==1.1.3 tqdm==4.65.0 scikit-learn==0.23.2
pip install dask==2022.12.1
```
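As a quick sanity check of the environment, you can confirm that the pinned packages import at the expected versions. A minimal sketch (the version strings simply mirror the pins above):

```
# check_env.py - verify the pinned packages import at the expected versions
import importlib

EXPECTED = {
    "pandas": "1.1.3",
    "tqdm": "4.65.0",
    "sklearn": "0.23.2",   # installed as scikit-learn
    "dask": "2022.12.1",
}

for module_name, expected in EXPECTED.items():
    module = importlib.import_module(module_name)
    actual = module.__version__
    status = "OK" if actual == expected else f"MISMATCH (expected {expected})"
    print(f"{module_name:<8} {actual:<10} {status}")
```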
## Setup

Clone this repository and navigate into it:

```
git clone https://github.com/baeseongsu/ehrxqa.git
cd ehrxqa
```
## Usage

### Privacy

We take data privacy seriously. All data accessible through this repository has been carefully prepared to prevent privacy breaches or data leakage, so you can use it with confidence.
### Access Requirements

The EHRXQA dataset is constructed from MIMIC-CXR-JPG (v2.0.0), Chest ImaGenome (v1.0.0), and MIMIC-IV (v2.2), all of which require credentialed access on PhysioNet. In adherence to their Data Use Agreements (DUAs), only credentialed users can access the EHRXQA dataset files (see the Access Policy). To access the source datasets, you must fulfill all of the following requirements:

1. Be a [credentialed user](https://physionet.org/settings/credentialing/)
    - If you do not have a PhysioNet account, register for one [here](https://physionet.org/register/).
    - Follow these [instructions](https://physionet.org/credential-application/) for credentialing on PhysioNet.
    - Complete the "CITI Data or Specimens Only Research" [training course](https://physionet.org/about/citi-course/).
2. Sign the data use agreement (DUA) for each project
    - https://physionet.org/sign-dua/mimic-cxr-jpg/2.0.0/
    - https://physionet.org/sign-dua/chest-imagenome/1.0.0/
    - https://physionet.org/sign-dua/mimiciv/2.2/
### Accessing the EHRXQA Dataset

The complete EHRXQA dataset is available on [PhysioNet](https://physionet.org/content/ehrxqa/1.0.0/) (see Updates). For credentialed users, this repository additionally provides partial access to the dataset, together with a script that rebuilds the complete dataset locally.

To build the EHRXQA dataset, run the provided main script (which requires your PhysioNet credentials):

```
bash build_dataset.sh
```

During script execution, enter your PhysioNet credentials when prompted:

- Username: Enter your PhysioNet username and press `Enter`.
- Password: Enter your PhysioNet password and press `Enter`. The password characters won't appear on screen.

This script performs several actions: 1) it downloads the source datasets from PhysioNet, 2) preprocesses these datasets, and 3) generates the complete EHRXQA dataset by creating ground-truth answer information.

Keep your credentials secure. If you encounter any issues, make sure that you have the necessary permissions, a stable internet connection, and all prerequisite tools installed.
### Downloading MIMIC-CXR-JPG Images

<!---
To enhance user convenience, we will provide a script that allows you to download only the CXR images relevant to the EHRXQA dataset, rather than downloading all the MIMIC-CXR-JPG images.

```
bash download_images.sh
```

During script execution, enter your PhysioNet credentials when prompted:

- Username: Enter your PhysioNet username and press `Enter`.
- Password: Enter your PhysioNet password and press `Enter`. The password characters won't appear on screen.

This script performs several actions: 1) it reads the image paths from the JSON files of the EHRXQA dataset; 2) uses these paths to download the corresponding images from the MIMIC-CXR-JPG dataset hosted on PhysioNet; and 3) saves these images locally in the corresponding directories as per their paths.
--->
### Dataset Structure

The dataset is structured as follows:

```
ehrxqa
└── dataset
    ├── _train.json
    ├── _valid.json
    ├── _test.json
    ├── train.json (available post-script execution)
    ├── valid.json (available post-script execution)
    └── test.json  (available post-script execution)
```
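Each of these files is a JSON list of QA instances (the `_`-prefixed files ship with the repository; the full files appear after running the main script, as described below). As a quick sanity check, a minimal sketch that loads whichever splits are present and prints their sizes:

```
# inspect_splits.py - print the number of QA instances in each available split
import json
from pathlib import Path

DATASET_DIR = Path("dataset")

for split in ["train", "valid", "test"]:
    # Prefer the full files; fall back to the pre-release "_" versions.
    for name in [f"{split}.json", f"_{split}.json"]:
        path = DATASET_DIR / name
        if path.exists():
            with open(path) as f:
                data = json.load(f)
            print(f"{name}: {len(data)} instances")
            break
```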
- `ehrxqa` is the root directory. Within it, the `dataset` directory contains the JSON files that make up the EHRXQA dataset.
- `_train.json`, `_valid.json`, and `_test.json` are pre-release versions of the training, validation, and test sets, respectively. These versions are intentionally incomplete to safeguard privacy and prevent the leakage of sensitive information; in particular, they do not include the answers.
- Once the main script is executed with valid PhysioNet credentials, the full versions of these files (`train.json`, `valid.json`, and `test.json`) will be generated. These contain the complete information, including the answer for each entry.
### Dataset Description

The QA samples in the EHRXQA dataset are stored in individual .json files. Each file contains a list of JSON objects (Python dictionaries once loaded), with the following keys:

- `db_id`: A string representing the corresponding database ID.
- `split`: The dataset split category (e.g., train, valid, test).
- `id`: A unique identifier for each instance in the dataset.
- `question`: A paraphrased version of the question.
- `template`: The final question template, created by injecting real database values into the tag. This template represents the fully specified and contextualized form of the question.
- `query`: The corresponding NeuralSQL/SQL query for the question.
- `value`: Specific key-value pairs relevant to the question, sampled from the database.
- `q_tag`: The initial sampled question template. This serves as the foundational structure for the question.
- `t_tag`: Sampled time templates, used to provide temporal context and specificity to the question.
- `o_tag`: Sampled operational values for the query, often encompassing numerical or computational aspects required for forming the question.
- `v_tag`: Sampled visual values, which include elements such as object, category, attribute, and comparison, adding further detail to the question.
- `tag`: A comprehensive tag that combines the q_tag with the additional elements (t_tag, o_tag, v_tag). This represents an intermediate, more specified version of the question template before the final template is formed.
- `para_type`: The source of the paraphrase, either a general machine-paraphrasing tool or GPT-4.
- `is_impossible`: A boolean indicating whether the question is answerable based on the dataset.
- `_gold_program`: A temporary program that is used to generate the answer.

After validating your PhysioNet credentials, the `dataset_builder/generate_answer.py` script generates the following item:

- `answer`: The answer string produced by executing the query.

To be specific, here is an example instance:
```
{
    'db_id': 'mimic_iv_cxr',
    'split': 'train',
    'id': 0,
    'question': 'how many days have passed since the last chest x-ray of patient 18679317 depicting any anatomical findings in 2105?',
    'template': 'how many days have passed since the last time patient 18679317 had a chest x-ray study indicating any anatomicalfinding in 2105?',
    'query': 'select 1 * ( strftime(\'%J\',current_time) - strftime(\'%J\',t1.studydatetime) ) from ( select tb_cxr.study_id, tb_cxr.studydatetime from tb_cxr where tb_cxr.study_id in ( select distinct tb_cxr.study_id from tb_cxr where tb_cxr.subject_id = 18679317 and strftime(\'%Y\',tb_cxr.studydatetime) = \'2105\' ) ) as t1 where func_vqa("is the chest x-ray depicting any anatomical findings?", t1.study_id) = true',
    'value': {'patient_id': 18679317},
    'q_tag': 'how many [unit_count] have passed since the [time_filter_exact1] time patient {patient_id} had a chest x-ray study indicating any ${category} [time_filter_global1]?',
    't_tag': ['abs-year-in', '', '', 'exact-last', ''],
    'o_tag': {'unit_count': {'nlq': 'days', 'sql': '1 * ', 'type': 'days', 'sql_pattern': '[unit_count]'}},
    'v_tag': {'object': [], 'category': ['anatomicalfinding'], 'attribute': []},
    'tag': 'how many [unit_count:days] have passed since the [time_filter_exact1:exact-last] time patient {patient_id} had a chest x-ray study indicating any anatomicalfinding [time_filter_global1:abs-year-in]?',
    'para_type': 'machine',
    'is_impossible': False,
    'answer': 'Will be generated by dataset_builder/generate_answer.py'
}
```
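The `query` above embeds a `func_vqa(...)` call inside otherwise ordinary SQL; this is what makes a question cross-modal, and it is the hook through which a NeuralSQL executor dispatches sub-questions to an external VQA API. As a rough, illustrative sketch (not the repository's executor; `run_vqa_model` is a hypothetical stand-in for a real VQA model), you could partition a split by modality and extract the embedded VQA sub-questions:

```
# neuralsql_inspect.py - separate NeuralSQL (cross-modal) queries from plain SQL,
# and extract the embedded VQA sub-questions from func_vqa(...) calls.
import json
import re

# Matches the first (string) argument of func_vqa("...", ...)
FUNC_VQA_PATTERN = re.compile(r'func_vqa\(\s*"([^"]+)"')

with open("dataset/_train.json") as f:
    samples = json.load(f)

neural_sql = [s for s in samples if "func_vqa" in s["query"]]
plain_sql = [s for s in samples if "func_vqa" not in s["query"]]
print(f"cross-modal (NeuralSQL): {len(neural_sql)}, table-only (SQL): {len(plain_sql)}")

for sample in neural_sql[:3]:
    for vqa_question in FUNC_VQA_PATTERN.findall(sample["query"]):
        # A real executor would call something like:
        #   answer = run_vqa_model(vqa_question, study_id)  # hypothetical VQA API
        # and substitute the result back into the SQL before execution.
        print(f"id={sample['id']}: VQA sub-question -> {vqa_question!r}")
```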
## Versioning

We employ semantic versioning for our dataset; the current version is v1.0.0. In general, we maintain and update only the latest version of the dataset. However, when significant updates occur, or when older versions are needed to validate previous research, we may exceptionally retain earlier versions for up to one year. For a detailed list of changes made in each version, see our CHANGELOG.
## Contributing

Contributions that enhance the usability and functionality of this dataset are always welcome. If you are interested in contributing, fork this repository, make your changes, and submit a pull request. For significant changes, please open an issue first to discuss the proposed alterations.
## Contact

For any questions or concerns regarding this dataset, please reach out to us ([seongsu@kaist.ac.kr](mailto:seongsu@kaist.ac.kr) or [kyungdaeun@kaist.ac.kr](mailto:kyungdaeun@kaist.ac.kr)). We appreciate your interest and are eager to assist.
## Acknowledgements

More details will be provided soon.
## Citation

If you use the EHRXQA dataset, we would appreciate a citation of the following:

```
@article{bae2023ehrxqa,
  title={EHRXQA: A Multi-Modal Question Answering Dataset for Electronic Health Records with Chest X-ray Images},
  author={Bae, Seongsu and Kyung, Daeun and Ryu, Jaehee and Cho, Eunbyeol and Lee, Gyubok and Kweon, Sunjun and Oh, Jungwoo and Ji, Lei and Chang, Eric I and Kim, Tackeun and others},
  journal={arXiv preprint arXiv:2310.18652},
  year={2023}
}
```
## License

The code in this repository is provided under the terms of the MIT License. The final output of the dataset created using this code, EHRXQA, is subject to the terms and conditions of the original datasets from PhysioNet: the [MIMIC-CXR-JPG License](https://physionet.org/content/mimic-cxr/view-license/2.0.0/), the [Chest ImaGenome License](https://physionet.org/content/chest-imagenome/view-license/1.0.0/), and the [MIMIC-IV License](https://physionet.org/content/mimiciv/view-license/2.2/).