a/README.md | b/README.md
---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: NIH-CXR14
paperswithcode_id: chestx-ray14
size_categories:
- 100K<n<1M
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
---

# Dataset Card for NIH Chest X-ray dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [License and attribution](#license-and-attribution)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345)
- **Repository:**
- **Paper:** [ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases](https://arxiv.org/abs/1705.02315)
- **Leaderboard:**
- **Point of Contact:** rms@nih.gov

### Dataset Summary

_The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, with fourteen text-mined disease labels (each image can have multiple labels) extracted from the associated radiological reports using natural language processing. The fourteen common thoracic pathologies are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, an extension of the 8 common disease patterns listed in our CVPR2017 paper. Note that the original radiology reports (associated with these chest x-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. Please find more details and benchmark performance of trained models based on the 14 disease labels in our arXiv paper: [1705.02315](https://arxiv.org/abs/1705.02315)_

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/95db46f21d556880cf0ecb11d45d5ba0b58fcb113c9a0fff2234eba8f74fe22a/images/00000798_022.png',
 'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1024x1024 at 0x7F2151B144D0>,
 'labels': [9, 3]}
```

### Data Fields

The data instances have the following fields:

- `image_file_path`: a `str` with the path to the image file.
- `image`: a `PIL.Image.Image` object containing the image. Accessing the image column, e.g. `dataset[0]["image"]`, decodes the image file automatically. Decoding a large number of image files can take a significant amount of time, so query the sample index before the `"image"` column: `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `labels`: a list of `int` class labels. An image can carry more than one label, as in the sample above (`[9, 3]`).

<details>
<summary>Class Label Mappings</summary>

```json
{
  "No Finding": 0,
  "Atelectasis": 1,
  "Cardiomegaly": 2,
  "Effusion": 3,
  "Infiltration": 4,
  "Mass": 5,
  "Nodule": 6,
  "Pneumonia": 7,
  "Pneumothorax": 8,
  "Consolidation": 9,
  "Edema": 10,
  "Emphysema": 11,
  "Fibrosis": 12,
  "Pleural_Thickening": 13,
  "Hernia": 14
}
```

</details>

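As a minimal sketch of how the class label mapping can be used, the snippet below turns the integer ids of an instance back into class names and into a fixed-length multi-hot vector, a common target format for multi-label training. The helper names `decode_labels` and `to_multi_hot` are hypothetical and not part of the dataset or of any library.

```python
# Mapping copied from the "Class Label Mappings" block of this card.
LABEL2ID = {
    "No Finding": 0, "Atelectasis": 1, "Cardiomegaly": 2, "Effusion": 3,
    "Infiltration": 4, "Mass": 5, "Nodule": 6, "Pneumonia": 7,
    "Pneumothorax": 8, "Consolidation": 9, "Edema": 10, "Emphysema": 11,
    "Fibrosis": 12, "Pleural_Thickening": 13, "Hernia": 14,
}
ID2LABEL = {index: name for name, index in LABEL2ID.items()}


def decode_labels(label_ids):
    """Translate a list of integer ids into class names (hypothetical helper)."""
    return [ID2LABEL[i] for i in label_ids]


def to_multi_hot(label_ids, num_classes=len(LABEL2ID)):
    """Encode a list of label ids as a 0/1 vector of length num_classes."""
    vector = [0] * num_classes
    for i in label_ids:
        vector[i] = 1
    return vector


sample_labels = [9, 3]  # the `labels` value of the sample instance
print(decode_labels(sample_labels))  # ['Consolidation', 'Effusion']
print(to_multi_hot(sample_labels))
```

The multi-hot form keeps every instance the same length regardless of how many findings it carries.
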
**Label distribution on the dataset:**

| labels             |   obs |       freq |
|:-------------------|------:|-----------:|
| No Finding         | 60361 | 0.426468   |
| Infiltration       | 19894 | 0.140557   |
| Effusion           | 13317 | 0.0940885  |
| Atelectasis        | 11559 | 0.0816677  |
| Nodule             |  6331 | 0.0447304  |
| Mass               |  5782 | 0.0408515  |
| Pneumothorax       |  5302 | 0.0374602  |
| Consolidation      |  4667 | 0.0329737  |
| Pleural_Thickening |  3385 | 0.023916   |
| Cardiomegaly       |  2776 | 0.0196132  |
| Emphysema          |  2516 | 0.0177763  |
| Edema              |  2303 | 0.0162714  |
| Fibrosis           |  1686 | 0.0119121  |
| Pneumonia          |  1431 | 0.0101104  |
| Hernia             |   227 | 0.00160382 |

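The `obs` counts sum to more than the 112,120 images because an image can carry several labels, so `freq` appears to be each label's share of all label occurrences rather than the share of images. The sketch below recomputes the column under that assumption; the function name `label_frequencies` is illustrative only.

```python
# Observation counts copied from the "Label distribution on the dataset" table.
LABEL_OBS = {
    "No Finding": 60361, "Infiltration": 19894, "Effusion": 13317,
    "Atelectasis": 11559, "Nodule": 6331, "Mass": 5782,
    "Pneumothorax": 5302, "Consolidation": 4667, "Pleural_Thickening": 3385,
    "Cardiomegaly": 2776, "Emphysema": 2516, "Edema": 2303,
    "Fibrosis": 1686, "Pneumonia": 1431, "Hernia": 227,
}


def label_frequencies(obs):
    """Return each label's share of the total number of label occurrences."""
    total = sum(obs.values())
    return {label: count / total for label, count in obs.items()}


total_labels = sum(LABEL_OBS.values())
freqs = label_frequencies(LABEL_OBS)

print(total_labels)  # 141537
for label, freq in freqs.items():
    print(f"{label:<20} {freq:.6f}")
```
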
### Data Splits

|             |train| test|
|-------------|----:|----:|
|# of examples|86524|25596|

**Label distribution by dataset split:**

| labels             | Train obs | Train freq | Test obs | Test freq  |
|:-------------------|----------:|-----------:|---------:|-----------:|
| No Finding         |     50500 | 0.483392   |     9861 | 0.266032   |
| Infiltration       |     13782 | 0.131923   |     6112 | 0.164891   |
| Effusion           |      8659 | 0.082885   |     4658 | 0.125664   |
| Atelectasis        |      8280 | 0.0792572  |     3279 | 0.0884614  |
| Nodule             |      4708 | 0.0450656  |     1623 | 0.0437856  |
| Mass               |      4034 | 0.038614   |     1748 | 0.0471578  |
| Consolidation      |      2852 | 0.0272997  |     1815 | 0.0489654  |
| Pneumothorax       |      2637 | 0.0252417  |     2665 | 0.0718968  |
| Pleural_Thickening |      2242 | 0.0214607  |     1143 | 0.0308361  |
| Cardiomegaly       |      1707 | 0.0163396  |     1069 | 0.0288397  |
| Emphysema          |      1423 | 0.0136211  |     1093 | 0.0294871  |
| Edema              |      1378 | 0.0131904  |      925 | 0.0249548  |
| Fibrosis           |      1251 | 0.0119747  |      435 | 0.0117355  |
| Pneumonia          |       876 | 0.00838518 |      555 | 0.0149729  |
| Hernia             |       141 | 0.00134967 |       86 | 0.00232012 |

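The per-split frequencies differ noticeably between train and test, which is worth checking before comparing metrics across splits. The following sketch recomputes the frequencies from the `obs` columns and ranks labels by the absolute gap between splits; it assumes, as in the table, that each frequency is a share of the label occurrences within its split. The names `SPLIT_OBS` and `split_frequencies` are hypothetical.

```python
# Observation counts copied from the per-split table: label -> (train, test).
SPLIT_OBS = {
    "No Finding": (50500, 9861), "Infiltration": (13782, 6112),
    "Effusion": (8659, 4658), "Atelectasis": (8280, 3279),
    "Nodule": (4708, 1623), "Mass": (4034, 1748),
    "Consolidation": (2852, 1815), "Pneumothorax": (2637, 2665),
    "Pleural_Thickening": (2242, 1143), "Cardiomegaly": (1707, 1069),
    "Emphysema": (1423, 1093), "Edema": (1378, 925),
    "Fibrosis": (1251, 435), "Pneumonia": (876, 555), "Hernia": (141, 86),
}


def split_frequencies(split_obs):
    """Return label -> (train share, test share) of label occurrences."""
    train_total = sum(train for train, _ in split_obs.values())
    test_total = sum(test for _, test in split_obs.values())
    return {
        label: (train / train_total, test / test_total)
        for label, (train, test) in split_obs.items()
    }


split_freqs = split_frequencies(SPLIT_OBS)

# Rank labels by how far their share moves between the two splits.
by_gap = sorted(
    split_freqs,
    key=lambda label: abs(split_freqs[label][0] - split_freqs[label][1]),
    reverse=True,
)
for label in by_gap[:3]:
    train_freq, test_freq = split_freqs[label]
    print(f"{label:<15} train={train_freq:.4f} test={test_freq:.4f}")
```
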
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### License and attribution

There are no restrictions on the use of the NIH chest x-ray images. However, the dataset has the following attribution requirements:

- Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
- Include a citation to the CVPR 2017 paper (see the Citation Information section)
- Acknowledge that the NIH Clinical Center is the data provider

### Citation Information

```
@inproceedings{Wang_2017,
  doi = {10.1109/cvpr.2017.369},
  url = {https://doi.org/10.1109%2Fcvpr.2017.369},
  year = 2017,
  month = {jul},
  publisher = {{IEEE}},
  author = {Xiaosong Wang and Yifan Peng and Le Lu and Zhiyong Lu and Mohammadhadi Bagheri and Ronald M. Summers},
  title = {{ChestX}-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
  booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})}
}
```

### Contributions

Thanks to [@alcazar90](https://github.com/alcazar90) for adding this dataset.