Data: De-identified Image Specialty: Radiology Respiratory Medicine Medical Imaging Technique: X-Ray Medical Imaging Region: Chest Lungs Clinical Purpose: Diagnosis Task: Classification License: Unknown

---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: NIH-CXR14
paperswithcode_id: chestx-ray14
size_categories:
- 100K<n<1M
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
---

# Dataset Card for NIH Chest X-ray dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [License and attribution](#license-and-attribution)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345)
- **Repository:**
- **Paper:** [ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases](https://arxiv.org/abs/1705.02315)
- **Leaderboard:**
- **Point of Contact:** rms@nih.gov

### Dataset Summary

_The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, with fourteen text-mined disease labels (each image can have multiple labels) mined from the associated radiological reports using natural language processing. The fourteen common thoracic pathologies are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, an extension of the 8 common disease patterns listed in our CVPR2017 paper. Note that the original radiology reports (associated with these chest x-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. Please find more details and benchmark performance of trained models based on the 14 disease labels in our arXiv paper: [1705.02315](https://arxiv.org/abs/1705.02315)_

![](https://huggingface.co/datasets/alkzar90/NIH-Chest-X-ray-dataset/resolve/main/data/nih-chest-xray14-portraint.png)

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/95db46f21d556880cf0ecb11d45d5ba0b58fcb113c9a0fff2234eba8f74fe22a/images/00000798_022.png',
 'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1024x1024 at 0x7F2151B144D0>,
 'labels': [9, 3]}
```

### Data Fields

The data instances have the following fields:
- `image_file_path`: a `str` with the path to the image file.
- `image`: a `PIL.Image.Image` object containing the image. When the image column is accessed (for example `dataset[0]["image"]`), the image file is decoded automatically. Decoding a large number of image files can take a significant amount of time, so query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `labels`: a list of `int` class labels. An image can carry more than one label, as in the sample above.
<details>
  <summary>Class Label Mappings</summary>

  ```json
  {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14
  }
  ```
</details>

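Because `labels` holds a list of integer ids, a common first step is to turn those ids back into class names or into a fixed-length 0/1 vector for multi-label training. The following is a minimal sketch of both conversions, using the mapping from this card. The helper names `decode_labels` and `multi_hot` are hypothetical and are not part of the dataset or of any library.

```python
# Label ids as listed in the "Class Label Mappings" block of this card.
LABEL2ID = {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14,
}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def decode_labels(label_ids):
    """Map a list of integer label ids to their class names (hypothetical helper)."""
    return [ID2LABEL[i] for i in label_ids]


def multi_hot(label_ids, num_classes=len(LABEL2ID)):
    """Build a 0/1 vector with a 1 at every label id present (hypothetical helper)."""
    vec = [0] * num_classes
    for i in label_ids:
        vec[i] = 1
    return vec


sample_labels = [9, 3]  # the `labels` value from the sample instance in this card
print(decode_labels(sample_labels))  # ['Consolidation', 'Effusion']
print(multi_hot(sample_labels))
```

The multi-hot form keeps "No Finding" as its own position (index 0). Whether to keep it as a class or treat it as the all-zeros case is a modelling choice that this card does not prescribe.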

**Label distribution on the dataset:**

`obs` is the number of images carrying the label, and `freq` is that count divided by the 141,537 label occurrences in the whole dataset.

| labels             |   obs |       freq |
|:-------------------|------:|-----------:|
| No Finding         | 60361 | 0.426468   |
| Infiltration       | 19894 | 0.140557   |
| Effusion           | 13317 | 0.0940885  |
| Atelectasis        | 11559 | 0.0816677  |
| Nodule             |  6331 | 0.0447304  |
| Mass               |  5782 | 0.0408515  |
| Pneumothorax       |  5302 | 0.0374602  |
| Consolidation      |  4667 | 0.0329737  |
| Pleural_Thickening |  3385 | 0.023916   |
| Cardiomegaly       |  2776 | 0.0196132  |
| Emphysema          |  2516 | 0.0177763  |
| Edema              |  2303 | 0.0162714  |
| Fibrosis           |  1686 | 0.0119121  |
| Pneumonia          |  1431 | 0.0101104  |
| Hernia             |   227 | 0.00160382 |

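The `freq` column can be reproduced from the `obs` column alone. The sketch below copies the counts from the table and divides each by their sum. The variable names are illustrative, and the interpretation of `freq` as a share of label occurrences (rather than of images) is inferred from the numbers, not stated by the dataset authors.

```python
# Per-label observation counts, copied from the table in this card.
OBS = {
    "No Finding": 60361,
    "Infiltration": 19894,
    "Effusion": 13317,
    "Atelectasis": 11559,
    "Nodule": 6331,
    "Mass": 5782,
    "Pneumothorax": 5302,
    "Consolidation": 4667,
    "Pleural_Thickening": 3385,
    "Cardiomegaly": 2776,
    "Emphysema": 2516,
    "Edema": 2303,
    "Fibrosis": 1686,
    "Pneumonia": 1431,
    "Hernia": 227,
}

# Total label occurrences; larger than the 112,120 images because
# one image can carry several labels.
total = sum(OBS.values())

# Share of each label among all label occurrences.
freq = {name: count / total for name, count in OBS.items()}

print(total)  # 141537
for name, value in freq.items():
    print(f"{name:<20}{value:.6f}")
```

Since the denominator is label occurrences, these shares sum to 1 but do not give the fraction of images showing a finding. Dividing by 112,120 instead would give per-image prevalence.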
### Data Splits

|             |train| test|
|-------------|----:|----:|
|# of examples|86524|25596|

**Label distribution by dataset split:**

| labels             |   Train obs |   Train freq |   Test obs |   Test freq |
|:-------------------|------------:|-------------:|-----------:|------------:|
| No Finding         |       50500 |   0.483392   |       9861 |  0.266032   |
| Infiltration       |       13782 |   0.131923   |       6112 |  0.164891   |
| Effusion           |        8659 |   0.082885   |       4658 |  0.125664   |
| Atelectasis        |        8280 |   0.0792572  |       3279 |  0.0884614  |
| Nodule             |        4708 |   0.0450656  |       1623 |  0.0437856  |
| Mass               |        4034 |   0.038614   |       1748 |  0.0471578  |
| Consolidation      |        2852 |   0.0272997  |       1815 |  0.0489654  |
| Pneumothorax       |        2637 |   0.0252417  |       2665 |  0.0718968  |
| Pleural_Thickening |        2242 |   0.0214607  |       1143 |  0.0308361  |
| Cardiomegaly       |        1707 |   0.0163396  |       1069 |  0.0288397  |
| Emphysema          |        1423 |   0.0136211  |       1093 |  0.0294871  |
| Edema              |        1378 |   0.0131904  |        925 |  0.0249548  |
| Fibrosis           |        1251 |   0.0119747  |        435 |  0.0117355  |
| Pneumonia          |         876 |   0.00838518 |        555 |  0.0149729  |
| Hernia             |         141 |   0.00134967 |         86 |  0.00232012 |

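The two splits do not share the same label mix, which matters when comparing train and test metrics. As a rough illustration, the sketch below recomputes each label's share per split from the counts in the table and reports the test-to-train ratio. The `shift` ratio is an ad hoc measure chosen for this example, not something defined by the dataset authors.

```python
# (train_obs, test_obs) per label, copied from the table in this card.
SPLIT_OBS = {
    "No Finding": (50500, 9861),
    "Infiltration": (13782, 6112),
    "Effusion": (8659, 4658),
    "Atelectasis": (8280, 3279),
    "Nodule": (4708, 1623),
    "Mass": (4034, 1748),
    "Consolidation": (2852, 1815),
    "Pneumothorax": (2637, 2665),
    "Pleural_Thickening": (2242, 1143),
    "Cardiomegaly": (1707, 1069),
    "Emphysema": (1423, 1093),
    "Edema": (1378, 925),
    "Fibrosis": (1251, 435),
    "Pneumonia": (876, 555),
    "Hernia": (141, 86),
}

# Total label occurrences in each split.
train_total = sum(train for train, _ in SPLIT_OBS.values())
test_total = sum(test for _, test in SPLIT_OBS.values())

# Ratio of a label's share in test to its share in train; 1.0 means no shift.
shift = {
    name: (test / test_total) / (train / train_total)
    for name, (train, test) in SPLIT_OBS.items()
}

# Print labels from most over-represented in test to most under-represented.
for name, ratio in sorted(shift.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20}{ratio:5.2f}")
```

With these counts, "No Finding" is under-represented in the test split and most pathologies are over-represented, so a threshold tuned on training prevalence may not transfer directly.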
## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### License and attribution

There are no restrictions on the use of the NIH chest x-ray images. However, the dataset has the following attribution requirements:

- Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
- Include a citation to the CVPR 2017 paper (see the Citation Information section)
- Acknowledge that the NIH Clinical Center is the data provider

### Citation Information

```
@inproceedings{Wang_2017,
    doi = {10.1109/cvpr.2017.369},
    url = {https://doi.org/10.1109%2Fcvpr.2017.369},
    year = 2017,
    month = {jul},
    publisher = {{IEEE}},
    author = {Xiaosong Wang and Yifan Peng and Le Lu and Zhiyong Lu and Mohammadhadi Bagheri and Ronald M. Summers},
    title = {{ChestX}-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
    booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})}
}
```

### Contributions

Thanks to [@alcazar90](https://github.com/alcazar90) for adding this dataset.