---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: NIH-CXR14
paperswithcode_id: chestx-ray14
size_categories:
- 100K<n<1M
task_categories:
- image-classification
task_ids:
- multi-class-image-classification
---

# Dataset Card for NIH Chest X-ray dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [License and attribution](#license-and-attribution)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345)
- **Repository:**
- **Paper:** [ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases](https://arxiv.org/abs/1705.02315)
- **Leaderboard:**
- **Point of Contact:** rms@nih.gov

### Dataset Summary

_The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, annotated with fourteen text-mined disease labels (each image can carry multiple labels) that were mined from the associated radiological reports using natural language processing. The fourteen common thoracic pathologies are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_thickening, Cardiomegaly, Nodule, Mass and Hernia, an extension of the 8 common disease patterns listed in our CVPR 2017 paper. Note that the original radiology reports (associated with these chest X-ray studies) are not meant to be publicly shared for many reasons. The text-mined disease labels are expected to have accuracy >90%. Please find more details and benchmark performance of trained models based on the 14 disease labels in our arXiv paper: [1705.02315](https://arxiv.org/abs/1705.02315)_



## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/95db46f21d556880cf0ecb11d45d5ba0b58fcb113c9a0fff2234eba8f74fe22a/images/00000798_022.png',
 'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1024x1024 at 0x7F2151B144D0>,
 'labels': [9, 3]}
```


### Data Fields

The data instances have the following fields:

- `image_file_path`: a `str` with the path to the image file.
- `image`: a `PIL.Image.Image` object containing the image. Note that when accessing the image column, e.g. `dataset[0]["image"]`, the image file is automatically decoded. Decoding a large number of image files can take a significant amount of time, so it is important to query the sample index before the `"image"` column, *i.e.* `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`, which decodes the whole column before indexing.
- `labels`: a list of `int` classification labels; an image can carry more than one label.

<details>
<summary>Class Label Mappings</summary>

```json
{
  "No Finding": 0,
  "Atelectasis": 1,
  "Cardiomegaly": 2,
  "Effusion": 3,
  "Infiltration": 4,
  "Mass": 5,
  "Nodule": 6,
  "Pneumonia": 7,
  "Pneumothorax": 8,
  "Consolidation": 9,
  "Edema": 10,
  "Emphysema": 11,
  "Fibrosis": 12,
  "Pleural_Thickening": 13,
  "Hernia": 14
}
```

</details>
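
As an illustration of how the integer `labels` relate to the class names, the following minimal sketch decodes the `[9, 3]` value from the Data Instances sample using the mapping above. The helper names `decode_labels` and `to_multi_hot` are hypothetical and not part of the dataset or of any library; the multi-hot encoding is just one common way to feed multi-label targets to a classifier.

```python
# Class-label mapping copied from the "Class Label Mappings" block of this card.
LABEL_TO_ID = {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14,
}
# Reverse lookup: integer id -> class name.
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def decode_labels(label_ids):
    """Turn a list of integer label ids into their class names (hypothetical helper)."""
    return [ID_TO_LABEL[i] for i in label_ids]


def to_multi_hot(label_ids, num_classes=len(LABEL_TO_ID)):
    """Encode a list of label ids as a fixed-length 0/1 vector (hypothetical helper)."""
    vector = [0] * num_classes
    for i in label_ids:
        vector[i] = 1
    return vector


sample_labels = [9, 3]  # `labels` value from the Data Instances sample
print(decode_labels(sample_labels))  # ['Consolidation', 'Effusion']
print(to_multi_hot(sample_labels))
```

The multi-hot vector has 15 entries because "No Finding" is stored as its own class (id 0) alongside the fourteen pathologies.
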

**Label distribution on the dataset:**

`obs` is the number of images carrying each label. `freq` is that count divided by the 141,537 label occurrences in the dataset, so the frequencies sum to 1 even though an image can have several labels.

| labels             |   obs |       freq |
|:-------------------|------:|-----------:|
| No Finding         | 60361 | 0.426468   |
| Infiltration       | 19894 | 0.140557   |
| Effusion           | 13317 | 0.0940885  |
| Atelectasis        | 11559 | 0.0816677  |
| Nodule             |  6331 | 0.0447304  |
| Mass               |  5782 | 0.0408515  |
| Pneumothorax       |  5302 | 0.0374602  |
| Consolidation      |  4667 | 0.0329737  |
| Pleural_Thickening |  3385 | 0.023916   |
| Cardiomegaly       |  2776 | 0.0196132  |
| Emphysema          |  2516 | 0.0177763  |
| Edema              |  2303 | 0.0162714  |
| Fibrosis           |  1686 | 0.0119121  |
| Pneumonia          |  1431 | 0.0101104  |
| Hernia             |   227 | 0.00160382 |
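
The `freq` column can be reproduced from the `obs` counts. The sketch below assumes, as the numbers in the table suggest, that each frequency is the label's count divided by the sum of all counts (label occurrences) rather than by the number of images. The variable names are illustrative only.

```python
# Observation counts copied from the label-distribution table above.
OBS = {
    "No Finding": 60361,
    "Infiltration": 19894,
    "Effusion": 13317,
    "Atelectasis": 11559,
    "Nodule": 6331,
    "Mass": 5782,
    "Pneumothorax": 5302,
    "Consolidation": 4667,
    "Pleural_Thickening": 3385,
    "Cardiomegaly": 2776,
    "Emphysema": 2516,
    "Edema": 2303,
    "Fibrosis": 1686,
    "Pneumonia": 1431,
    "Hernia": 227,
}

# Total number of label occurrences; larger than the number of images
# because a single image can carry several labels.
total_occurrences = sum(OBS.values())

# Share of all label occurrences taken by each label.
freq = {label: count / total_occurrences for label, count in OBS.items()}

print(total_occurrences)  # 141537
for label, value in freq.items():
    print(f"{label:<20}{value:.6f}")
```

Dividing by the 112,120 images instead would give the fraction of images carrying each label, which is a different quantity and does not sum to 1.
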

### Data Splits

|             |train| test|
|-------------|----:|----:|
|# of examples|86524|25596|
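
As a quick sanity check, the split sizes can be compared against the 112,120 images quoted in the Dataset Summary. This is a small arithmetic sketch with illustrative variable names, not something shipped with the dataset.

```python
# Split sizes copied from the Data Splits table.
SPLIT_SIZES = {"train": 86524, "test": 25596}

# The two splits together should account for every image in the dataset.
total_images = sum(SPLIT_SIZES.values())

# Fraction of all images that falls into each split.
shares = {name: size / total_images for name, size in SPLIT_SIZES.items()}

print(total_images)  # 112120
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The result is roughly a 77% / 23% train/test division.
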

**Label distribution by dataset split:**

| labels             | Train obs | Train freq | Test obs | Test freq  |
|:-------------------|----------:|-----------:|---------:|-----------:|
| No Finding         |     50500 | 0.483392   |     9861 | 0.266032   |
| Infiltration       |     13782 | 0.131923   |     6112 | 0.164891   |
| Effusion           |      8659 | 0.082885   |     4658 | 0.125664   |
| Atelectasis        |      8280 | 0.0792572  |     3279 | 0.0884614  |
| Nodule             |      4708 | 0.0450656  |     1623 | 0.0437856  |
| Mass               |      4034 | 0.038614   |     1748 | 0.0471578  |
| Consolidation      |      2852 | 0.0272997  |     1815 | 0.0489654  |
| Pneumothorax       |      2637 | 0.0252417  |     2665 | 0.0718968  |
| Pleural_Thickening |      2242 | 0.0214607  |     1143 | 0.0308361  |
| Cardiomegaly       |      1707 | 0.0163396  |     1069 | 0.0288397  |
| Emphysema          |      1423 | 0.0136211  |     1093 | 0.0294871  |
| Edema              |      1378 | 0.0131904  |      925 | 0.0249548  |
| Fibrosis           |      1251 | 0.0119747  |      435 | 0.0117355  |
| Pneumonia          |       876 | 0.00838518 |      555 | 0.0149729  |
| Hernia             |       141 | 0.00134967 |       86 | 0.00232012 |


## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]


### License and attribution

There are no restrictions on the use of the NIH chest X-ray images. However, the dataset has the following attribution requirements:

- Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
- Include a citation to the CVPR 2017 paper (see the [Citation Information](#citation-information) section)
- Acknowledge that the NIH Clinical Center is the data provider


### Citation Information

```
@inproceedings{Wang_2017,
  doi = {10.1109/cvpr.2017.369},
  url = {https://doi.org/10.1109%2Fcvpr.2017.369},
  year = 2017,
  month = {jul},
  publisher = {{IEEE}},
  author = {Xiaosong Wang and Yifan Peng and Le Lu and Zhiyong Lu and Mohammadhadi Bagheri and Ronald M. Summers},
  title = {{ChestX}-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
  booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})}
}
```

### Contributions

Thanks to [@alcazar90](https://github.com/alcazar90) for adding this dataset.