---
annotations_creators:
- machine-generated
- expert-generated
language_creators:
- machine-generated
- expert-generated
language:
- en
license:
- unknown
multilinguality:
- monolingual
pretty_name: NIH-CXR14
paperswithcode_id: chestx-ray14
size_categories:
- 100K<n<1M
task_categories:
- image-classification
task_ids:
- multi-label-image-classification
---

# Dataset Card for NIH Chest X-ray dataset

## Table of Contents

- [Table of Contents](#table-of-contents)
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [License and Attribution](#license-and-attribution)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [NIH Chest X-ray Dataset of 14 Common Thorax Disease Categories](https://nihcc.app.box.com/v/ChestXray-NIHCC/folder/36938765345)
- **Repository:**
- **Paper:** [ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases](https://arxiv.org/abs/1705.02315)
- **Leaderboard:**
- **Point of Contact:** rms@nih.gov

### Dataset Summary

_The ChestX-ray dataset comprises 112,120 frontal-view X-ray images of 30,805 unique patients, labeled with fourteen disease classes that were text-mined from the associated radiology reports using natural language processing; each image can carry multiple labels. The fourteen common thoracic pathologies are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_Thickening, Cardiomegaly, Nodule, Mass, and Hernia, an extension of the 8 common disease patterns listed in the CVPR 2017 paper. The original radiology reports associated with these chest X-ray studies are not meant to be publicly shared. The text-mined disease labels are expected to have accuracy >90%. More details and benchmark performance of models trained on the 14 disease labels are in the arXiv paper: [1705.02315](https://arxiv.org/abs/1705.02315)_

![](https://huggingface.co/datasets/alkzar90/NIH-Chest-X-ray-dataset/resolve/main/data/nih-chest-xray14-portraint.png)

## Dataset Structure

### Data Instances

A sample from the training set is provided below:

```
{'image_file_path': '/root/.cache/huggingface/datasets/downloads/extracted/95db46f21d556880cf0ecb11d45d5ba0b58fcb113c9a0fff2234eba8f74fe22a/images/00000798_022.png',
 'image': <PIL.PngImagePlugin.PngImageFile image mode=L size=1024x1024 at 0x7F2151B144D0>,
 'labels': [9, 3]}
```

### Data Fields

The data instances have the following fields:

- `image_file_path`: a `str` with the path to the image file.
- `image`: a `PIL.Image.Image` object containing the image. Accessing the image column, e.g. `dataset[0]["image"]`, automatically decodes the image file. Decoding a large number of image files can take a significant amount of time, so query the sample index before the `"image"` column: `dataset[0]["image"]` should **always** be preferred over `dataset["image"][0]`.
- `labels`: a `list` of `int` class labels. An image can have more than one label; the sample instance shown under Data Instances has two.

<details>
  <summary>Class Label Mappings</summary>

  ```json
  {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14
  }
  ```

</details>
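
As a minimal sketch (not part of the original card), the mapping above can be inverted to turn a `labels` list into class names, or into a multi-hot vector of the kind commonly used for multi-label training. The helper names `decode_labels` and `to_multi_hot` are hypothetical and chosen for illustration; only the mapping itself is copied from the card.

```python
# Class-name-to-id mapping copied from the card.
LABEL_TO_ID = {
    "No Finding": 0,
    "Atelectasis": 1,
    "Cardiomegaly": 2,
    "Effusion": 3,
    "Infiltration": 4,
    "Mass": 5,
    "Nodule": 6,
    "Pneumonia": 7,
    "Pneumothorax": 8,
    "Consolidation": 9,
    "Edema": 10,
    "Emphysema": 11,
    "Fibrosis": 12,
    "Pleural_Thickening": 13,
    "Hernia": 14,
}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def decode_labels(label_ids):
    """Return the class names for a list of integer label ids."""
    return [ID_TO_LABEL[i] for i in label_ids]


def to_multi_hot(label_ids, num_classes=len(LABEL_TO_ID)):
    """Return a 0/1 list with a 1 at the position of each label id present."""
    vector = [0] * num_classes
    for i in label_ids:
        vector[i] = 1
    return vector


sample_labels = [9, 3]  # the `labels` value of the sample instance in this card
print(decode_labels(sample_labels))  # ['Consolidation', 'Effusion']
print(to_multi_hot(sample_labels))
```

Keeping the ids in a multi-hot vector preserves every label on an image, which a single-integer target could not do.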

**Label distribution on the dataset:**

`obs` is the number of times a label occurs and `freq` is its share of all label occurrences. Because an image can carry several labels, the `obs` column sums to 141,537, more than the 112,120 images.

| labels             |   obs |       freq |
|:-------------------|------:|-----------:|
| No Finding         | 60361 | 0.426468   |
| Infiltration       | 19894 | 0.140557   |
| Effusion           | 13317 | 0.0940885  |
| Atelectasis        | 11559 | 0.0816677  |
| Nodule             |  6331 | 0.0447304  |
| Mass               |  5782 | 0.0408515  |
| Pneumothorax       |  5302 | 0.0374602  |
| Consolidation      |  4667 | 0.0329737  |
| Pleural_Thickening |  3385 | 0.023916   |
| Cardiomegaly       |  2776 | 0.0196132  |
| Emphysema          |  2516 | 0.0177763  |
| Edema              |  2303 | 0.0162714  |
| Fibrosis           |  1686 | 0.0119121  |
| Pneumonia          |  1431 | 0.0101104  |
| Hernia             |   227 | 0.00160382 |
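
The `freq` column can be reproduced from the `obs` column alone. The sketch below is an illustration rather than the script that generated the table: it assumes `freq` is each count divided by the sum of all counts, and the names `OBS`, `total`, and `freq` are hypothetical.

```python
# Label counts copied from the `obs` column of the table.
OBS = {
    "No Finding": 60361,
    "Infiltration": 19894,
    "Effusion": 13317,
    "Atelectasis": 11559,
    "Nodule": 6331,
    "Mass": 5782,
    "Pneumothorax": 5302,
    "Consolidation": 4667,
    "Pleural_Thickening": 3385,
    "Cardiomegaly": 2776,
    "Emphysema": 2516,
    "Edema": 2303,
    "Fibrosis": 1686,
    "Pneumonia": 1431,
    "Hernia": 227,
}

# Total number of label occurrences (not images).
total = sum(OBS.values())

# Assumed definition: share of each label among all label occurrences.
freq = {name: count / total for name, count in OBS.items()}

print(total)  # 141537
for name, value in freq.items():
    print(f"{name:<20} {value:.6f}")

# Average number of labels per image, using the 112,120 images from the summary.
print(total / 112120)
```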

### Data Splits

|               | train |  test |
|---------------|------:|------:|
| # of examples | 86524 | 25596 |

**Label distribution by dataset split:**

| labels             | Train obs | Train freq | Test obs |  Test freq |
|:-------------------|----------:|-----------:|---------:|-----------:|
| No Finding         |     50500 | 0.483392   |     9861 | 0.266032   |
| Infiltration       |     13782 | 0.131923   |     6112 | 0.164891   |
| Effusion           |      8659 | 0.082885   |     4658 | 0.125664   |
| Atelectasis        |      8280 | 0.0792572  |     3279 | 0.0884614  |
| Nodule             |      4708 | 0.0450656  |     1623 | 0.0437856  |
| Mass               |      4034 | 0.038614   |     1748 | 0.0471578  |
| Consolidation      |      2852 | 0.0272997  |     1815 | 0.0489654  |
| Pneumothorax       |      2637 | 0.0252417  |     2665 | 0.0718968  |
| Pleural_Thickening |      2242 | 0.0214607  |     1143 | 0.0308361  |
| Cardiomegaly       |      1707 | 0.0163396  |     1069 | 0.0288397  |
| Emphysema          |      1423 | 0.0136211  |     1093 | 0.0294871  |
| Edema              |      1378 | 0.0131904  |      925 | 0.0249548  |
| Fibrosis           |      1251 | 0.0119747  |      435 | 0.0117355  |
| Pneumonia          |       876 | 0.00838518 |      555 | 0.0149729  |
| Hernia             |       141 | 0.00134967 |       86 | 0.00232012 |
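
The two splits do not share the same label distribution. As a rough illustration (not part of the original card), the sketch below divides each label's test frequency by its train frequency; values well above 1 mark labels that are over-represented in the test split. The names `SPLIT_OBS` and `freq_ratio` are hypothetical, and the counts are copied from the table.

```python
# (train_obs, test_obs) per label, copied from the table.
SPLIT_OBS = {
    "No Finding": (50500, 9861),
    "Infiltration": (13782, 6112),
    "Effusion": (8659, 4658),
    "Atelectasis": (8280, 3279),
    "Nodule": (4708, 1623),
    "Mass": (4034, 1748),
    "Consolidation": (2852, 1815),
    "Pneumothorax": (2637, 2665),
    "Pleural_Thickening": (2242, 1143),
    "Cardiomegaly": (1707, 1069),
    "Emphysema": (1423, 1093),
    "Edema": (1378, 925),
    "Fibrosis": (1251, 435),
    "Pneumonia": (876, 555),
    "Hernia": (141, 86),
}

# Total label occurrences in each split.
train_total = sum(train for train, _ in SPLIT_OBS.values())
test_total = sum(test for _, test in SPLIT_OBS.values())


def freq_ratio(label):
    """Test-split frequency divided by train-split frequency for one label."""
    train, test = SPLIT_OBS[label]
    return (test / test_total) / (train / train_total)


# Print labels from most over-represented in test to most under-represented.
for label in sorted(SPLIT_OBS, key=freq_ratio, reverse=True):
    print(f"{label:<20} {freq_ratio(label):.2f}")
```

A ratio far from 1, such as the one for Pneumothorax, is worth keeping in mind when comparing training and test metrics for that label.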

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### License and Attribution

There are no restrictions on the use of the NIH chest X-ray images. However, the dataset has the following attribution requirements:

- Provide a link to the NIH download site: https://nihcc.app.box.com/v/ChestXray-NIHCC
- Include a citation to the CVPR 2017 paper (see the [Citation Information](#citation-information) section)
- Acknowledge that the NIH Clinical Center is the data provider

### Citation Information

```
@inproceedings{Wang_2017,
    doi = {10.1109/cvpr.2017.369},
    url = {https://doi.org/10.1109/cvpr.2017.369},
    year = 2017,
    month = {jul},
    publisher = {{IEEE}},
    author = {Xiaosong Wang and Yifan Peng and Le Lu and Zhiyong Lu and Mohammadhadi Bagheri and Ronald M. Summers},
    title = {{ChestX}-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases},
    booktitle = {2017 {IEEE} Conference on Computer Vision and Pattern Recognition ({CVPR})}
}
```

### Contributions

Thanks to [@alcazar90](https://github.com/alcazar90) for adding this dataset.