#### 🛑 Note: please do not claim diagnostic performance of a model without a clinical study! This is not a Kaggle competition dataset. Please read these papers about evaluation issues: [https://arxiv.org/abs/2004.12823](https://arxiv.org/abs/2004.12823) and [https://arxiv.org/abs/2004.05405](https://arxiv.org/abs/2004.05405)

## COVID-19 image data collection ([🎬 video about the project](https://www.youtube.com/watch?v=ineWmqfelEQ))

Project Summary: To build a public open dataset of chest X-ray and CT images of patients who are positive for or suspected of having COVID-19 or other viral and bacterial pneumonias ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome), [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome)). Data will be collected from public sources as well as through indirect collection from hospitals and physicians. All images and data will be released publicly in this GitHub repo.

This project is approved by the University of Montreal's Ethics Committee #CERSES-20-058-D.

## View current [images](images) and [metadata](metadata.csv) and [a dataloader example](https://colab.research.google.com/drive/1A-gIZ6Xp-eh2b4CGS6BHH7-OgZtyjeP2)

The labels are arranged in a hierarchy:

<img width=300 src="docs/hierarchy.jpg"/>

Current stats for the PA, AP, and AP Supine views. Labels are 0=No or 1=Yes. The data loader is [here](https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py#L867).
```
COVID19_Dataset num_samples=481 views=['PA', 'AP']
{'ARDS': {0.0: 465, 1.0: 16},
 'Bacterial': {0.0: 445, 1.0: 36},
 'COVID-19': {0.0: 162, 1.0: 319},
 'Chlamydophila': {0.0: 480, 1.0: 1},
 'E.Coli': {0.0: 481},
 'Fungal': {0.0: 459, 1.0: 22},
 'Influenza': {0.0: 478, 1.0: 3},
 'Klebsiella': {0.0: 474, 1.0: 7},
 'Legionella': {0.0: 474, 1.0: 7},
 'Lipoid': {0.0: 473, 1.0: 8},
 'MERS': {0.0: 481},
 'Mycoplasma': {0.0: 476, 1.0: 5},
 'No Finding': {0.0: 467, 1.0: 14},
 'Pneumocystis': {0.0: 459, 1.0: 22},
 'Pneumonia': {0.0: 36, 1.0: 445},
 'SARS': {0.0: 465, 1.0: 16},
 'Streptococcus': {0.0: 467, 1.0: 14},
 'Varicella': {0.0: 476, 1.0: 5},
 'Viral': {0.0: 138, 1.0: 343}}

COVID19_Dataset num_samples=173 views=['AP Supine']
{'ARDS': {0.0: 170, 1.0: 3},
 'Bacterial': {0.0: 169, 1.0: 4},
 'COVID-19': {0.0: 41, 1.0: 132},
 'Chlamydophila': {0.0: 173},
 'E.Coli': {0.0: 169, 1.0: 4},
 'Fungal': {0.0: 171, 1.0: 2},
 'Influenza': {0.0: 173},
 'Klebsiella': {0.0: 173},
 'Legionella': {0.0: 173},
 'Lipoid': {0.0: 173},
 'MERS': {0.0: 173},
 'Mycoplasma': {0.0: 173},
 'No Finding': {0.0: 170, 1.0: 3},
 'Pneumocystis': {0.0: 171, 1.0: 2},
 'Pneumonia': {0.0: 26, 1.0: 147},
 'SARS': {0.0: 173},
 'Streptococcus': {0.0: 173},
 'Varicella': {0.0: 173},
 'Viral': {0.0: 41, 1.0: 132}}
```
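
Because the labels form a hierarchy, a positive leaf label also counts as a positive for each of its ancestors, which would explain why the PA/AP counts nest (445 Pneumonia, 343 Viral, 319 COVID-19). The sketch below illustrates that idea only. It assumes each finding is stored as a slash-separated path and that COVID-19 sits under Viral, which sits under Pneumonia; the sample rows and function names are hypothetical, and `docs/hierarchy.jpg` and the linked data loader remain the authoritative references.

```python
# Hypothetical finding strings. Storing each finding as a slash-separated
# path through the hierarchy is an assumption made for this sketch.
findings = [
    "Pneumonia/Viral/COVID-19",
    "Pneumonia/Viral/COVID-19",
    "Pneumonia/Viral/SARS",
    "Pneumonia/Bacterial/Streptococcus",
    "No Finding",
]


def expand_labels(finding):
    """Return every label on the path, so a leaf implies its ancestors."""
    return {part.strip() for part in finding.split("/") if part.strip()}


def label_counts(findings, labels):
    """Count negatives (0.0) and positives (1.0) for each requested label."""
    counts = {}
    for label in labels:
        positives = sum(label in expand_labels(f) for f in findings)
        counts[label] = {0.0: len(findings) - positives, 1.0: positives}
    return counts


stats = label_counts(findings, ["Pneumonia", "Viral", "COVID-19", "Bacterial"])
print(stats)
```

For the five toy rows this yields four Pneumonia positives, three Viral, and two COVID-19. Unlike the stats printed above, this sketch always includes the `1.0` key, even when a label has no positive samples.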

## Annotations

[Lung Bounding Boxes](https://github.com/GeneralBlockchain/covid-19-chest-xray-lung-bounding-boxes-dataset) and [Chest X-ray Segmentation](https://github.com/GeneralBlockchain/covid-19-chest-xray-segmentations-dataset) (license: CC BY 4.0) contributed by [General Blockchain, Inc.](https://github.com/GeneralBlockchain)

[Pneumonia severity scores for 94 images](annotations/covid-severity-scores.csv) (license: CC BY-SA) from the paper [Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning](http://arxiv.org/abs/2005.11856)

[Generated Lung Segmentations](annotations/lungVAE-masks) (license: CC BY-SA) from the paper [Lung Segmentation from Chest X-rays using Variational Data Imputation](https://arxiv.org/abs/2005.10052)

[Brixia score for 192 images](https://github.com/BrixIA/Brixia-score-COVID-19) (license: CC BY-NC-SA) from the paper [End-to-end learning for semiquantitative rating of COVID-19 severity on Chest X-rays](https://arxiv.org/abs/2006.04603)

[Lung and other segmentations for 517 images](https://github.com/v7labs/covid-19-xray-dataset/tree/master/annotations) (license: CC BY) in COCO and raster formats by [v7labs](https://github.com/v7labs/covid-19-xray-dataset)

## Contribute

 - Submit data directly to the project. View our [research protocol](https://docs.google.com/document/d/14b7cou98YhYcJ2jwOKznChtn5y6-mi9bgjeFv2DxOt0/edit). Contact us to start the process.
 - We can extract images from publications. Help identify publications that are not already included by opening a GitHub issue (the DOIs we already have are listed in the metadata file). There is a searchable database of COVID-19 papers [here](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov), and a non-searchable one (requires download) [here](https://pages.semanticscholar.org/coronavirus-research).

 - Submit data to these sites (we can scrape the data from them):
    - https://radiopaedia.org/ (license CC BY-NC-SA)
    - https://www.sirm.org/category/senza-categoria/covid-19/
    - https://www.eurorad.org/ (license CC BY-NC-SA)
    - https://coronacases.org/ (preferred for CT scans, license Apache 2.0)

 - Provide bounding boxes/masks for the detection of problematic regions in images already collected.

 - See [SCHEMA.md](SCHEMA.md) for more information on the metadata schema.

*Formats:* For chest X-rays, dcm, jpg, or png files are preferred. For CT, nifti (in gzip format) is preferred, but dcm files are also accepted. Please contact us with any questions.
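
Before submitting, contributors may want to check file names against the format preferences above. The following is a minimal sketch, not a project tool: the extensions come from the note above, while the helper name `is_preferred_format` and the `"xray"`/`"ct"` modality keys are assumptions introduced for illustration.

```python
from pathlib import Path

# Extensions taken from the format note above. The modality keys and the
# helper name are illustrative assumptions, not part of the project.
PREFERRED = {
    "xray": (".dcm", ".jpg", ".png"),
    "ct": (".nii.gz", ".dcm"),
}


def is_preferred_format(filename, modality):
    """Return True if the file name ends with a preferred extension."""
    name = Path(filename).name.lower()
    # str.endswith accepts a tuple, which also handles the two-part
    # ".nii.gz" suffix that Path.suffix alone would miss.
    return name.endswith(PREFERRED[modality])


print(is_preferred_format("case12/chest.PNG", "xray"))
print(is_preferred_format("case12/volume.nii.gz", "ct"))
```

The check is name-based only; it does not open the file or verify that its contents match the extension.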

## Background

In the context of the COVID-19 pandemic, we want to improve prognostic predictions to triage and manage patient care. Data is the first step to developing any diagnostic/prognostic tool. While there exist large public datasets of more typical chest X-rays from the NIH [Wang 2017], Spain [Bustos 2019], Stanford [Irvin 2019], MIT [Johnson 2019] and Indiana University [Demner-Fushman 2016], there is no collection of COVID-19 chest X-rays or CT scans designed to be used for computational analysis.

The 2019 novel coronavirus (COVID-19) presents several unique features ([Fang 2020](https://pubs.rsna.org/doi/10.1148/radiol.2020200432), [Ai 2020](https://pubs.rsna.org/doi/10.1148/radiol.2020200642)). While the diagnosis is confirmed using polymerase chain reaction (PCR), infected patients with pneumonia may present on chest X-ray and computed tomography (CT) images with a pattern that is only moderately characteristic to the human eye [Ng 2020](https://pubs.rsna.org/doi/10.1148/ryct.2020200034). In late January 2020, a Chinese team published a paper detailing the clinical and paraclinical features of COVID-19. They reported that patients present abnormalities in chest CT images, with most having bilateral involvement [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). Bilateral multiple lobular and subsegmental areas of consolidation constitute the typical findings in chest CT images of intensive care unit (ICU) patients on admission [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). In comparison, non-ICU patients show bilateral ground-glass opacity and subsegmental areas of consolidation in their chest CT images [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). In these patients, later chest CT images display bilateral ground-glass opacity with resolved consolidation [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext).

## Goal

Our goal is to use these images to develop AI-based approaches to predict and understand the infection. Our group will work to release these models using our open source [Chester AI Radiology Assistant platform](https://mlmed.org/tools/xray/).

The tasks below use chest X-ray or CT images (X-ray preferred) as input:

- Healthy vs Pneumonia (prototype already implemented in [Chester](https://mlmed.org/tools/xray/) with ~74% AUC, validation study [here](https://arxiv.org/abs/2002.02497))

- ~~Bacterial vs Viral vs COVID-19 Pneumonia~~ (not relevant enough for the clinical workflows)

- Prognostic/severity predictions (survival, need for intubation, need for supplemental oxygen)
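
For readers unfamiliar with the AUC figure quoted for the Healthy vs Pneumonia task: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so ~74% means that ordering is correct for roughly three out of four such pairs. The sketch below computes it by direct pair counting. The labels and scores are toy values invented for illustration, not outputs of Chester or any model in this project.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = pneumonia, 0 = healthy.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
print(auc(labels, scores))
```

In this toy example 8 of the 9 positive/negative pairs are ordered correctly, giving an AUC of 8/9. Pair counting is quadratic in the number of samples; it is used here for clarity rather than efficiency.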

## Expected outcomes

Tool impact: These tools would give physicians a digital second opinion that can confirm their assessment of a patient's condition, allowing them to act with more confidence while they wait for a radiologist's analysis. They can also provide quantitative scores to consider and use in studies.

Data impact: Image data linked with clinically relevant attributes in a public dataset that is designed for ML will enable parallel development of these tools and rapid local validation of models. Furthermore, this data can be used for completely different tasks.

## Contact
PI: [Joseph Paul Cohen. Postdoctoral Fellow, Mila, University of Montreal](https://josephpcohen.com/)

## Citations

The second paper is available [here](http://arxiv.org/abs/2006.11988), along with [source code for baselines](https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-baselines).

```
COVID-19 Image Data Collection: Prospective Predictions Are the Future
Joseph Paul Cohen and Paul Morrison and Lan Dao and Karsten Roth and Tim Q Duong and Marzyeh Ghassemi
arXiv:2006.11988, https://github.com/ieee8023/covid-chestxray-dataset, 2020
```

```
@article{cohen2020covidProspective,
  title={COVID-19 Image Data Collection: Prospective Predictions Are the Future},
  author={Joseph Paul Cohen and Paul Morrison and Lan Dao and Karsten Roth and Tim Q Duong and Marzyeh Ghassemi},
  journal={arXiv 2006.11988},
  url={https://github.com/ieee8023/covid-chestxray-dataset},
  year={2020}
}
```

The first paper is available [here](https://arxiv.org/abs/2003.11597).

```
COVID-19 image data collection, arXiv:2003.11597, 2020
Joseph Paul Cohen and Paul Morrison and Lan Dao
https://github.com/ieee8023/covid-chestxray-dataset
```

```
@article{cohen2020covid,
  title={COVID-19 image data collection},
  author={Joseph Paul Cohen and Paul Morrison and Lan Dao},
  journal={arXiv 2003.11597},
  url={https://github.com/ieee8023/covid-chestxray-dataset},
  year={2020}
}
```

<meta name="citation_title" content="COVID-19 image data collection" />
<meta name="citation_publication_date" content="2020" />
<meta name="citation_author" content="Joseph Paul Cohen and Paul Morrison and Lan Dao" />

## License

Each image has a license specified in the metadata.csv file. Licenses include Apache 2.0, CC BY-NC-SA 4.0, and CC BY 4.0.

The metadata.csv, scripts, and other documents are released under a CC BY-NC-SA 4.0 license. Companies are free to use them for research. For any other use, please contact us.
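
Since licenses vary per image, downstream users may need to select only the images whose license suits their use. The sketch below shows one way to do that with the standard `csv` module. The column names `filename` and `license`, the inline sample rows, and the exact license strings are assumptions made for illustration; consult [SCHEMA.md](SCHEMA.md) and metadata.csv for the real field names and values.

```python
import csv
import io

# Hypothetical rows standing in for metadata.csv. The column names
# "filename" and "license" are assumptions made for this sketch.
SAMPLE_CSV = """filename,license
a.jpg,CC BY-NC-SA 4.0
b.png,Apache 2.0
c.jpg,CC BY 4.0
d.jpg,
"""

# Licenses a hypothetical downstream project has decided it can accept.
ALLOWED = {"Apache 2.0", "CC BY 4.0"}


def files_with_license(csv_text, allowed):
    """Return file names whose license field is in the allowed set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["filename"] for row in reader if row["license"].strip() in allowed]


selected = files_with_license(SAMPLE_CSV, ALLOWED)
print(selected)
```

Rows with an empty license field are excluded rather than guessed at, which is the conservative choice when the terms of use are unknown.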