#### 🛑 Note: please do not claim diagnostic performance of a model without a clinical study! This is not a Kaggle competition dataset. Please read these papers about evaluation issues: [https://arxiv.org/abs/2004.12823](https://arxiv.org/abs/2004.12823) and [https://arxiv.org/abs/2004.05405](https://arxiv.org/abs/2004.05405)

## COVID-19 image data collection ([🎬 video about the project](https://www.youtube.com/watch?v=ineWmqfelEQ))

Project Summary: To build a public open dataset of chest X-ray and CT images of patients who are positive for or suspected of having COVID-19 or other viral and bacterial pneumonias ([MERS](https://en.wikipedia.org/wiki/Middle_East_respiratory_syndrome), [SARS](https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome), and [ARDS](https://en.wikipedia.org/wiki/Acute_respiratory_distress_syndrome)). Data will be collected from public sources as well as through indirect collection from hospitals and physicians. All images and data will be released publicly in this GitHub repo.

This project is approved by the University of Montreal's Ethics Committee #CERSES-20-058-D

## View current [images](images) and [metadata](metadata.csv) and [a dataloader example](https://colab.research.google.com/drive/1A-gIZ6Xp-eh2b4CGS6BHH7-OgZtyjeP2)

The labels are arranged in a hierarchy:

<img width=300 src="docs/hierarchy.jpg"/>

Current stats of PA, AP, and AP Supine views. Labels 0=No or 1=Yes. Data loader is [here](https://github.com/mlmed/torchxrayvision/blob/master/torchxrayvision/datasets.py#L867)
```
COVID19_Dataset num_samples=481 views=['PA', 'AP']
{'ARDS': {0.0: 465, 1.0: 16},
 'Bacterial': {0.0: 445, 1.0: 36},
 'COVID-19': {0.0: 162, 1.0: 319},
 'Chlamydophila': {0.0: 480, 1.0: 1},
 'E.Coli': {0.0: 481},
 'Fungal': {0.0: 459, 1.0: 22},
 'Influenza': {0.0: 478, 1.0: 3},
 'Klebsiella': {0.0: 474, 1.0: 7},
 'Legionella': {0.0: 474, 1.0: 7},
 'Lipoid': {0.0: 473, 1.0: 8},
 'MERS': {0.0: 481},
 'Mycoplasma': {0.0: 476, 1.0: 5},
 'No Finding': {0.0: 467, 1.0: 14},
 'Pneumocystis': {0.0: 459, 1.0: 22},
 'Pneumonia': {0.0: 36, 1.0: 445},
 'SARS': {0.0: 465, 1.0: 16},
 'Streptococcus': {0.0: 467, 1.0: 14},
 'Varicella': {0.0: 476, 1.0: 5},
 'Viral': {0.0: 138, 1.0: 343}}

COVID19_Dataset num_samples=173 views=['AP Supine']
{'ARDS': {0.0: 170, 1.0: 3},
 'Bacterial': {0.0: 169, 1.0: 4},
 'COVID-19': {0.0: 41, 1.0: 132},
 'Chlamydophila': {0.0: 173},
 'E.Coli': {0.0: 169, 1.0: 4},
 'Fungal': {0.0: 171, 1.0: 2},
 'Influenza': {0.0: 173},
 'Klebsiella': {0.0: 173},
 'Legionella': {0.0: 173},
 'Lipoid': {0.0: 173},
 'MERS': {0.0: 173},
 'Mycoplasma': {0.0: 173},
 'No Finding': {0.0: 170, 1.0: 3},
 'Pneumocystis': {0.0: 171, 1.0: 2},
 'Pneumonia': {0.0: 26, 1.0: 147},
 'SARS': {0.0: 173},
 'Streptococcus': {0.0: 173},
 'Varicella': {0.0: 173},
 'Viral': {0.0: 41, 1.0: 132}}

```
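
One plausible reading of the hierarchy is that a positive label also marks its parent labels as positive, which is consistent with the counts above (for example, COVID-19 ≤ Viral ≤ Pneumonia in both groups). The sketch below illustrates that expansion. The `PARENT` map is a hypothetical fragment written for this example, the sample findings are made up, and the actual loader in torchxrayvision may work differently.

```python
from collections import Counter

# Hypothetical child -> parent map (assumption; only a fragment of the
# hierarchy, written for illustration).
PARENT = {
    "COVID-19": "Viral",
    "SARS": "Viral",
    "Viral": "Pneumonia",
    "Streptococcus": "Bacterial",
    "Bacterial": "Pneumonia",
}


def expand(finding):
    """Return the finding together with all of its ancestors."""
    labels = set()
    while finding is not None:
        labels.add(finding)
        finding = PARENT.get(finding)
    return labels


def label_counts(findings):
    """Count negatives (0.0) and positives (1.0) for every known label."""
    all_labels = set(PARENT) | set(PARENT.values()) | set(findings)
    positives = Counter(label for f in findings for label in expand(f))
    n = len(findings)
    return {
        label: {0.0: n - positives[label], 1.0: positives[label]}
        for label in sorted(all_labels)
    }


# Made-up findings, not taken from the dataset.
sample = ["COVID-19", "COVID-19", "SARS", "Streptococcus", "No Finding"]
counts = label_counts(sample)
```

Unlike the printed stats, this sketch always emits both the `0.0` and `1.0` keys, even when a count is zero.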

## Annotations

[Lung Bounding Boxes](https://github.com/GeneralBlockchain/covid-19-chest-xray-lung-bounding-boxes-dataset) and [Chest X-ray Segmentation](https://github.com/GeneralBlockchain/covid-19-chest-xray-segmentations-dataset) (license: CC BY 4.0) contributed by [General Blockchain, Inc.](https://github.com/GeneralBlockchain)

[Pneumonia severity scores for 94 images](annotations/covid-severity-scores.csv) (license: CC BY-SA) from the paper [Predicting COVID-19 Pneumonia Severity on Chest X-ray with Deep Learning](http://arxiv.org/abs/2005.11856)

[Generated Lung Segmentations](annotations/lungVAE-masks) (license: CC BY-SA) from the paper [Lung Segmentation from Chest X-rays using Variational Data Imputation](https://arxiv.org/abs/2005.10052)

[Brixia score for 192 images](https://github.com/BrixIA/Brixia-score-COVID-19) (license: CC BY-NC-SA) from the paper [End-to-end learning for semiquantitative rating of COVID-19 severity on Chest X-rays](https://arxiv.org/abs/2006.04603)

[Lung and other segmentations for 517 images](https://github.com/v7labs/covid-19-xray-dataset/tree/master/annotations) (license: CC BY) in COCO and raster formats by [v7labs](https://github.com/v7labs/covid-19-xray-dataset)

## Contribute

 - Submit data directly to the project. View our [research protocol](https://docs.google.com/document/d/14b7cou98YhYcJ2jwOKznChtn5y6-mi9bgjeFv2DxOt0/edit). Contact us to start the process.
 - We can extract images from publications. Help identify publications that are not already included by opening a GitHub issue (DOIs we have are listed in the metadata file). There is a searchable database of COVID-19 papers [here](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/global-research-on-novel-coronavirus-2019-ncov), and a non-searchable one (requires download) [here](https://pages.semanticscholar.org/coronavirus-research).

 - Submit data to these sites (we can scrape the data from them):
    - https://radiopaedia.org/ (license CC BY-NC-SA)
    - https://www.sirm.org/category/senza-categoria/covid-19/
    - https://www.eurorad.org/ (license CC BY-NC-SA)
    - https://coronacases.org/ (preferred for CT scans, license Apache 2.0)

 - Provide bounding boxes/masks for the detection of problematic regions in images already collected.

 - See [SCHEMA.md](SCHEMA.md) for more information on the metadata schema.

*Formats:* For chest X-rays, dcm, jpg, or png are preferred. For CT, nifti (in gzip format) is preferred, but dcm files are also accepted. Please contact us with any questions.
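
Before submitting, a contributor could check files against these format preferences with something like the sketch below. The function names, the modality strings, and the exact extension spellings (including `.jpeg` and `.nii.gz`) are assumptions made for this example and are not part of the project.

```python
from pathlib import Path

# Extension spellings are assumptions based on the formats named above.
XRAY_PREFERRED = {".dcm", ".jpg", ".jpeg", ".png"}
CT_PREFERRED = {".nii.gz"}
CT_ACCEPTED = {".dcm"}


def extension(filename):
    """Return the lowercase extension, treating '.nii.gz' as one unit."""
    name = Path(filename).name.lower()
    if name.endswith(".nii.gz"):
        return ".nii.gz"
    return Path(name).suffix


def check_format(filename, modality):
    """Classify a file as 'preferred', 'accepted', or 'other'."""
    ext = extension(filename)
    if modality == "X-ray":
        return "preferred" if ext in XRAY_PREFERRED else "other"
    if modality == "CT":
        if ext in CT_PREFERRED:
            return "preferred"
        return "accepted" if ext in CT_ACCEPTED else "other"
    raise ValueError(f"unknown modality: {modality!r}")
```

For example, `check_format("volume.nii.gz", "CT")` returns `"preferred"`, while a CT slice stored as `.dcm` is reported as `"accepted"`.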

## Background

In the context of the COVID-19 pandemic, we want to improve prognostic predictions to triage and manage patient care. Data is the first step to developing any diagnostic/prognostic tool. While there exist large public datasets of more typical chest X-rays from the NIH [Wang 2017], Spain [Bustos 2019], Stanford [Irvin 2019], MIT [Johnson 2019] and Indiana University [Demner-Fushman 2016], there is no collection of COVID-19 chest X-rays or CT scans designed to be used for computational analysis.

The 2019 novel coronavirus (COVID-19) presents several unique features [Fang, 2020](https://pubs.rsna.org/doi/10.1148/radiol.2020200432) and [Ai 2020](https://pubs.rsna.org/doi/10.1148/radiol.2020200642). While the diagnosis is confirmed using polymerase chain reaction (PCR), infected patients with pneumonia may present on chest X-ray and computed tomography (CT) images with a pattern that is only moderately characteristic for the human eye [Ng, 2020](https://pubs.rsna.org/doi/10.1148/ryct.2020200034). In late January, a Chinese team published a paper detailing the clinical and paraclinical features of COVID-19. They reported that patients present abnormalities in chest CT images with most having bilateral involvement [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). Bilateral multiple lobular and subsegmental areas of consolidation constitute the typical findings in chest CT images of intensive care unit (ICU) patients on admission [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). In comparison, non-ICU patients show bilateral ground-glass opacity and subsegmental areas of consolidation in their chest CT images [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext). In these patients, later chest CT images display bilateral ground-glass opacity with resolved consolidation [Huang 2020](https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30183-5/fulltext).

## Goal

Our goal is to use these images to develop AI-based approaches to predict and understand the infection. Our group will work to release these models using our open source [Chester AI Radiology Assistant platform](https://mlmed.org/tools/xray/).

The tasks below use chest X-ray or CT (preference for X-ray) as input:

- Healthy vs Pneumonia (prototype already implemented in [Chester](https://mlmed.org/tools/xray/) with ~74% AUC, validation study [here](https://arxiv.org/abs/2002.02497))

- ~~Bacterial vs Viral vs COVID-19 Pneumonia~~ (not relevant enough for the clinical workflows)

- Prognostic/severity predictions (survival, need for intubation, need for supplemental oxygen)

## Expected outcomes

Tool impact: This would give physicians an edge and allow them to act with more confidence while they wait for the analysis of a radiologist by having a digital second opinion confirm their assessment of a patient's condition. Also, these tools can provide quantitative scores to consider and use in studies.

Data impact: Image data linked with clinically relevant attributes in a public dataset that is designed for ML will enable parallel development of these tools and rapid local validation of models. Furthermore, this data can be used for completely different tasks.

## Contact
PI: [Joseph Paul Cohen. Postdoctoral Fellow, Mila, University of Montreal](https://josephpcohen.com/)

## Citations

Second Paper available [here](http://arxiv.org/abs/2006.11988) and [source code for baselines](https://github.com/mlmed/torchxrayvision/tree/master/scripts/covid-baselines)

```
COVID-19 Image Data Collection: Prospective Predictions Are the Future
Joseph Paul Cohen and Paul Morrison and Lan Dao and Karsten Roth and Tim Q Duong and Marzyeh Ghassemi
arXiv:2006.11988, https://github.com/ieee8023/covid-chestxray-dataset, 2020
```

```
@article{cohen2020covidProspective,
  title={COVID-19 Image Data Collection: Prospective Predictions Are the Future},
  author={Joseph Paul Cohen and Paul Morrison and Lan Dao and Karsten Roth and Tim Q Duong and Marzyeh Ghassemi},
  journal={arXiv 2006.11988},
  url={https://github.com/ieee8023/covid-chestxray-dataset},
  year={2020}
}
```

Paper available [here](https://arxiv.org/abs/2003.11597)

```
COVID-19 image data collection, arXiv:2003.11597, 2020
Joseph Paul Cohen and Paul Morrison and Lan Dao
https://github.com/ieee8023/covid-chestxray-dataset
```

```
@article{cohen2020covid,
  title={COVID-19 image data collection},
  author={Joseph Paul Cohen and Paul Morrison and Lan Dao},
  journal={arXiv 2003.11597},
  url={https://github.com/ieee8023/covid-chestxray-dataset},
  year={2020}
}
```

<meta name="citation_title" content="COVID-19 image data collection" />
<meta name="citation_publication_date" content="2020" />
<meta name="citation_author" content="Joseph Paul Cohen and Paul Morrison and Lan Dao" />

## License

Each image has its license specified in the metadata.csv file. Licenses include Apache 2.0, CC BY-NC-SA 4.0, and CC BY 4.0.
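
Because licenses vary per image, downstream users may want to select images by license before use. The sketch below shows one way to do that with Python's standard `csv` module. The column names `filename` and `license`, the sample rows, and the exact license strings are assumptions for illustration; check SCHEMA.md and metadata.csv for the real names and spellings.

```python
import csv
import io

# Made-up rows standing in for metadata.csv; the column names are assumptions.
SAMPLE = """filename,license
a.jpg,CC BY-NC-SA 4.0
b.png,Apache 2.0
c.jpg,CC BY 4.0
d.jpg,
"""

# Hypothetical allow-list chosen for this example.
ALLOWED = {"Apache 2.0", "CC BY 4.0"}


def filter_by_license(csv_file, allowed):
    """Return the filenames whose license field is in `allowed`."""
    reader = csv.DictReader(csv_file)
    return [
        row["filename"]
        for row in reader
        if (row["license"] or "").strip() in allowed
    ]


kept = filter_by_license(io.StringIO(SAMPLE), ALLOWED)
```

Rows with a missing license are excluded rather than guessed at. To read the real file, pass `open("metadata.csv", newline="")` in place of the `io.StringIO` object.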

The metadata.csv, scripts, and other documents are released under a CC BY-NC-SA 4.0 license. Companies are free to perform research. Beyond that, contact us.