# Data Analysis Report for MURA

MURA is a dataset of musculoskeletal radiographs consisting of 14,982 `studies` from 12,251 `patients`, with a total of 40,895 `multi-view radiographic images`. Each `study` belongs to one of seven standard upper extremity radiographic `study types`: elbow, finger, forearm, hand, humerus, shoulder and wrist.

## Components of MURA dataset

The MURA dataset comes with `train`, `valid` and `test` folders containing the corresponding datasets. `train.csv` and `valid.csv` contain the paths of the `radiographic images` and their labels. Each image is labeled 1 (abnormal) or 0 (normal), according to whether its corresponding study is positive or negative, respectively. <br>

These radiographic images are sometimes also referred to as `views`.

## Components of `train` and `valid` set

* The `train` set consists of seven `study types`, namely:

    `XR_ELBOW`  `XR_FINGER`  `XR_FOREARM`  `XR_HAND`  `XR_HUMERUS`  `XR_SHOULDER`  `XR_WRIST`

* Each `study type` contains several folders named like:

    `patient12104`  `patient12110`  `patient12116`  `patient12122`  `patient12128` ...

* These folders are named after patient ids. Each of them contains one or more `study` folders, named like:

    `study1_negative` `study2_negative` `study3_positive` ... <br>

* Each `study` contains one or more radiographs (views or images), named like:

    `image1.png`  `image2.png` ...

* Each view (image) is RGB with pixel range [0, 255], and the views vary in dimensions.

**NOTE**: all of the above points also hold for the `test` set, except the third: there the `study` folders are named like `study1`, `study2` ...

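The folder layout described above can be turned into metadata with a few lines of Python. The following is only a minimal sketch: it assumes paths of the form `<split>/<study type>/<patient>/<study>/<image>`, and it assumes that a `_positive` suffix marks an abnormal study (label 1) and a `_negative` suffix a normal one (label 0). The names `ViewInfo` and `parse_view_path` are hypothetical helpers and are not part of the dataset.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ViewInfo(NamedTuple):
    study_type: str       # e.g. "XR_WRIST"
    patient_id: str       # e.g. "patient12104"
    study: str            # e.g. "study1_negative" (or "study1" in the test set)
    image: str            # e.g. "image1.png"
    label: Optional[int]  # 1 = abnormal, 0 = normal, None if the name has no suffix


def parse_view_path(path: str) -> ViewInfo:
    """Split a view path into its components, using the last four path parts."""
    study_type, patient_id, study, image = PurePosixPath(path).parts[-4:]
    if study.endswith("_positive"):
        label = 1
    elif study.endswith("_negative"):
        label = 0
    else:
        label = None  # assumption: unsuffixed study folders, as in the test set
    return ViewInfo(study_type, patient_id, study, image, label)


# Illustrative path only; the leading "train/" component is an assumption.
print(parse_view_path("train/XR_WRIST/patient12104/study3_positive/image2.png"))
```

Only the last four path components are used, so the sketch does not depend on what precedes the `study type` folder.
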
## Some insightful plots

### Plot of number of patients vs `study type`

<img src="images/pcpst.jpg"></img>

In the `train` set, `XR_WRIST` has the largest number of patients, followed by `XR_FINGER`, `XR_HUMERUS`, `XR_SHOULDER`, `XR_HAND`, `XR_ELBOW` and `XR_FOREARM`. `XR_FOREARM` has the fewest, with 606 patients. A similar pattern can be seen in the `valid` set: `XR_WRIST` has the most patients, followed by `XR_FINGER`, `XR_SHOULDER`, `XR_HUMERUS`, `XR_HAND`, `XR_ELBOW` and `XR_FOREARM`.

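Counts like these could be derived from the image paths alone. The sketch below is one possible way, not necessarily how the plot was produced: it assumes each path ends in `<study type>/<patient>/<study>/<image>`, and `patients_per_study_type` and the sample paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def patients_per_study_type(paths):
    """Map each study type to the number of distinct patient folders under it."""
    patients = defaultdict(set)
    for path in paths:
        # The 4th- and 3rd-from-last components are the study type and patient id.
        study_type, patient_id = PurePosixPath(path).parts[-4:-2]
        patients[study_type].add(patient_id)
    return {study_type: len(ids) for study_type, ids in patients.items()}


# Hypothetical sample paths, for illustration only.
sample_paths = [
    "train/XR_WRIST/patient00001/study1_positive/image1.png",
    "train/XR_WRIST/patient00001/study1_positive/image2.png",
    "train/XR_WRIST/patient00002/study1_negative/image1.png",
    "train/XR_FOREARM/patient00003/study1_negative/image1.png",
]
print(patients_per_study_type(sample_paths))
```

Collecting patient ids in a set means a patient with several views or studies is counted only once per `study type`.
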
### Plot of number of patients vs study count

A patient may have multiple `study`s of the same `study type`; for example, a patient may have 3 `study`s for wrist, independent of each other. <br>
The following plot shows how the number of patients varies with the number of `study`s.

**NOTE**: study count = number of studies, so if 4 patients have study count 3, each of those 4 patients has undergone 3 `study`s for a given `study type`.

<img src="images/pcpsc.jpg"></img>

Patients of the `XR_FOREARM` and `XR_HUMERUS` `study type`s have only 1 or 2 `study`s.
Patients of `XR_FINGER`, `XR_HAND` and `XR_ELBOW` have up to 3 `study`s.
Patients of `XR_SHOULDER` and `XR_WRIST` have up to 4 `study`s.

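The study count defined in the note can be computed by grouping study folders per patient and per `study type`. This is a minimal sketch under the same path-layout assumption (`<study type>/<patient>/<study>/<image>` as the last four components); `study_count_histogram` and the sample paths are hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def study_count_histogram(paths):
    """For each study type, count how many patients have 1, 2, 3, ... studies."""
    studies = defaultdict(set)  # (study type, patient id) -> set of study folder names
    for path in paths:
        study_type, patient_id, study = PurePosixPath(path).parts[-4:-1]
        studies[(study_type, patient_id)].add(study)

    histogram = defaultdict(Counter)  # study type -> {study count: number of patients}
    for (study_type, _patient_id), names in studies.items():
        histogram[study_type][len(names)] += 1
    return histogram


# Hypothetical sample: one patient with 2 wrist studies, one patient with 1.
sample_paths = [
    "train/XR_WRIST/patient00001/study1_positive/image1.png",
    "train/XR_WRIST/patient00001/study2_negative/image1.png",
    "train/XR_WRIST/patient00002/study1_negative/image1.png",
    "train/XR_WRIST/patient00002/study1_negative/image2.png",
]
print(dict(study_count_histogram(sample_paths)["XR_WRIST"]))
```
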
### Plot of number of `study`s vs number of views

Each `study` may have one or more views. The following plot shows the variation in the number of views per `study` in the `train` dataset.

<img src="images/nsvc.jpg"></img>

The maximum number of views per `study` is found in `XR_SHOULDER`, which has a `study` with as many as 13 images (views); similarly, `XR_HUMERUS` has a `study` with 10 images. Most `study`s have 2, 3 or 4 images.

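The number of views per `study` can likewise be tallied from the image paths. The sketch below assumes that each path ends in `<study type>/<patient>/<study>/<image>` and that every listed path is one view; `views_per_study_histogram` and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def views_per_study_histogram(paths):
    """Count how many studies have 1, 2, 3, ... views."""
    # One key per study: (study type, patient id, study folder name).
    views = Counter(PurePosixPath(path).parts[-4:-1] for path in paths)
    return Counter(views.values())


# Hypothetical sample: one study with 2 views, one study with 1 view.
sample_paths = [
    "train/XR_SHOULDER/patient00001/study1_positive/image1.png",
    "train/XR_SHOULDER/patient00001/study1_positive/image2.png",
    "train/XR_HUMERUS/patient00002/study1_negative/image1.png",
]
print(dict(views_per_study_histogram(sample_paths)))
```
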