# Machine learning on electronic health records

Repository of code developed for *Machine learning models in electronic health records can outperform conventional survival models for predicting patient mortality in coronary artery disease*.

## Introduction

The R scripts provided in this repository were developed to perform survival modelling on 100,000 patients’ electronic health records.

## Usage

To use these scripts, download them into a subfolder ``code`` of a folder which also contains subfolders ``data`` (which should contain input data) and ``output`` (where output files are placed by the scripts).
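As a rough sketch, the layout described above could be set up from a shell as follows. The name ``my-project`` is a placeholder, not something the scripts require; only the ``code``, ``data`` and ``output`` subfolder names come from this readme.

```shell
# "my-project" is a placeholder; pick any name for the top-level folder.
project=my-project

# The three subfolders described above: scripts, input data and generated output.
mkdir -p "$project/code" "$project/data" "$project/output"
```

The scripts then go into ``my-project/code`` and the input data into ``my-project/data``.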
12
13
Most scripts [can use ``spin`` to generate reports](http://deanattali.com/2015/03/24/knitrs-best-hidden-gem-spin/), and the simplest way to do this is to execute them using the ``spinme.R`` script provided. For example, to run the discretised Cox model script and generate a report, navigate to the ``code/cox-ph`` folder of your project, and run:

```
Rscript ../scripts/spinme.R cox-discretised.R
```

This will generate a report ``cox-discretised.html`` in the working directory, as well as additional files in the ``output`` folder.
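To spin several scripts in one go, the command above can be wrapped in a small shell function. This is only a sketch: ``spin_scripts`` and the ``RSCRIPT`` override are names made up for this example, and it assumes it is run from the folder containing the scripts, as in the example above.

```shell
# Spin each script named on the command line, in order.
# RSCRIPT is a hypothetical override for the interpreter (defaults to Rscript).
spin_scripts() {
  for script in "$@"; do
    # Same invocation as the single-script example; stop at the first failure.
    "${RSCRIPT:-Rscript}" ../scripts/spinme.R "$script" || return 1
  done
}
```

For example, ``spin_scripts cox-discretised.R cox-discretised-imputed.R`` would spin both discretised Cox scripts in turn, and setting ``RSCRIPT=echo`` prints the commands instead of running them.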
## Project inventory

This is a brief description of the important files in this project, broken down by folder.

### age-only

#### age-only.R

This calculates the C-index and calibration score for a model where age is the only variable. For the C-index, it is assumed that the older patient will die first, and for calibration the Kaplan–Meier estimator for patients of a given age is assumed to be the risk estimate for all patients of that age.
### cox-ph

Various Cox proportional hazards models. Those prefixed ``caliber-replicate-`` are based on the Cox model developed in [Rapsomaniki et al. 2014](https://academic.oup.com/eurheartj/article-lookup/doi/10.1093/eurheartj/eht533) (DOI: [10.1093/eurheartj/eht533](https://dx.doi.org/10.1093/eurheartj/eht533)).

#### caliber-replicate-with-imputation.R

This model is as close to identical to Rapsomaniki et al. as possible, using a five-fold multiply-imputed dataset, with continuous variables scaled as in that paper.

#### caliber-replicate-with-missing.R

This model uses the same scaling as Rapsomaniki et al., but is conducted on a single dataset with missing values represented by missing indicator variables rather than imputed.

#### caliber-scale.R

Functions to scale data for Cox modelling.

#### cox-discrete-elasticnet.R

Discrete elastic net Cox model for the data-driven modelling, which cross-validates to find the optimal _α_ and then bootstraps to establish distributions for the other fitted parameters.

#### cox-discrete-varsellogrank.R

Discrete Cox model for the data-driven modelling, which cross-validates over number of variables used, drawing from a list ranked by univariate logrank tests.

#### cox-discretised.R

This model uses the expert-selected dataset with discretised versions of continuous variables to allow missing values to be incorporated, and cross-validates to determine the discretisation scheme.

#### cox-discretised-imputed.R

This model uses the imputed version of the expert-selected dataset with discretised versions of continuous variables, following the same method as above.

#### rapsomaniki-cox-values-from-paper.csv

Values for Cox coefficients transcribed from Rapsomaniki et al., used to check for consistency between that model and these.
### lib

Shared libraries for functions and common routines.

#### all-cv-bootstrap.R

Script to cross-validate discretisation schemes, followed by bootstrapping the selected optimal model. Works for both Cox modelling and random forests with either ``randomForestSRC`` or ``ranger``. Discretised random forests were not used in the final analysis, as there was no appreciable performance gain.

#### handy.R

Shortcuts and wrappers, from [Andrew’s](https://github.com/ajsteele/) handy [handy.R](https://github.com/ajsteele/handy.R) script.

#### handymedical.R

Useful functions and wrappers for preparing and manipulating data, and for making the use of Cox models and random forests (with either ``randomForestSRC`` or ``ranger``), including bootstrapping, as transparent and consistent as possible. These functions are hopefully of general use for other survival modelling projects; dataset-specific functions are defined in ``shared.R``.

#### rfsrc-cv-mtry-nsplit-logical.R

Script to cross-validate ``randomForestSRC`` survival forests using the large dataset, optimising the ``mtry`` and ``nsplit`` hyperparameters.

#### rfsrc-cv-nsplit-bootstrap.R

Script to cross-validate ``randomForestSRC`` survival forests using the expert-selected dataset, optimising ``nsplit``.

#### shared.R

This script is run at the start of most model scripts, and defines a random seed, plus variables and functions which will be useful. The data-parsing functions here are specific to the scheme of this particular dataset and so were excluded from ``handymedical.R``.
### overview

Various scripts for exploring the dataset and retrieving and plotting results for publication.

#### all-models.R

Produces a graph of the C-index and calibration score from all models. The basis of Fig. 1 in the paper.

#### bigdata-mtry-nsplit.R

Plots a line graph showing C-index performance of random forests depending on ``mtry`` and ``nsplit`` in the large dataset.

#### calibration-plots.R

Plots two example calibration curves to show how the calibration score is calculated. The basis of Fig. 2 in the paper.

#### cohort-tables.R

Prints a number of summary statistics used for Table 2 in the paper.

#### explore-dataset.R

A number of quick graphs and comparisons to explore the expert-selected dataset, with a particular focus on the degree and distribution of missing data.

#### missing-values-risk.R

Compares coefficients for Cox models: first, continuous imputed vs continuous with missing indicators and discrete; second, the range of risks associated with continuous values vs the risk associated with a value being missing; finally, survival curves for patients with a particular value missing vs present. The basis of Fig. 3 in the paper.

#### performance-differences.R

Calculates pairwise differences, with uncertainty, in C-index and calibration between all models tested, by finding the distribution of differences between bootstrap replicates for each model pair.

#### variable-effects.R

Plots of variable effects for continuous and discrete Cox models, and random forests. The basis of Fig. 4 in the paper.

#### variable-importances.R

Plots permutation variable importances calculated for the final data-driven models, post variable selection. The basis of Fig. 5 in the paper.
### random-forest

#### rf-age.R

Building a random forest with fewer variables (including just age) to experiment with predictive power.

#### rf-classification.R

Classification forest for death at 5 years, in an attempt to improve calibration score of the resulting model.

#### rf-imputed.R

Random forest on the imputed dataset as an empirical test of whether imputation provides an advantage.

#### rfsrc-cv.R

Random forest on the expert-selected dataset, which uses ``rfsrc-cv-nsplit-bootstrap.R`` from ``lib`` (see above) to fit its forest.

#### rf-varsellogrank.R

Random forest for the data-driven modelling, which cross-validates over number of variables used, drawing from a list ranked by univariate logrank tests.

#### rf-varselmiss.R

Random forest for the data-driven modelling, which cross-validates over number of variables used, drawing from a list ranked by decreasing missingness.

#### rf-varselrf-eqv.R

Random forest for the data-driven modelling, which cross-validates over number of variables used, drawing from a list ranked by the variable importance of a large random forest fitted to all the data. Modelled after [varSelRF](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-3).
### scripts

This folder is for a few miscellaneous short scripts.

#### export-bigdata.R

This script was used to export anonymised data for the data-driven modelling with ~600 variables. Its file paths will not work elsewhere, because they are specific to the secure environment where the raw data are stored.

#### spinme.R

This wrapper makes it easy to spin a script into an HTML report from the command line (see the example command in the Usage section above).
## Notes

This repository has been tidied up so that only scripts relevant to the final publication are preserved. Various initial and exploratory analysis scripts have been removed for clarity. If for any reason these are of interest, they are present in commit 08934808c497a0f094c71a731cb9cb2564e4cc0f, the final commit before the tidy-up began.
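If those earlier scripts are needed, one way to get at them is to check out that commit in a local clone. A minimal sketch, assuming ``git`` is installed; ``checkout_pre_tidy`` is a name invented for this example, and its argument is the path to your clone.

```shell
# Commit hash quoted in the note above.
PRE_TIDY_COMMIT=08934808c497a0f094c71a731cb9cb2564e4cc0f

# Check out the pre-tidy-up state in the clone at path $1 (leaves HEAD detached).
checkout_pre_tidy() {
  git -C "$1" checkout "$PRE_TIDY_COMMIT"
}

# Example, from inside the clone: checkout_pre_tidy .
```

Afterwards, ``git checkout -`` returns to whatever was checked out before.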