--- a
+++ b/README.md
@@ -0,0 +1,42 @@
+medGAN
+=========================================
+medGAN is a generative adversarial network for generating multi-label discrete patient records. It can generate both binary and count variables (i.e. medical codes such as diagnosis codes, medication codes or procedure codes).
+
+#### Relevant Publications
+
+medGAN implements the algorithm introduced in the following [paper](https://arxiv.org/abs/1703.06490):
+
+    Generating Multi-label Discrete Patient Records using Generative Adversarial Networks
+    Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F. Stewart, Jimeng Sun
+    Machine Learning for Healthcare (MLHC) 2017
+
+#### Code Description
+
+This code trains a generative adversarial network to generate patient records. It currently handles patient records that are aggregated over time and therefore represented as a matrix, where each row corresponds to a patient and each column to a specific medical code (e.g. diagnosis code, medication code, or procedure code). Each matrix entry is either binary (i.e. whether a specific medical code occurred in the longitudinal patient record) or a count (i.e. how many times a specific medical code occurred in the longitudinal patient record).
+
+#### Running medGAN
+
+**STEP 1: Installation**
+
+1. medGAN was implemented to run on [TensorFlow](https://www.tensorflow.org/) 1.2. TensorFlow can be easily installed in Ubuntu as suggested [here](https://www.tensorflow.org/install/install_linux).
+
+2. Download/clone the medGAN code.
+
+**STEP 2: Fast way to test medGAN with MIMIC-III**
+This step describes how to train medGAN on MIMIC-III in a minimum number of steps.
+
+0. You will first need to request access to [MIMIC-III](https://mimic.physionet.org/gettingstarted/access/), a publicly available collection of electronic health records from ICU patients gathered over 11 years.
+
+1. You can use "process_mimic.py" to process the MIMIC-III dataset and generate a suitable training dataset for medGAN.
+Place the script in the same directory as the MIMIC-III CSV files, and run it.
+The execution command is `python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file> <"binary"|"count">`.
+Note that the last argument decides whether you construct a binary matrix or a count matrix.
+The above command will extract ICD9 diagnosis codes from MIMIC-III.
+Note that this script uses only the first 3 digits of each ICD9 diagnosis code. If you want to use all 5 digits, please see the source code of "process_mimic.py".
+
+2. Run medGAN using the ".matrix" file generated by process_mimic.py. The command is:
+`python medgan.py <matrix file> <output path> --data_type=<"binary"|"count">`.
+
+3. After training, if you want to generate synthetic records, use this command:
+`python medgan.py <matrix file> <generated output path> --model_file=<trained output path> --generate_data=True --data_type=<"binary"|"count">`.
+Note that `<matrix file>` is not actually used for generating synthetic records, so it serves only as a dummy input.
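
The patient-by-code matrix described under Code Description can be sketched roughly as follows. This is a minimal illustration and is not taken from "process_mimic.py": the helper name `build_matrix`, the toy records, and the simplified rule of keeping the first 3 characters of each code are all assumptions made for the example. It uses plain Python lists to stay dependency-free.

```python
def build_matrix(records, data_type="binary"):
    """Build a patient-by-code matrix from (patient_id, icd9_code) pairs.

    Hypothetical helper for illustration only; not part of the medGAN code.
    """
    if data_type not in ("binary", "count"):
        raise ValueError("data_type must be 'binary' or 'count'")

    # Simplified assumption: keep only the first 3 characters of each code.
    truncated = [(pid, code[:3]) for pid, code in records]

    # Fix a deterministic row order (patients) and column order (codes).
    patients = sorted({pid for pid, _ in truncated})
    codes = sorted({code for _, code in truncated})
    row = {pid: i for i, pid in enumerate(patients)}
    col = {code: j for j, code in enumerate(codes)}

    matrix = [[0] * len(codes) for _ in patients]
    for pid, code in truncated:
        if data_type == "binary":
            # Binary: mark that the code occurred at least once.
            matrix[row[pid]][col[code]] = 1
        else:
            # Count: accumulate the number of occurrences.
            matrix[row[pid]][col[code]] += 1
    return matrix, patients, codes


# Toy records (made up): patient 1 has two codes sharing the prefix "401".
records = [(1, "4019"), (1, "4011"), (1, "25000"), (2, "25000")]
binary_matrix, patients, codes = build_matrix(records, "binary")
count_matrix, _, _ = build_matrix(records, "count")

print(codes)          # ['250', '401']
print(binary_matrix)  # [[1, 1], [1, 0]]
print(count_matrix)   # [[1, 2], [1, 0]]
```

The two outputs differ only for patient 1 and the column "401": the binary matrix records that the code occurred, while the count matrix records that it occurred twice.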