DrugCell is an interpretable neural network-based model that predicts
cell response to a wide range of drugs. Unlike fully connected neural networks,
the connectivity of neurons in DrugCell mirrors a
biological hierarchy (e.g., Gene Ontology), so that information travels
only between subsystems (or pathways) with a known hierarchical relationship
during model training.
This design makes it possible to identify the
subsystems in the hierarchy that are important to the model's prediction,
which in turn warrant further investigation of the biological mechanisms underlying
cell response to treatments.
The current version (v1.0) of the DrugCell model
was trained on 509,294 (cell line, drug) pairs across
1,235 tumor cell lines and 684 drugs. The training data were retrieved from the Genomics of
Drug Sensitivity in Cancer (GDSC) database and the Cancer Therapeutics Response
Portal (CTRP) v2.
DrugCell characterizes each cell line by its genotype:
the feature vector for each cell is a binary vector representing
the mutational status of the top 15% most frequently mutated genes in cancer
(n = 3,008).
Drugs are encoded using Morgan fingerprints (radius = 2), and the resulting
feature vectors are binary vectors of length 2,048.
DrugCell training/testing scripts require the following environment setup:
Hardware required for training a new model
Software
conda install pytorch torchvision -c pytorch
conda install pytorch torchvision cpuonly -c pytorch
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
Set up a virtual environment
conda env create -f environment_cpu_mac.yml
conda env create -f environment_cpu_linux.yml
conda env create -f environment.yml
source activate pytorch3drugcell (or pytorch3drugcellcpu)
DrugCell v1.0 was trained using (cell line, drug) pairs, but
it can be generalized to estimate the response of any cell to any drug if:
1. The feature vector of the cell is built as a binary vector representing
the mutational status of the 3,008 genes (the index and name of each gene
are listed in gene2ind.txt).
2. The feature vector of the drug is encoded as a binary vector of length 2,048
using a Morgan fingerprint (radius = 2). We also provide pre-computed
feature vectors for the 684 drugs in our training data (drug2fingerprint.txt).
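As an illustration of condition 1, the sketch below builds a cell's binary feature vector from a gene index map. It assumes that gene2ind.txt holds one tab-delimited "index, gene name" pair per line with 0-based indices, which the text above does not spell out. The helper names load_index_map and mutation_vector are hypothetical, and the three-gene map is toy data, not the real 3,008-gene list.

```python
def load_index_map(lines):
    """Parse 'index<TAB>name' lines into a name -> index dictionary."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        idx, name = line.split("\t")
        mapping[name] = int(idx)
    return mapping


def mutation_vector(mutated_genes, gene2ind):
    """Return a 0/1 list with a 1 at the index of every mutated gene.

    Genes that are absent from gene2ind are ignored, because the model
    only has inputs for the genes in the index file.
    """
    vec = [0] * len(gene2ind)
    for gene in mutated_genes:
        if gene in gene2ind:
            vec[gene2ind[gene]] = 1
    return vec


# Toy index map (assumed format, 0-based indices).
gene2ind = load_index_map(["0\tAKT2", "1\tIL1B", "2\tSRC"])
vec = mutation_vector({"SRC", "AKT2", "TP53"}, gene2ind)

# One comma-delimited row, in the style described for cell2mutation.txt.
print(",".join(str(v) for v in vec))
```

With the real index file, the same function would yield a vector of length 3,008 that can be written as one row of cell2mutation.txt.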
The pre-trained DrugCell v1.0 model and the drug response data for the 509,294 (cell line, drug) pairs used to train it are available at http://drugcell.ucsd.edu/downloads.
Required input files:
1. Cell feature files: gene2ind.txt, cell2ind.txt, cell2mutation.txt
* gene2ind.txt: use the gene2ind.txt file provided in this repository.
* cell2ind.txt: a tab-delimited file where the 1st column is the index of each cell and the 2nd column is its name (genotype).
* cell2mutation.txt: a comma-delimited file where each row has 3,008 binary values indicating whether each gene is mutated (1) or not (0).
The column index of each gene should match its index in gene2ind.txt, and the line number of each row should
match the index of the corresponding cell in cell2ind.txt.
2. Drug feature files: drug2ind.txt, drug2fingerprint.txt
* drug2ind.txt: a tab-delimited file where the 1st column is the index of each drug and the 2nd column is
its identifier (e.g., SMILES representation or name). The drug identifiers
should match those in the drug2fingerprint.txt file.
* drug2fingerprint.txt: a comma-delimited file where each row has 2,048 binary values that
together form the Morgan fingerprint representation of one drug.
The line number of each row should match the index of the corresponding drug in drug2ind.txt.
3. Test data file: drugcell_test.txt
* A tab-delimited file containing all data points for which you want to estimate drug response.
The 1st column is the cell identifier (genotype) and the 2nd column is the drug
identifier.
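Since the files above must agree with each other, a quick consistency check before running predictions can save time. The following is a minimal sketch under the formats described above; check_inputs is a hypothetical helper, not part of the DrugCell scripts, and it works on lists of lines so that it can be tried on toy data.

```python
def check_inputs(cell2ind_lines, mutation_lines, n_genes, test_lines, drug_names):
    """Return a list of problems; an empty list means the inputs look consistent."""
    problems = []

    # cell2ind: 'index<TAB>cell name' per line.
    cells = {}
    for line in cell2ind_lines:
        idx, name = line.rstrip("\n").split("\t")
        cells[name] = int(idx)

    # cell2mutation: one row per cell, n_genes comma-delimited 0/1 values.
    if len(mutation_lines) != len(cells):
        problems.append(
            f"cell2mutation has {len(mutation_lines)} rows, cell2ind has {len(cells)} cells"
        )
    for row_no, line in enumerate(mutation_lines):
        values = line.strip().split(",")
        if len(values) != n_genes:
            problems.append(f"mutation row {row_no}: expected {n_genes} values, found {len(values)}")
        elif any(v not in ("0", "1") for v in values):
            problems.append(f"mutation row {row_no}: non-binary value")

    # Test data: 1st column is the cell, 2nd column is the drug.
    for line_no, line in enumerate(test_lines):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            problems.append(f"test line {line_no}: fewer than 2 columns")
            continue
        if fields[0] not in cells:
            problems.append(f"test line {line_no}: unknown cell {fields[0]!r}")
        if fields[1] not in drug_names:
            problems.append(f"test line {line_no}: unknown drug {fields[1]!r}")
    return problems


# Toy data with three deliberate errors.
problems = check_inputs(
    cell2ind_lines=["0\tcellA", "1\tcellB"],
    mutation_lines=["1,0,1", "0,2,0"],          # second row has a non-binary value
    n_genes=3,                                   # 3,008 for the real files
    test_lines=["cellA\tdrugX", "cellC\tdrugY"], # cellC and drugY are unknown
    drug_names={"drugX"},
)
for p in problems:
    print(p)
```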
To load the pre-trained model used for the analyses in our manuscript and make predictions for (cell, drug) pairs of
interest, do the following:
Make sure you have gene2ind.txt, cell2ind.txt, cell2mutation.txt, drug2ind.txt,
drug2fingerprint.txt, and your test data file in the proper format (examples are provided in
the data and sample folders).
To run the model on a GPU server, execute the following:
python predict_drugcell.py -gene2id gene2ind.txt \
    -cell2id cell2ind.txt \
    -drug2id drug2ind.txt \
    -genotype cell2mutation.txt \
    -fingerprint drug2fingerprint.txt \
    -predict testdata.txt \
    -hidden <path_to_directory_to_store_hidden_values> \
    -result <path_to_directory_to_store_prediction_results> \
    -load <path_to_model_file> \
    -cuda <GPU_unit_to_use>  # optional
To load and test the DrugCell model on a CPU, run predict_drugcell_cpu.py
(instead of predict_drugcell.py) with the same set of parameters as in the GPU command. The -cuda option is
not available in this scenario.
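If you call the prediction scripts from Python, the choice between the GPU and CPU variants can be captured in a small wrapper. The sketch below only assembles the argument list from the flags documented above. The values passed to -hidden, -result, and -load are placeholder paths chosen for illustration, and build_predict_command is a hypothetical helper.

```python
import shlex


def build_predict_command(use_gpu, gpu_id=0):
    """Assemble the argument list for the GPU or CPU prediction script."""
    script = "predict_drugcell.py" if use_gpu else "predict_drugcell_cpu.py"
    cmd = [
        "python", script,
        "-gene2id", "gene2ind.txt",
        "-cell2id", "cell2ind.txt",
        "-drug2id", "drug2ind.txt",
        "-genotype", "cell2mutation.txt",
        "-fingerprint", "drug2fingerprint.txt",
        "-predict", "testdata.txt",
        "-hidden", "hidden/",   # placeholder directory for hidden values
        "-result", "result/",   # placeholder directory for predictions
        "-load", "model.pt",    # placeholder path to the model file
    ]
    if use_gpu:
        # -cuda is only accepted by the GPU script.
        cmd += ["-cuda", str(gpu_id)]
    return cmd


print(shlex.join(build_predict_command(use_gpu=False)))
print(shlex.join(build_predict_command(use_gpu=True, gpu_id=1)))
```

To actually launch a run, the returned list can be passed to subprocess.run once the placeholder paths have been replaced with real ones.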
To train a new DrugCell model using a custom data set, first make sure that you have
a proper virtual environment set up. Also make sure that you have all the required files
to run the training scripts:
Cell feature files: gene2ind.txt, cell2ind.txt, cell2mutation.txt
Drug feature files: drug2ind.txt, drug2fingerprint.txt
Training data file: drugcell_train.txt
Validation data file: drugcell_val.txt
Ontology (hierarchy) file: drugcell_ont.txt
A tab-delimited file that contains the ontology (hierarchy) defining the structure of the branch
of a DrugCell model that encodes the genotypes. The first column is always a term (subsystem or pathway),
and the second column is a term or a gene.
The third column should be set to "default" when the line represents a link between terms, and to
"gene" when the line represents an annotation link between a term and a gene.
The following is an example describing a sample hierarchy.
GO:0045834 GO:0045923 default
GO:0045834 GO:0043552 default
GO:0045923 AKT2 gene
GO:0045923 IL1B gene
GO:0043552 PIK3R4 gene
GO:0043552 SRC gene
GO:0043552 FLT1 gene
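To make the three-column format concrete, the sketch below parses the sample hierarchy above into term-to-term and term-to-gene maps and finds the root term. It is an illustration only, not the parser used by train_drugcell.py. The function names parse_ontology and find_roots are hypothetical.

```python
from collections import defaultdict


def parse_ontology(lines):
    """Split ontology lines into term -> child terms and term -> genes maps."""
    term_children = defaultdict(list)
    term_genes = defaultdict(set)
    for line in lines:
        parts = line.split()  # handles tabs as well as spaces
        if not parts:
            continue
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got {len(parts)}: {line!r}")
        parent, child, kind = parts
        if kind == "default":
            term_children[parent].append(child)   # term -> term link
        elif kind == "gene":
            term_genes[parent].add(child)         # term -> gene annotation
        else:
            raise ValueError(f"unknown link type {kind!r}")
    return term_children, term_genes


def find_roots(term_children, term_genes):
    """Terms that never appear as the child of another term."""
    parents = set(term_children) | set(term_genes)
    children = {c for kids in term_children.values() for c in kids}
    return sorted(parents - children)


sample = [
    "GO:0045834\tGO:0045923\tdefault",
    "GO:0045834\tGO:0043552\tdefault",
    "GO:0045923\tAKT2\tgene",
    "GO:0045923\tIL1B\tgene",
    "GO:0043552\tPIK3R4\tgene",
    "GO:0043552\tSRC\tgene",
    "GO:0043552\tFLT1\tgene",
]
term_children, term_genes = parse_ontology(sample)
roots = find_roots(term_children, term_genes)
print(roots)                              # ['GO:0045834']
print(sorted(term_genes["GO:0043552"]))   # ['FLT1', 'PIK3R4', 'SRC']
```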
There are a few optional parameters that you can provide in addition to the input files:
-model: the name of the directory in which to store the trained models. The default
is "MODEL" in the current working directory.
-genotype_hiddens: the number of neurons assigned to each subsystem in the hierarchy.
The default is 6.
-drug_hiddens: a string listing the number of neurons in each layer of the drug-encoding branch
of DrugCell. The numbers should be comma-delimited. The default value is "100,50,6";
with this default,
the drug branch of the resulting DrugCell model is a fully connected neural network with 3 layers
consisting of 100, 50, and 6 neurons.
-final_hiddens: the number of neurons in the top layer of DrugCell that combines
the genotype-encoding and the drug-encoding branches. The default is 6.
-epoch: the number of epochs to run during the training phase. The default is 300.
-batchsize: the number of samples processed in each batch. The default is 5000.
You may increase this number to speed up the training process, within the memory capacity
of your GPU server.
-cuda: the ID of the GPU unit to use for model training. The default
is GPU 0.
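The -drug_hiddens string can be read as a list of layer widths. The sketch below shows how "100,50,6" maps to the (input, output) sizes of three fully connected layers. It assumes that the first layer takes the 2,048-bit fingerprint as input. The helper drug_layer_shapes is hypothetical and is not taken from the training script.

```python
def drug_layer_shapes(drug_hiddens, fingerprint_len=2048):
    """Turn a comma-delimited width string into (in_dim, out_dim) pairs."""
    sizes = [int(s) for s in drug_hiddens.split(",")]
    if any(s <= 0 for s in sizes):
        raise ValueError("layer sizes must be positive integers")

    shapes = []
    in_dim = fingerprint_len  # assumed input: the Morgan fingerprint
    for out_dim in sizes:
        shapes.append((in_dim, out_dim))
        in_dim = out_dim      # each layer feeds the next one
    return shapes


shapes = drug_layer_shapes("100,50,6")
print(shapes)  # [(2048, 100), (100, 50), (50, 6)]
```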
Finally, to train a DrugCell model, execute a command line similar to the example provided in
sample/commandline_cuda.sh:
python -u train_drugcell.py -onto drugcell_ont.txt \
    -gene2id gene2ind.txt \
    -cell2id cell2ind.txt \
    -drug2id drug2ind.txt \
    -genotype cell2mutation.txt \
    -fingerprint drug2fingerprint.txt \
    -train drugcell_train.txt \
    -test drugcell_val.txt \
    -model ./MODEL \
    -genotype_hiddens 6 \
    -drug_hiddens "100,50,6" \
    -final_hiddens 6 \
    -epoch 100 \
    -batchsize 5000 \
    -cuda 1
Three subsets of our training data are provided as toy examples: drugcell_train.txt, drugcell_test.txt, and drugcell_val.txt contain 10,000, 1,000, and 1,000 (cell line, drug) pairs, respectively, along with the corresponding drug response (area under the dose-response curve).