## Data Explanation

For more detailed information, please refer to the [DeepDTA article](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245).

### Similarity files

For each dataset, there are two similarity files: drug-drug and target-target similarities.
* Drug-drug similarities were obtained via PubChem structure clustering.
* Target-target similarities were obtained via Smith-Waterman (S-W) similarity.

These files were used to reproduce the results of two other methods, [(Pahikkala et al., 2017)](https://academic.oup.com/bib/article/16/2/325/246479) and [(He et al., 2017)](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z), and also for some experiments with the DeepDTA model; for details, please refer to the [paper](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245).
* The original Davis data and more explanation can be found [here](http://staff.cs.utu.fi/~aatapa/data/DrugTarget/).
* The original KIBA data and more explanation can be found [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z).

### Binding affinity files

* For the Davis dataset, the standard value is Kd in nM. In the article, we used the transformation below:

<a href="https://www.codecogs.com/eqnedit.php?latex=pK_{d}=-log_{10}\frac{K_d}{1e9}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?pK_{d}=-log_{10}\frac{K_d}{1e9}" title="pK_{d}=-log_{10}\frac{K_d}{1e9}" /></a>

* For the KIBA dataset, the standard value is the KIBA score. The two versions of the binding affinity value txt files correspond to the original values and the transformed values ([more information here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z)). In the article, we used the transformed form.

* nan values indicate that there is no experimental value for that drug-target pair.
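
The Davis transformation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the preprocessing code used for the article: the function name `kd_nm_to_pkd` and the sample values are hypothetical. NaN entries pass through unchanged, so drug-target pairs without a measurement stay missing.

```python
import numpy as np

def kd_nm_to_pkd(kd_nm):
    """Convert Kd values given in nM to pKd = -log10(Kd / 1e9)."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    return -np.log10(kd_nm / 1e9)

# Hypothetical Kd values in nM; NaN marks a pair with no measurement.
kd = np.array([[10000.0, 1.0],
               [np.nan, 100.0]])
pkd = kd_nm_to_pkd(kd)
print(pkd)
```

With this convention a smaller Kd (stronger binding) gives a larger pKd.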

### Train and test folds
There are two files for each dataset: a train fold file and a test fold file. Both files store the positions of the binding affinity values within the binding affinity matrices given in the text files.
* Since we performed 5-fold cross-validation, each fold file contains five different sets of positions.
* The test set is the same for all five training sets.
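
As a rough illustration of this layout, the sketch below uses small made-up position lists rather than the real fold files. It assumes, as the loading example later in this README suggests, that the test fold is a single list of positions and that the train fold file holds five lists. The check that no test position appears in a training set is a sanity check one might add, not something the original scripts are known to do.

```python
# Toy stand-ins for the contents of the fold files (hypothetical values).
test_fold = [0, 5, 9]
train_folds = [[1, 2], [3, 4], [6, 7], [8, 10], [11, 12]]

# Five sets of positions, one per cross-validation fold.
assert len(train_folds) == 5

# The same test positions are held out from every training set.
test_positions = set(test_fold)
overlaps = [test_positions & set(fold) for fold in train_folds]
print(all(len(o) == 0 for o in overlaps))
```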

### For using the folds
* Load affinity matrix Y

```python
import pickle
import numpy as np

# If loading fails under Python 3 (e.g. the pickle was written with Python 2), try:
# Y = pickle.load(open("Y", "rb"), encoding="latin1")
with open("Y", "rb") as f:
    Y = pickle.load(f)
# Row (drug) and column (protein) indices of every measured (non-NaN) entry
label_row_inds, label_col_inds = np.where(~np.isnan(Y))
```

* label_row_inds: drug indices for the corresponding affinity matrix positions (flattened)
  e.g. the 36275th point in the KIBA Y matrix indicates the 364th drug (same order as in the SMILES file)
```python
label_row_inds[36275]
```

* label_col_inds: protein indices for the corresponding affinity matrix positions (flattened)
  e.g. the 36275th point in the KIBA Y matrix indicates the 120th protein (same order as in the protein sequence file)
```python
label_col_inds[36275]
```
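
To see how a flattened position maps back to a (drug, protein) pair, here is a small sketch on a made-up 2x3 matrix instead of the real KIBA data. `np.where` walks the matrix in row-major order and skips the NaN entries, so position `k` refers to the k-th measured value, counting from zero.

```python
import numpy as np

# Toy affinity matrix: 2 drugs x 3 proteins, NaN = no measurement.
Y_toy = np.array([[1.0, np.nan, 2.0],
                  [np.nan, 3.0, 4.0]])

rows, cols = np.where(~np.isnan(Y_toy))

# Position 2 is the third measured entry in row-major order.
k = 2
drug, protein = rows[k], cols[k]
print(drug, protein, Y_toy[drug, protein])
```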

* You can then load the fold files as follows:
```python
import json

# yourdir: path to the dataset directory, ending with a path separator
with open(yourdir + "folds/test_fold_setting1.txt") as f:
    test_fold = json.load(f)
with open(yourdir + "folds/train_fold_setting1.txt") as f:
    train_folds = json.load(f)

test_drug_indices = label_row_inds[test_fold]
test_protein_indices = label_col_inds[test_fold]
```

Remember that `train_folds` contains an array of 5 lists, each of which corresponds to a training set.
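
Putting the pieces together, a loop over the five training sets might look like the sketch below. It runs on toy data so that it is self-contained: `Y`, `train_folds`, and `test_fold` here are small hypothetical stand-ins for the objects loaded above, and any further splitting of a training set (for example into training and validation parts) is left out.

```python
import numpy as np

# Toy stand-ins (hypothetical): 3 drugs x 4 proteins with two missing values.
Y = np.array([[5.0, np.nan, 6.1, 7.2],
              [4.8, 5.5, np.nan, 6.0],
              [7.7, 5.1, 5.9, 6.3]])
label_row_inds, label_col_inds = np.where(~np.isnan(Y))  # 10 measured pairs

test_fold = [8, 9]
train_folds = [[0], [1, 2], [3, 4], [5, 6], [7]]

for fold_id, positions in enumerate(train_folds):
    drug_idx = label_row_inds[positions]     # indices in the drug (SMILES) order
    protein_idx = label_col_inds[positions]  # indices in the protein sequence order
    affinities = Y[drug_idx, protein_idx]    # measured values for these pairs
    print(fold_id, drug_idx, protein_idx, affinities)

# The same test positions are used regardless of the training set.
test_affinities = Y[label_row_inds[test_fold], label_col_inds[test_fold]]
```

Because the positions index only the non-NaN entries, the selected affinities never contain missing values.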