## Data Explanation

For more detailed information, please refer to the [DeepDTA article](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245).

### Similarity files

For each dataset, there are two similarity files: drug-drug and target-target similarities.
*  Drug-drug similarities were obtained via PubChem structure clustering.
*  Target-target similarities were obtained via Smith-Waterman (S-W) similarity.

These files were used to reproduce the results of two other methods, [(Pahikkala et al., 2017)](https://academic.oup.com/bib/article/16/2/325/246479) and [(He et al., 2017)](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z), and also for some experiments with the DeepDTA model; see the [paper](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245) for details.
*  The original Davis data and more explanation can be found [here](http://staff.cs.utu.fi/~aatapa/data/DrugTarget/).
*  The original KIBA data and more explanation can be found [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z).

### Binding affinity files

*  For the Davis dataset, the standard value is Kd in nM. In the article, we used the transformation below:

<a href="https://www.codecogs.com/eqnedit.php?latex=pK_{d}=-log_{10}\frac{K_d}{1e9}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?pK_{d}=-log_{10}\frac{K_d}{1e9}" title="pK_{d}=-log_{10}\frac{K_d}{1e9}" /></a>

* For the KIBA dataset, the standard value is the KIBA score. The two versions of the binding affinity value txt files correspond to the original values and the transformed values ([more information here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z)). In the article, we used the transformed form.

* `nan` values indicate that there is no experimental value for that drug-target pair.
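
The Davis transformation above can be sketched in a few lines of NumPy. This is only an illustration of the formula, not code from the DeepDTA repository: the function name `kd_nm_to_pkd` is a hypothetical helper, and it assumes the input values are Kd in nM, with `nan` marking pairs that have no measurement.

```python
import numpy as np

def kd_nm_to_pkd(kd_nm):
    """Convert Kd values given in nM to pKd = -log10(Kd / 1e9).

    Hypothetical helper for illustration; NaN inputs stay NaN.
    """
    kd_nm = np.asarray(kd_nm, dtype=float)
    # Dividing by 1e9 converts nM to M before taking the negative log.
    return -np.log10(kd_nm / 1e9)

# 10000 nM corresponds to pKd 5, 1 nM to pKd 9, and NaN stays NaN.
print(kd_nm_to_pkd([10000.0, 1.0, np.nan]))
```

Because `nan` propagates through the logarithm, missing drug-target pairs remain missing after the transformation.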

### Train and test folds
There are two files for each dataset: a train fold file and a test fold file. Both files store the positions of binding affinity values in the binding affinity matrices given in the text files.
*  Since we performed 5-fold cross-validation, each fold file contains five different sets of positions.
*  The test set is the same for all five training sets.

### Using the folds
*   Load the affinity matrix Y:

```python
import pickle
import numpy as np

with open("Y", "rb") as f:
    Y = pickle.load(f)  # Python 3 alternative: pickle.load(f, encoding="latin1")

# Row (drug) and column (protein) indices of all non-NaN entries of Y
label_row_inds, label_col_inds = np.where(~np.isnan(Y))
```

*  `label_row_inds`: drug indices for the corresponding (flattened) affinity matrix positions

    e.g. the 36275th point in the KIBA Y matrix indicates the 364th drug (in the same order as the SMILES file):
    ```python
    label_row_inds[36275]
    ```

*  `label_col_inds`: protein indices for the corresponding (flattened) affinity matrix positions

    e.g. the 36275th point in the KIBA Y matrix indicates the 120th protein (in the same order as the protein sequence file):
    ```python
    label_col_inds[36275]
    ```

*   You can then load the fold files as follows:
    ```python
    import json

    # yourdir: path prefix of the dataset directory (set this yourself)
    with open(yourdir + "folds/test_fold_setting1.txt") as f:
        test_fold = json.load(f)
    with open(yourdir + "folds/train_fold_setting1.txt") as f:
        train_folds = json.load(f)

    # Map the test positions back to drug and protein indices
    test_drug_indices = label_row_inds[test_fold]
    test_protein_indices = label_col_inds[test_fold]
    ```

    Note that `train_folds` is a list of five lists, each of which corresponds to one training set.
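
To make the mapping from fold positions to affinity values concrete, here is a minimal, self-contained sketch on a made-up matrix. The toy `Y`, the fold lists, and the helper `gather` are all assumptions for illustration; with the real data you would use the loaded `Y`, `test_fold`, and `train_folds` instead.

```python
import numpy as np

# Toy affinity matrix (3 drugs x 4 proteins); NaN marks unmeasured pairs.
Y = np.array([
    [5.0, np.nan, 7.2, 6.1],
    [np.nan, 8.3, 5.5, np.nan],
    [6.6, 7.0, np.nan, 5.9],
])

# Indices of the 8 measured entries, in row-major order.
label_row_inds, label_col_inds = np.where(~np.isnan(Y))

# Hypothetical folds: positions into the flattened list of measured entries.
train_folds = [[0, 1], [2, 3], [4, 5]]
test_fold = [6, 7]

def gather(positions):
    """Return drug indices, protein indices, and affinities for fold positions."""
    positions = np.asarray(positions, dtype=int)
    rows = label_row_inds[positions]
    cols = label_col_inds[positions]
    return rows, cols, Y[rows, cols]

for i, fold in enumerate(train_folds):
    rows, cols, values = gather(fold)
    print(f"train set {i}: drugs={rows.tolist()} proteins={cols.tolist()}")

test_rows, test_cols, test_values = gather(test_fold)
print("test affinities:", test_values.tolist())
```

The key point is that fold entries are not raw matrix coordinates: they index into `label_row_inds` and `label_col_inds`, which in turn point at the measured cells of `Y`.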