Diff of /data/README.md [000000] .. [8af014]

--- a
+++ b/data/README.md
@@ -0,0 +1,67 @@
+
+## Data Explanation
+
+For more detailed information, please refer to the [DeepDTA article](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245).
+
+### Similarity files
+
+For each dataset, there are two similarity files: drug-drug similarities and target-target similarities.
+*  Drug-drug similarities were obtained via PubChem structure clustering.
+*  Target-target similarities were obtained via Smith-Waterman (S-W) similarity.
+
+These files were used to reproduce the results of two other methods, [(Pahikkala et al., 2015)](https://academic.oup.com/bib/article/16/2/325/246479) and [(He et al., 2017)](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z), and for some experiments with the DeepDTA model. For details, please refer to the [paper](https://academic.oup.com/bioinformatics/article/34/17/i821/5093245).
+*  The original Davis data and more explanation can be found [here](http://staff.cs.utu.fi/~aatapa/data/DrugTarget/).
+*  The original KIBA data and more explanation can be found [here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z).
+
+### Binding affinity files
+
+*  For the Davis dataset, the standard value is Kd in nM. In the article, we used the following transformation:
+
+<a href="https://www.codecogs.com/eqnedit.php?latex=pK_{d}=-log_{10}\frac{K_d}{1e9}" target="_blank"><img src="https://latex.codecogs.com/gif.latex?pK_{d}=-log_{10}\frac{K_d}{1e9}" title="pK_{d}=-log_{10}\frac{K_d}{1e9}" /></a>
+
+* For the KIBA dataset, the standard value is the KIBA score. The two versions of the binding affinity txt file correspond to the original values and the transformed values ([more information here](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z)). In the article, we used the transformed form.
+
+* `nan` values indicate that there is no experimental value for that drug-target pair.
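
The Kd transformation given above for the Davis dataset can be sketched in a few lines of NumPy. This is only an illustration: the function name `kd_nm_to_pkd` and the sample values are hypothetical and do not come from the repository.

```python
import numpy as np

def kd_nm_to_pkd(kd_nm):
    """Convert Kd values in nM to pKd = -log10(Kd / 1e9)."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    # Dividing by 1e9 converts nM to M; NaN (unmeasured) entries stay NaN.
    return -np.log10(kd_nm / 1e9)

# Hypothetical Kd values in nM; the NaN marks an unmeasured pair.
pkd = kd_nm_to_pkd([10000.0, 1.0, np.nan])
print(pkd)
```

With this formula, a Kd of 10000 nM maps to a pKd of 5 and a Kd of 1 nM maps to a pKd of 9, so larger pKd values mean stronger binding.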
+
+
+### Train and test folds
+There are two files for each dataset: a train fold file and a test fold file. Both files store positions that refer to the binding affinity values in the binding affinity matrices given in the text files.
+*  Since we performed 5-fold cross-validation, the train fold file contains five different sets of positions.
+*  The test set is the same for all five training sets.
+
+### Using the folds
+*   Load the affinity matrix Y and find the positions of its measured (non-NaN) entries:
+
+```python
+import pickle
+import numpy as np
+
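+# If loading fails under Python 3 (e.g. a pickle written by Python 2), try:
+#   Y = pickle.load(f, encoding='latin1')
+with open("Y", "rb") as f:
+    Y = pickle.load(f)
+# Row (drug) and column (protein) indices of all measured (non-NaN) entries
+label_row_inds, label_col_inds = np.where(~np.isnan(Y))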
+```
+
+*  label_row_inds: drug indices for the corresponding affinity matrix positions (flattened)  
+    e.g. the 36275th point in the KIBA Y matrix corresponds to the 364th drug (same order as in the SMILES file)
+    ```python
+    label_row_inds[36275]
+    ```
+
+*  label_col_inds: protein indices for the corresponding affinity matrix positions (flattened)
+
+    e.g. the 36275th point in the KIBA Y matrix corresponds to the 120th protein (same order as in the protein sequence file)
+    ```python
+    label_col_inds[36275]
+    ```
+    
+*   You can then load the fold files as follows:
+    ```python
+    import json
+    test_fold = json.load(open(yourdir + "folds/test_fold_setting1.txt"))
+    train_folds = json.load(open(yourdir + "folds/train_fold_setting1.txt"))
+    
+    test_drug_indices = label_row_inds[test_fold]
+    test_protein_indices = label_col_inds[test_fold]
+    
+    ```
+    
+    Remember that ```train_folds``` contains an array of 5 lists, each of which corresponds to a training set.
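
To show how the fold positions map back to drug-protein pairs and affinity values, here is a minimal, self-contained sketch. The helper `fold_to_pairs` and the toy matrix and folds are hypothetical stand-ins for the real files. The sketch assumes, as in the snippets above, that fold positions index into the list of non-NaN entries of Y.

```python
import numpy as np

def fold_to_pairs(Y, fold):
    """Map fold positions to (drug indices, protein indices, affinities)."""
    # Row/column indices of every measured (non-NaN) entry, in row-major order.
    label_row_inds, label_col_inds = np.where(~np.isnan(Y))
    fold = np.asarray(fold, dtype=int)
    rows = label_row_inds[fold]
    cols = label_col_inds[fold]
    return rows, cols, Y[rows, cols]

# Toy stand-ins for the real files (hypothetical values, for illustration only).
Y = np.array([[5.0, np.nan, 7.2],
              [np.nan, 6.1, 5.5]])
train_folds = [[0, 1], [2, 3]]  # the real file holds five such lists
test_fold = [3]

# Each entry of train_folds is handled the same way as test_fold.
for i, fold in enumerate(train_folds):
    drugs, proteins, affinities = fold_to_pairs(Y, fold)
    print(i, drugs.tolist(), proteins.tolist(), affinities.tolist())
```

With the real data, `Y`, `train_folds` and `test_fold` would come from the pickle and JSON files loaded above. The same helper can then be called once per training list.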