## Dataset Tutorial

Let's first load several packages from DeepPurpose.

In [1]:
# if you are using source version, uncomment the next two lines:
#import os
#os.chdir('../')

from DeepPurpose import utils, DTI, dataset

DeepPurpose mainly works with three types of input data:
1. The target sequence to be repurposed, along with its name.
2. A drug repurposing library.
3. Training drug-target pairs, along with their binding scores.

There are two ways to load the data.

The first is to use the DeepPurpose.dataset library loaders, which are very simple to use and preprocess the data for you. The supported datasets are listed here:
https://github.com/kexinhuang12345/DeepPurpose/blob/master/README.md#data

The second way is to read from local files, which should follow our data format, as illustrated below.

Here are some examples. First, let's show how to load some target sequences for COVID19.

In [2]:
target, target_name = dataset.load_SARS_CoV_Protease_3CL()
print('The target is: ' + target)
print('The target name is: ' + target_name)

The target is: SGFKKLVSPSSAVEKCIVSVSYRGNNLNGLWLGDSIYCPRHVLGKFSGDQWGDVLNLANNHEFEVVTQNGVTLNVVSRRLKGAVLILQTAVANAETPKYKFVKANCGDSFTIACSYGGTVIGLYPVTMRSNGTIRASFLAGACGSVGFNIEKGVVNFFYMHHLELPNALHTGTDLMGEFYGGYVDEEVAQRVPPDNLVTNNIVAWLYAAIISVKESSFSQPKWLESTTVSIEDYNRWASDNGFTPFSTSTAITKLSAITGVDVCKLLRTIMVKSAQWGSDPILGQYNFEDELTPESVFNQVGGVRLQ
The target name is: SARS-CoV 3CL Protease


In [3]:
target, target_name = dataset.load_SARS_CoV2_Protease_3CL()
print('The target is: ' + target)
print('The target name is: ' + target_name)

The target is: SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ
The target name is: SARS-CoV2 3CL Protease


We also support reading from local txt files. For a target sequence, we assume the file has a single line: the target name first, then a space, followed by the target amino acid sequence.

RNA_polymerase_SARS_CoV2_target_seq.txt:

RNA_polymerase_SARS_CoV2 SADAQS...PHTVLQ 

In [4]:
target, target_name = dataset.read_file_target_sequence('./toy_data/RNA_polymerase_SARS_CoV2_target_seq.txt')
print('The target is: ' + target)
print('The target name is: ' + target_name)

The target is: SADAQSFLNRVCGVSAARLTPCGTGTSTDVVYRAFDIYNDKVAGFAKFLKTNCCRFQEKDEDDNLIDSYFVVKRHTFSNYQHEETIYNLLKDCPAVAKHDFFKFRIDGDMVPHISRQRLTKYTMADLVYALRHFDEGNCDTLKEILVTYNCCDDDYFNKKDWYDFVENPDILRVYANLGERVRQALLKTVQFCDAMRNAGIVGVLTLDNQDLNGNWYDFGDFIQTTPGSGVPVVDSYYSLLMPILTLTRALTAESHVDTDLTKPYIKWDLLKYDFTEERLKLFDRYFKYWDQTYHPNCVNCLDDRCILHCANFNVLFSTVFPPTSFGPLVRKIFVDGVPFVVSTGYHFRELGVVHNQDVNLHSSRLSFKELLVYAADPAMHAASGNLLLDKRTTCFSVAALTNNVAFQTVKPGNFNKDFYDFAVSKGFFKEGSSVELKHFFFAQDGNAAISDYDYYRYNLPTMCDIRQLLFVVEVVDKYFDCYDGGCINANQVIVNNLDKSAGFPFNKWGKARLYYDSMSYEDQDALFAYTKRNVIPTITQMNLKYAISAKNRARTVAGVSICSTMTNRQFHQKLLKSIAATRGATVVIGTSKFYGGWHNMLKTVYSDVENPHLMGWDYPKCDRAMPNMLRIMASLVLARKHTTCCSLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNICQAVTANVNALLSTDGNKIADKYVRNLQHRLYECLYRNRDVDTDFVNEFYAYLRKHFSMMILSDDAVVCFNSTYASQGLVASIKNFKSVLYYQNNVFMSEAKCWTETDLTKGPHEFCSQHTMLVKQGDDYVYLPYPDPSRILGAGCFVDDIVKTDGTLMIERFVSLAIDAYPLTKHPNQEYADVFHLYLQYIRKLHDELTGHMLDMYSVMLTNDNTSRYWEPEFYEAMYTPHTVLQ
The target name is: RNA_polymerase_SARS_CoV2
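
To make the one-line file format concrete, here is a minimal sketch of a parser for it. It is not the DeepPurpose implementation: `read_target_file`, the file name `toy_target_seq.txt`, and the shortened toy sequence are all hypothetical, and the only assumption taken from the text above is that the name and the sequence are separated by whitespace.

```python
import os
import tempfile


def read_target_file(path):
    """Parse a one-line file of the form '<target name> <amino acid sequence>'.

    Hypothetical helper for illustration; not part of DeepPurpose.
    """
    with open(path) as handle:
        line = handle.readline().strip()
    # Split on the first run of whitespace: the name comes first, the sequence second.
    name, sequence = line.split(maxsplit=1)
    # Return (sequence, name) to mirror the order used by the loader call above.
    return sequence, name


# Write a toy file (made-up name, shortened sequence) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_target_seq.txt")
    with open(demo_path, "w") as handle:
        handle.write("toy_protease SGFRKMAFPSGKVEG\n")
    target, target_name = read_target_file(demo_path)

print("The target is: " + target)
print("The target name is: " + target_name)
```

Because the split happens only at the first whitespace, a target name in this format cannot itself contain spaces, which is presumably why the example above uses underscores.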


Now, let's move on to the drug repurposing library. We currently support an antiviral drugs library and the Broad Repurposing Hub library.

In [5]:
X_repurpose, Drug_Names, Drug_CIDs = dataset.load_antiviral_drugs()

In [6]:
X_repurpose[:3]

array(['C1CC1NC2=C3C(=NC(=N2)N)N(C=N3)C4CC(C=C4)CO',
       'C1=NC2=C(N1COCCO)NC(=NC2=O)N',
       'C1=NC(=C2C(=N1)N(C=N2)CCOCP(=O)(O)O)N'], dtype=object)

In [7]:
Drug_Names[:3]

array(['Abacavir', 'Aciclovir', 'Adefovir'], dtype=object)

In [8]:
Drug_CIDs[:3]

array([441300,   2022,  60172])

In the above example, the data is downloaded from the cloud and saved into the default folder *'./data'*. You can also specify your own PATH with *dataset.load_antiviral_drugs(PATH)*.

You can also skip the PubChem CID output by setting *dataset.load_antiviral_drugs(no_cid = True)*. This saves a line in the one line mode of DeepPurpose, where the function expects only X_repurpose and Drug_Names.

Similarly for Broad Repurposing Hub, we can do the same:

In [9]:
X_drug, Drug_Names, Drug_CIDs = dataset.load_broad_repurposing_hub()

In [10]:
X_drug[:3]

array(['CO\\N=C(\\C(=O)NC1C2SCC(CSc3nnnn3C)=C(N2C1=O)C(O)=O)c1csc(N)n1',
       'CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O',
       'C[C@H]1CN(C[C@@H](C)N1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1'],
      dtype=object)

In [11]:
Drug_Names[:3]

array(['7-[[(2E)-2-(2-Amino-1,3-thiazol-4-yl)-2-methoxyiminoacetyl]amino]-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-thia-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid',
       'Rufloxacin', 'Sparfloxacin'], dtype=object)

This first downloads the file from the cloud to the local default *'./data'* folder; alternatively, you can pass in your own data folder.

Note that in the one line mode (*oneliner.repurpose()*), if you don't specify any *X_repurpose* library, the method automatically uses the Broad Repurposing Hub data and uses the PubChem CIDs as the drug names, since some drug names (as you can see from the above examples) are far too long.

Now, let's show how you can load your own library from a txt file!

We assume the txt file consists of the following structure:

repurposing_library.txt

Rufloxacin CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O\
Sparfloxacin C[C@H]1CN(C[C@@H](C)N1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1

In [13]:
X_drug, Drug_Names = dataset.read_file_repurposing_library('./toy_data/repurposing_data_examples.txt')

In [14]:
X_drug

array(['CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O',
       'C[C@H]1CN(CC@@HN1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1'],
      dtype='<U58')

In [15]:
Drug_Names

array(['Rufloxacin', 'Sparfloxacin'], dtype='<U12')
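
The library file format can be sketched the same way. This is an illustration only, not the DeepPurpose loader: `read_library_file` and the file name `toy_library.txt` are hypothetical, and the sketch assumes one drug per line with the name first and the SMILES string second, separated by whitespace. It returns plain Python lists, whereas the loader above returns NumPy arrays.

```python
import os
import tempfile


def read_library_file(path):
    """Parse lines of the form '<drug name> <SMILES>' into two parallel lists.

    Hypothetical helper for illustration; not part of DeepPurpose.
    """
    smiles, names = [], []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            names.append(fields[0])
            smiles.append(fields[1])
    # SMILES first, names second, mirroring the loader call above.
    return smiles, names


# Write the two example drugs from the format description and read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_library.txt")
    with open(demo_path, "w") as handle:
        handle.write("Rufloxacin CN1CCN(CC1)c1c(F)cc2c3c1SCCn3cc(C(O)=O)c2=O\n")
        handle.write(
            "Sparfloxacin C[C@H]1CN(C[C@@H](C)N1)c1c(F)c(N)c2c(c1F)n(cc(C(O)=O)c2=O)C1CC1\n"
        )
    X_drug_demo, drug_names_demo = read_library_file(demo_path)

print(drug_names_demo)
print(X_drug_demo)
```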

Okay, let's now move to the final part, the training dataset! In general, we expect two types of training datasets.

1. The drug-target pairs with the binding score or the interaction 1/0 label.
2. The bioassay data where there is only one target and many drugs are screened.

For the first one, we provide three data loaders for publicly available drug-target interaction datasets: KIBA, DAVIS, and BindingDB. Let's first talk about DAVIS.

In [16]:
X_drugs, X_targets, y = dataset.load_process_DAVIS(path = './data', binary = False, convert_to_log = True, threshold = 30)

Beginning Processing...
Beginning to extract zip file...
Default set to logspace (nM -> p) for easier regression
Done!


In [18]:
X_drugs[:2]

array(['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N',
       'CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N'],
      dtype='<U92')

In [20]:
X_targets[:1]

array(['MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQVTVDEVLAEGGFAIVFLVRTSNGMKCALKRMFVNNEHDLQVCKREIQIMRDLSGHKNIVGYIDSSINNVSSGDVWEVLILMDFCRGGQVVNLMNQRLQTGFTENEVLQIFCDTCEAVARLHQCKTPIIHRDLKVENILLHDRGHYVLCDFGSATNKFQNPQTEGVNAVEDEIKKYTTLSYRAPEMVNLYSGKIITTKADIWALGCLLYKLCYFTLPFGESQVAICDGNFTIPDNSRYSQDMHCLIRYMLEPDPDKRPDIYQVSYFSFKLLKKECPIPNVQNSPIPAKLPEPVKASEAAAKKTQPKARLTDPIPTTETSIAPRQRPKAGQTQPNPGILPIQPALTPRKRATVQPPPQAAGSSNQPGLLASVPQPKPQAPPSQPLPQTQAKQPQAPPTPQQTPSTQAQGLPAQAQATPQHQQQLFLKQQQQQQQPPPAQQQPAGTFYQQQQAQTQQFQAVHPATQKPAIAQFPVVSQGGSQQQLMQNFYQQQQQQQQQQQQQQLATALHQQQLMTQQAALQQKPTMAAGQQPQPQPAAAPQPAPAQEPAIQAPVRQQPKVQTTPPPAVQGQKVGSLTPPSSPKTQRAGHRRILSDVTHSAVFGVPASKSTQLLQAAAAEASLNKSKSATTTPSGSPRTSQQNVYNPSEGSTWNPFDDDNFSKLTAEELLNKDFAKLGEGKHPEKLGGSAESLIPGFQSTQGDAFATTSFSAGTAEKRKGGQTVDSGLPLLSVSDPFIPLQVPDAPEKLIEGLKSPDTSLLLPDLLPMTDPFGSTSDAVIEKADVAVESLIPGLEPPVPQRLPSQTESVTSNRTDSLTGEDSLLDCSLLSNPTTDLLEEFAPTAISAPVHKAAEDSNLISGFDVPEGSDKVAEDEFDPIPVLITKNPQGGHSRNSSGSSESSLPNLARSLLLVDQLIDL'],
      dtype='<U2549')

In [21]:
y[:2]

array([7.36552273, 4.99999566])

The DAVIS data loader has several default parameters. *path* is the folder where the data is saved. *binary* controls whether the binding scores are converted to binary labels, since many models are built for classification. *convert_to_log* transforms Kd from nM to the p scale (log space), which is closer to normally distributed and therefore easier to regress on. *threshold* is the cutoff used for binary classification; the default is recommended, but you can also tune your own.
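
As a rough illustration of the nM-to-p conversion, the sketch below assumes that "p" means the negative base-10 logarithm of the molar concentration (as in pKd). The function name `nm_to_p` and the small `offset` term are assumptions, not taken from this tutorial; the offset is included because it guards against log(0) and, with a value of 1e-10, an input of 10,000 nM reproduces the second score printed above. Please check the DeepPurpose source for the exact formula.

```python
import math


def nm_to_p(value_nm, offset=1e-10):
    """Convert a concentration in nM to the p scale: -log10(molar concentration).

    Hypothetical helper; the offset is an assumed guard against log10(0).
    """
    molar = value_nm * 1e-9  # nM -> M
    return -math.log10(molar + offset)


# 10,000 nM is 1e-5 M, so the result is just below 5.
p_weak_binder = nm_to_p(10000)
# A tighter binder (smaller Kd) gets a larger p value.
p_strong_binder = nm_to_p(10)

print(p_weak_binder)
print(p_strong_binder)
```

On the p scale, stronger binders have larger values, which is the opposite direction from raw Kd in nM.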

KIBA works similarly.

In [23]:
X_drugs, X_targets, y = dataset.load_process_KIBA(path = './data', binary = False, threshold = 9)

Beginning Processing...
Beginning to extract zip file...
Done!


Another large dataset we support is BindingDB. It differs from the KIBA and DAVIS data loaders in three ways:
1. BindingDB is big (several GBs), so we provide a separate function to download it, *download_BindingDB()*, which returns the downloaded file path. You can then set *path = download_BindingDB()* in the *process_BindingDB()* function.
2. BindingDB has four binding values for drug-target pairs: IC50, EC50, Kd, and Ki. Set 'y' to the one you would like to use.
3. Loading BindingDB from a local file into Pandas is also pretty slow. So instead of passing a path to the function, you can also set df to an already loaded BindingDB pandas dataframe object.

In [26]:
data_path = dataset.download_BindingDB('./data/')
X_drugs, X_targets, y = dataset.process_BindingDB(path = data_path, df = None, y = 'Kd', binary = False, convert_to_log = True, threshold = 30)

Beginning to download dataset...
Beginning to extract zip file...
Done!
Loading Dataset from path...


b'Skipping line 772572: expected 193 fields, saw 205\nSkipping line 772598: expected 193 fields, saw 205\n'
b'Skipping line 805291: expected 193 fields, saw 205\n'
b'Skipping line 827961: expected 193 fields, saw 265\n'
b'Skipping line 1231688: expected 193 fields, saw 241\n'
b'Skipping line 1345591: expected 193 fields, saw 241\nSkipping line 1345592: expected 193 fields, saw 241\nSkipping line 1345593: expected 193 fields, saw 241\nSkipping line 1345594: expected 193 fields, saw 241\nSkipping line 1345595: expected 193 fields, saw 241\nSkipping line 1345596: expected 193 fields, saw 241\nSkipping line 1345597: expected 193 fields, saw 241\nSkipping line 1345598: expected 193 fields, saw 241\nSkipping line 1345599: expected 193 fields, saw 241\n'
b'Skipping line 1358864: expected 193 fields, saw 205\n'
b'Skipping line 1378087: expected 193 fields, saw 241\nSkipping line 1378088: expected 193 fields, saw 241\nSkipping line 1378089: expected 193 fields, saw 241\nSkipping line 1378090: e

Beginning Processing...
There are 66444 drug target pairs.
Default set to logspace (nM -> p) for easier regression


In [28]:
print('There are ' + str(len(X_drugs)) + ' drug-target pairs.')

There are 66444 drug-target pairs.


Now, let's show how to load training pairs from a txt file. We assume it has the following format, with one drug SMILES, target sequence, and binding score per line:

dti.txt

CC1=C...C4)N MKK...LIDL 7.365 \
CC1=C...C4)N QQP...EGKH 4.999

In [29]:
X_drugs, X_targets, y = dataset.read_file_training_dataset_drug_target_pairs('./toy_data/dti.txt')

In [30]:
X_drugs

array(['CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N',
       'CC1=C2C=C(C=CC2=NN1)C3=CC(=CN=C3)OCC(CC4=CC=CC=C4)N'],
      dtype='<U51')
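
For reference, a file in this format can be parsed with a few lines of plain Python. The sketch below is not the DeepPurpose loader: `read_dti_file`, the file name `toy_dti.txt`, and the toy SMILES and sequences are hypothetical, and it assumes three whitespace-separated fields per line (drug SMILES, target sequence, score). It returns plain lists rather than NumPy arrays.

```python
import os
import tempfile


def read_dti_file(path):
    """Parse lines of the form '<SMILES> <target sequence> <score>'.

    Hypothetical helper for illustration; not part of DeepPurpose.
    """
    drugs, targets, scores = [], [], []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or incomplete lines
            drugs.append(fields[0])
            targets.append(fields[1])
            scores.append(float(fields[2]))  # scores are stored as floats
    return drugs, targets, scores


# Toy data: made-up short SMILES and sequences, scores taken from the format example.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_dti.txt")
    with open(demo_path, "w") as handle:
        handle.write("CCO MKKFFDSRRE 7.365\n")
        handle.write("CCN QQPGLLASVP 4.999\n")
    drugs_demo, targets_demo, y_demo = read_dti_file(demo_path)

print(drugs_demo, targets_demo, y_demo)
```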

We are almost there! Finally, let's look at bioassay data. For now, we have only written a loader for the AID1706 bioassay, but please check the source code, since it is easy to write another one.

There are several things to look at.

1. There is a new balanced parameter. Bioassay data are usually highly skewed (i.e. only a few compounds are hits and most are not), so for better training we can make the data somewhat more balanced.
2. The balancing ratio can be tuned with the oversample_num parameter, which sets the ratio of unbalanced:balanced data points.

In [31]:
X_drugs, X_targets, y = dataset.load_AID1706_SARS_CoV_3CL(path = './data', binary = True, threshold = 15, balanced = True, oversample_num = 30, seed = 1)

Beginning Processing...


  if (await self.run_code(code, result,  async_=asy)):
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val['binary_label'][(val.PUBCHEM_ACTIVITY_SCORE >= threshold) & (val.PUBCHEM_ACTIVITY_SCORE <=100)] = 1


Default binary threshold for the binding affinity scores is 15, recommended by the investigator
Done!


Finally, we show how to load customized bioassay training data. We assume the following format, with the target sequence on the first line, followed by one drug SMILES and its label per line:

AID1706.txt

SGFKKLVSP...GVRLQ \
CCOC1...C=N4 0 \
CCCCO...=CS2 0 \
COC1=...23)F 0 \
C1=CC...)C#N 1 \
CC(=O...3.Cl 1

In [33]:
X_drugs, X_targets, y = dataset.read_file_training_dataset_bioassay('./toy_data/AID1706.txt')

In [36]:
X_drugs[:2]

array(['CCOC1=CC=C(C=C1)N2C=CC(=O)C(=N2)C(=O)NC3=CC=C(C=C3)S(=O)(=O)NC4=NC=CC=N4',
       'CCCCOC(=O)C1=CC=C(C=C1)NC(=O)/C=C/C2=CC=CS2'], dtype='<U79')
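
The bioassay format differs from the pair format because the single target appears only once, on the first line. A minimal parsing sketch under that assumption follows. It is not the DeepPurpose loader: `read_bioassay_file`, the file name `toy_bioassay.txt`, and the toy data are hypothetical, and returning the target as a single string (rather than one copy per drug) is a choice made here for simplicity.

```python
import os
import tempfile


def read_bioassay_file(path):
    """Parse a file whose first line is the target sequence and whose
    remaining lines have the form '<SMILES> <label>'.

    Hypothetical helper for illustration; not part of DeepPurpose.
    """
    drugs, labels = [], []
    with open(path) as handle:
        target = handle.readline().strip()  # the one shared target
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or incomplete lines
            drugs.append(fields[0])
            labels.append(float(fields[1]))
    return drugs, target, labels


# Toy data: a shortened sequence and made-up short SMILES with 0/1 labels.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "toy_bioassay.txt")
    with open(demo_path, "w") as handle:
        handle.write("SGFKKLVSPSSAVEK\n")
        handle.write("CCO 0\n")
        handle.write("CCN 1\n")
    bio_drugs, bio_target, bio_labels = read_bioassay_file(demo_path)

print(bio_drugs, bio_target, bio_labels)
```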

That's all for now! Be sure to check out more demos and tutorials for DeepPurpose, and let us know what you like or don't like so that we can improve : )

