# External annotation databases

## Import GENCODE human release 32

Next, we can annotate the genes in our GTEx expression dataset with genomic annotations from GENCODE. In this example,
we use the URL path prefix "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/", which specifies the
species and release version. We also pass a dictionary `file_resources` of key-value pairs, where each key is the name
of a file and each value is the suffix of that file's download URL.

For example, `file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` will download the file
located at <ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz>
and process it as the `long_noncoding_RNAs.gtf` file.

Here, we load both "long_noncoding_RNAs.gtf" and "basic.annotation.gtf", which builds a dataframe of combined
annotations for both lncRNAs and mRNAs. You can select different GENCODE annotation files by modifying
the `file_resources` dict argument.

```{code-block} python
from openomics.database import GENCODE, EnsemblGenes

gencode = GENCODE(path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
                  file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                                  "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                                  "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz",
                                  "transcripts.fa": "gencode.v32.transcripts.fa.gz"},
                  remove_version_num=True)

# We also load Ensembl genes to get a list of miRNA gene IDs
ensembl = EnsemblGenes(biomart='hsapiens_gene_ensembl', npartitions=8)
```

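To make the prefix-plus-suffix rule for `file_resources` concrete, the following is a minimal sketch in plain Python. The helper `resolve_download_urls` is hypothetical and is not part of openomics; it only illustrates how each dictionary value is appended to the URL path prefix, and it does not download anything.

```python
def resolve_download_urls(path_prefix, file_resources):
    """Return {file name: full URL}, formed as path prefix + URL suffix.

    Hypothetical helper for illustration only; not the openomics implementation.
    """
    # Make sure exactly one slash separates the prefix from the suffix.
    base = path_prefix.rstrip("/") + "/"
    return {name: base + suffix for name, suffix in file_resources.items()}


prefix = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/"
file_resources = {
    "long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
}

urls = resolve_download_urls(prefix, file_resources)
for name, url in urls.items():
    print(f"{name} -> {url}")
```

Printing the resolved URLs before constructing the `GENCODE` object is a cheap way to catch a typo in a suffix or a mismatched release number.
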
## Setting the cache download directory

The package `astropy` is used to automatically cache downloaded files. By default, it saves the files to
`~/.astropy/cache/`, and cached content is retrieved by matching the download URL. To change the cache directory, run:

```python
import openomics

openomics.set_cache_dir(path="PATH/OF/YOUR/CHOICE/")
```

```{note}
This setting doesn't persist across programming sessions. Ideally, the cache directory should be kept in one location to minimize repeated FTP downloads, which may put unnecessary load on the database server.
```
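
Because the cache setting has to be reapplied in every session, one option is to resolve the directory the same way at the top of each script. The sketch below assumes a hypothetical environment variable `OPENOMICS_CACHE_DIR` (openomics does not define or read it) and a hypothetical helper `resolve_cache_dir`; only `openomics.set_cache_dir(path=...)` comes from the package itself, and it is left commented out so the block runs without openomics installed.

```python
import os
from pathlib import Path


def resolve_cache_dir(env_var="OPENOMICS_CACHE_DIR", default="~/.astropy/cache/"):
    """Pick one cache directory per machine and make sure it exists.

    `env_var` is a hypothetical variable name chosen for this sketch;
    `default` is the astropy cache location mentioned above.
    """
    raw = os.environ.get(env_var) or default
    path = Path(raw).expanduser()
    # Create the directory (and any parents) if it is not there yet.
    path.mkdir(parents=True, exist_ok=True)
    return str(path)


# At the start of every session or script:
# import openomics
# openomics.set_cache_dir(path=resolve_cache_dir())
```

Keeping the location in a single environment variable means every script on the machine points at the same cache, so a file that has already been downloaded once is not fetched again.
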