[548210]: / docs / usage / getting-started.md

Download this file

406 lines (338 with data), 13.6 kB

Getting started

Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index which has links to use cases, and API reference sections.

Loading a single-omics dataframe

Suppose you have a single-omics dataset and would like to load them as a dataframe.

As an example, we use the TGCA Lung Adenocarcinoma dataset
from tests/data/TCGA_LUAD. Data tables are
tab-delimited and have the following format:

GeneSymbol EntrezID TCGA-05-4244-01A-01R-1107-07 TCGA-05-4249-01A-01R-1107-07 ...
A1BG 100133144 10.8123 3.7927 ...

Depending on whether your data table is stored locally as a single file, splitted into multiple files, or was already a dataframe, you can load it using the class {class}openomics.transcriptomics.Expression or any of its subclasses.

````{tab} From a single file
If the dataset is a local file in a tabular format, OpenOmics can help you load them to Pandas dataframe.

```{code-block} python
from openomics.multiomics import MessengerRNA

mrna = MessengerRNA(
data="https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
transpose=True,
usecols="GeneSymbol|TCGA", # A regex that matches all column name with either "GeneSymbol" or "TCGA substring
gene_index="GeneSymbol", # This column contains the gene index
)

One thing to pay attention is that the raw data file given is column-oriented where columns corresponds to samples, so we have use the argument `transpose=True` to convert to row-oriented.
> MessengerRNA (576, 20472)
````

````{tab} From multiple files (glob)
If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix. Assuming all the files have similar tabular format, OpenOmics can load all files and contruct an integrated data table using the memory-efficient Dask dataframe.

```python
from openomics.multiomics import MessengerRNA

mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally
                    transpose=True,
                    usecols="GeneSymbol|TCGA",
                    gene_index="GeneSymbol")

INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt']

````{tab} From DataFrame
If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`.

```python
import pandas as pd
import numpy as np
from openomics.multiomics import MessengerRNA

# A random dataframe of microRNA gene_id's.
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
                        "ENSG00000198973": np.random.rand(5),
                        "ENSG00000198974": np.random.rand(5),
                        "ENSG00000198975": np.random.rand(5),
                        "ENSG00000198976": np.random.rand(5),
                        "ENSG00000198982": np.random.rand(5),
                        "ENSG00000198983": np.random.rand(5)},
                  index=range(5))
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
```

To access the {class}DataFrame, simply use {obj}mrna.expressions:

print(mrna.expressions)
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
GeneSymbol A1BG A1BG-AS1 A1CF A2M
sample_index
TCGA-05-4244-01A-01R-1107-07 26.0302 36.7711 0.000 9844.7858
TCGA-05-4249-01A-01R-1107-07 120.1349 132.1439 0.322 25712.6617


Creating a multi-omics dataset

With multiple single-omics, each with different sets of genes and samples, you can use the {class}openomics.MultiOmics to integrate them.

```{code-block} python
from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein

path = "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/"

Load each expression dataframe

mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
transpose=True,
usecols="Gene_ID|TCGA",
gene_index="Gene_ID")
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")
pro = Protein(path+"protein_RPPA.txt",
transpose=True,
usecols="GeneSymbol|TCGA",
gene_index="GeneSymbol")

Create an integrated MultiOmics dataset

luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, mRNA, lncRNA, som, pro])

You can also add individual -omics one at a time luad_data.add_omic(mRNA)

luad_data.build_samples()

The `luad_data` is a {class}`MultiOmics` object builds the samples list from all the samples given in each -omics data.

> MessengerRNA (576, 20472)
> MicroRNA (494, 1870)
> LncRNA (546, 12727)
> SomaticMutation (587, 21070)
> Protein (364, 154)

To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`:
```python
luad_data.MessengerRNA
# or
luad_data.data["MessengerRNA"]


Adding clinical data as sample attributes

When sample attributes are provided for the study cohort, load it as a data table with the {class}openomics.clinical.ClinicalData, then add it to the {class}openomics.multiomics.MultiOmics dataset to enable querying for subsets of samples across the multi-omics.

from openomics import ClinicalData

clinical = ClinicalData(
    "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
    patient_index="bcr_patient_barcode")

luad_data.add_clinical_data(clinical)

luad_data.clinical.patient
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
bcr_patient_uuid form_completion_date histologic_diagnosis prospective_collection retrospective_collection
bcr_patient_barcode
TCGA-05-4244 34040b83-7e8a-4264-a551-b16621843e28 2010-7-22 Lung Adenocarcinoma NO YES
TCGA-05-4245 03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2 2010-7-22 Lung Adenocarcinoma NO YES
TCGA-05-4249 4addf05f-3668-4b3f-a17f-c0227329ca52 2010-7-22 Lung Adenocarcinoma NO YES

Note that in the clinical data table, bcr_patient_barcode is the column with TCGA-XX-XXXX patient IDs, which matches
that of the sample_index index column in the mrna.expressions dataframe.

In our `TCGA_LUAD` example, mismatches in the `bcr_patient_barcode` sample index of clinical dataframe may happen because the `sample_index` in `mRNA` may have a longer form `TCGA-XX-XXXX-XXX-XXX-XXXX-XX` that contain the samples number and aliquot ID's. To make them match, you can modify the index strings on-the-fly using the [Pandas's extensible API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html):
```python
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters
```


Import an external database

Next, we may want to annotate the genes list in our RNA-seq expression dataset with genomics annotation. To do so, we'd need to download annotations from the GENCODE database, preprocess annotation files into a dataframe, and then match them with the genes in our dataset.

OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps:
1. First, provide the base path of the FTP download server - usually found in the direct download link on GENCODE's website. Most of the time, selecting the right base path allows you to specify the specific species, genome assembly, and database version for your study.
2. Secondly, use the file_resources dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in the file_resources, the key is the alias of the file required, and the value is the filename with the FTP base path.

For example, the entry {"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"} indicates the GENCODE class to preprocess a .gtf file with the alias "long_noncoding_RNAs.gtf", downloaded from the FTP path ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz

To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}openomics.database.sequence.GENCODE.

from openomics.database.sequence import GENCODE

gencode = GENCODE(
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences
                    "transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences
                    },
    blocksize='100MB', # if not null, then use partition the dataframe with Dask to this size and leverage out-of-core multiprocessing
)

To access the attributes constructed from the combination of annotations long_noncoding_RNAs.gtf and basic.annotation.gtf, use:

gencode.data
<style scoped=""> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style>
gene_id gene_name index seqname source feature start end
0 ENSG00000243485 MIR1302-2HG 0 chr1 HAVANA gene 29554 31109
1 ENSG00000243485 MIR1302-2HG 1 chr1 HAVANA transcript 29554 31097


Annotate your expression dataset with attributes

With the annotation database, you can perform a join operation to add gene attributes to your {class}openomics.transcriptomics.Expression dataset. To annotate attributes for the gene_id list mRNA.expression, you must first select the corresponding column in gencode.data with matching gene_id keys. The following are code snippets for a variety of database types.

````{tab} Genomics attributes

luad_data.MessengerRNA.annotate_attributes(gencode,
    index="gene_id",
    columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe
    )
````{tab} Sequences
```python
luad_data.MessengerRNA.annotate_sequences(gencode,
    index="gene_name",
    agg_sequences="all", # Collect all sequences with the gene_name into a list
    )
```

````{tab} Disease Associations

from openomics.database.disease import DisGeNet
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)

luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")

````


To view the resulting annotations dataframe, use:

luad_data.MessengerRNA.annotations

For more detailed guide, refer to the annotation interfaces API.