
--- a
+++ b/docs/usage/getting-started.md
@@ -0,0 +1,405 @@
+# Getting started
+
+Welcome! This tutorial highlights the core features of the OpenOmics API. For in-depth details and conceptual guides, follow the links within each section or see the documentation index, which links to use cases and the API reference.
+
+## Loading a single-omics dataframe
+
+Suppose you have a single-omics dataset and would like to load it as a dataframe.
+
+As an example, we use the `TCGA` Lung Adenocarcinoma dataset
+from [tests/data/TCGA_LUAD](https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD). Data tables are
+tab-delimited and have the following format:
+
+| GeneSymbol | EntrezID  | TCGA-05-4244-01A-01R-1107-07 | TCGA-05-4249-01A-01R-1107-07 | ...  |
+| ---------- | --------- | ---------------------------- | ---------------------------- | ---- |
+| A1BG       | 100133144 | 10.8123                      | 3.7927                       | ...  |
+| ⋮          | ⋮         | ⋮                            | ⋮                            | ⋮    |
+
+Depending on whether your data table is stored locally as a single file, split into multiple files, or already loaded as a dataframe, you can load it using the class {class}`openomics.transcriptomics.Expression` or any of its subclasses.
+
+````{tab} From a single file
+If the dataset is a local file in a tabular format, OpenOmics can help you load it into a pandas dataframe.
+
+```{code-block} python
+from openomics.multiomics import MessengerRNA
+
+mrna = MessengerRNA(
+    data="https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
+    transpose=True,
+    usecols="GeneSymbol|TCGA", # A regex that matches all column names containing either the "GeneSymbol" or the "TCGA" substring
+    gene_index="GeneSymbol", # This column contains the gene index
+    )
+```
+
+Note that the raw data file is column-oriented, with each column corresponding to a sample, so we have to use the argument `transpose=True` to convert it to a row-oriented table in which each row is a sample.
+> MessengerRNA (576, 20472)
+````
+
+````{tab} From multiple files (glob)
+If your dataset is large, it may be broken up into multiple files with a similar file name prefix or suffix. Assuming all the files share the same tabular format, OpenOmics can load all of them and construct an integrated data table using the memory-efficient Dask dataframe.
+
+```python
+from openomics.multiomics import MessengerRNA
+
+mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally
+                    transpose=True,
+                    usecols="GeneSymbol|TCGA",
+                    gene_index="GeneSymbol")
+```
+
+> INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt']
+````
+
+````{tab} From DataFrame
+If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`.
+
+```python
+import pandas as pd
+import numpy as np
+from openomics.multiomics import MessengerRNA
+
+# A random expression dataframe with Ensembl gene IDs as columns and 5 samples as rows.
+df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
+                        "ENSG00000198973": np.random.rand(5),
+                        "ENSG00000198974": np.random.rand(5),
+                        "ENSG00000198975": np.random.rand(5),
+                        "ENSG00000198976": np.random.rand(5),
+                        "ENSG00000198982": np.random.rand(5),
+                        "ENSG00000198983": np.random.rand(5)},
+                  index=range(5))
+mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
+```
+````
+---
+To access the {class}`DataFrame`, simply use {obj}`mrna.expressions`:
+```python
+print(mrna.expressions)
+```
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+
+</style>
+<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th>GeneSymbol</th>
+      <th>A1BG</th>
+      <th>A1BG-AS1</th>
+      <th>A1CF</th>
+      <th>A2M</th>
+    </tr>
+    <tr>
+      <th>sample_index</th>
+      <th></th>
+      <th></th>
+      <th></th>
+      <th></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>TCGA-05-4244-01A-01R-1107-07</th>
+      <td>26.0302</td>
+      <td>36.7711</td>
+      <td>0.000</td>
+      <td>9844.7858</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4249-01A-01R-1107-07</th>
+      <td>120.1349</td>
+      <td>132.1439</td>
+      <td>0.322</td>
+      <td>25712.6617</td>
+    </tr>
+  </tbody>
+</table>
+</div>
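+
+To make the effect of `usecols`, `gene_index`, and `transpose=True` more concrete, the following is a rough pandas-only sketch of the same transformation. It is an illustration under assumptions, not the OpenOmics implementation: the two-sample table is typed in by hand from the format table shown earlier, and the variable names `raw`, `table`, and `expressions` are made up for this example.
+
+```python
+import io
+
+import pandas as pd
+
+# A tiny tab-delimited table in the same layout as the example data file
+# (one gene row, two sample columns; values copied from the format table).
+raw = io.StringIO(
+    "GeneSymbol\tEntrezID\tTCGA-05-4244-01A-01R-1107-07\tTCGA-05-4249-01A-01R-1107-07\n"
+    "A1BG\t100133144\t10.8123\t3.7927\n"
+)
+table = pd.read_csv(raw, sep="\t")
+
+# Keep only columns whose name contains "GeneSymbol" or "TCGA" (this drops EntrezID).
+table = table.filter(regex="GeneSymbol|TCGA")
+
+# Use the gene column as the index, then transpose so that rows are samples
+# and columns are genes.
+expressions = table.set_index("GeneSymbol").T
+expressions.index.name = "sample_index"
+
+print(expressions.shape)  # (2, 1): two samples, one gene
+```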
+
+<br/>
+
+## Creating a multi-omics dataset
+
+When you have multiple single-omics datasets, each with a different set of genes and samples, you can use {class}`openomics.MultiOmics` to integrate them.
+
+```{code-block} python
+from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
+
+path = "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/"
+
+# Load each expression dataframe
+mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
+    transpose=True,
+    usecols="GeneSymbol|TCGA",
+    gene_index="GeneSymbol")
+miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
+    transpose=True,
+    usecols="GeneSymbol|TCGA",
+    gene_index="GeneSymbol")
+lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
+    transpose=True,
+    usecols="Gene_ID|TCGA",
+    gene_index="Gene_ID")
+som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
+    transpose=True,
+    usecols="GeneSymbol|TCGA",
+    gene_index="GeneSymbol")
+pro = Protein(path+"protein_RPPA.txt",
+    transpose=True,
+    usecols="GeneSymbol|TCGA",
+    gene_index="GeneSymbol")
+
+# Create an integrated MultiOmics dataset
+luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, miRNA, lncRNA, som, pro])
+# You can also add individual -omics one at a time, e.g. `luad_data.add_omic(mRNA)`
+
+luad_data.build_samples()
+```
+`luad_data` is a {class}`MultiOmics` object; calling `build_samples()` builds its samples list from all the samples found in each -omics dataset.
+
+> MessengerRNA (576, 20472)
+> MicroRNA (494, 1870)
+> LncRNA (546, 12727)
+> SomaticMutation (587, 21070)
+> Protein (364, 154)
+
+To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`:
+```python
+luad_data.MessengerRNA
+# or
+luad_data.data["MessengerRNA"]
+```
+
+<br/>
+
+## Adding clinical data as sample attributes
+
+When sample attributes are available for the study cohort, load them as a data table with {class}`openomics.clinical.ClinicalData`, then add the table to the {class}`openomics.multiomics.MultiOmics` dataset to enable querying for subsets of samples across the multi-omics data.
+
+```python
+from openomics import ClinicalData
+
+clinical = ClinicalData(
+    "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
+    patient_index="bcr_patient_barcode")
+
+luad_data.add_clinical_data(clinical)
+
+luad_data.clinical.patient
+```
+
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+
+</style>
+<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>bcr_patient_uuid</th>
+      <th>form_completion_date</th>
+      <th>histologic_diagnosis</th>
+      <th>prospective_collection</th>
+      <th>retrospective_collection</th>
+    </tr>
+    <tr>
+      <th>bcr_patient_barcode</th>
+      <th></th>
+      <th></th>
+      <th></th>
+      <th></th>
+      <th></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>TCGA-05-4244</th>
+      <td>34040b83-7e8a-4264-a551-b16621843e28</td>
+      <td>2010-7-22</td>
+      <td>Lung Adenocarcinoma</td>
+      <td>NO</td>
+      <td>YES</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4245</th>
+      <td>03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2</td>
+      <td>2010-7-22</td>
+      <td>Lung Adenocarcinoma</td>
+      <td>NO</td>
+      <td>YES</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4249</th>
+      <td>4addf05f-3668-4b3f-a17f-c0227329ca52</td>
+      <td>2010-7-22</td>
+      <td>Lung Adenocarcinoma</td>
+      <td>NO</td>
+      <td>YES</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+Note that in the clinical data table, `bcr_patient_barcode` is the column with `TCGA-XX-XXXX` patient IDs, which should match
+the `sample_index` index of the `mRNA.expressions` dataframe.
+
+````{note}
+In our `TCGA_LUAD` example, the `bcr_patient_barcode` index of the clinical dataframe may not match the `sample_index` in `mRNA`, because the latter may use a longer form, `TCGA-XX-XXXX-XXX-XXX-XXXX-XX`, that also contains the sample number and aliquot IDs. To make them match, you can modify the index strings on the fly using the [pandas string API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html):
+```python
+mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters
+```
+````
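+
+The effect of this truncation can be checked on a toy example. The sketch below uses plain pandas only and is not part of the OpenOmics API; the barcodes are copied from the tables above, and the names `expressions`, `patients`, `before`, and `after` are chosen just for this illustration.
+
+```python
+import pandas as pd
+
+# Toy expression table indexed by full aliquot barcodes.
+expressions = pd.DataFrame(
+    {"A1BG": [26.0302, 120.1349]},
+    index=pd.Index(
+        ["TCGA-05-4244-01A-01R-1107-07", "TCGA-05-4249-01A-01R-1107-07"],
+        name="sample_index",
+    ),
+)
+
+# Patient barcodes as they appear in the clinical table.
+patients = pd.Index(
+    ["TCGA-05-4244", "TCGA-05-4245", "TCGA-05-4249"],
+    name="bcr_patient_barcode",
+)
+
+# Count how many expression samples can be found in the clinical index
+# before and after keeping only the first 12 characters of each barcode.
+before = expressions.index.isin(patients).sum()
+expressions.index = expressions.index.str.slice(0, 12)
+after = expressions.index.isin(patients).sum()
+
+print(before, after)  # 0 2
+```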
+
+<br/>
+
+## Import an external database
+
+Next, we may want to annotate the gene list in our RNA-seq expression dataset with genomic annotations. To do so, we need to download annotations from the [GENCODE database](https://www.gencodegenes.org/), preprocess the annotation files into a dataframe, and then match them with the genes in our dataset.
+
+OpenOmics provides an API to download the GENCODE annotation files via FTP in two steps:
+1. First, provide the base `path` of the FTP download server, usually found in the direct download link on GENCODE's website. In most cases, the base `path` determines the species, genome assembly, and database version for your study.
+2. Second, use the `file_resources` dict parameter to select the data files required to construct the annotation dataframe. For each entry in `file_resources`, the key is the alias of the required file, and the value is the file name relative to the FTP base `path`.
+
+   For example, the entry `{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` tells the GENCODE class to preprocess a `.gtf` file with the alias `"long_noncoding_RNAs.gtf"`, downloaded from the FTP path `ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz`.
+
+   To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}`openomics.database.sequence.GENCODE`.
+
+```python
+from openomics.database.sequence import GENCODE
+
+gencode = GENCODE(
+    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
+    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
+                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
+                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences
+                    "transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences
+                    },
+    blocksize='100MB', # If not None, partition the dataframe with Dask into blocks of this size to leverage out-of-core multiprocessing
+)
+```
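+
+As a quick sanity check before downloading, you can spell out which URL each alias refers to. The snippet below is only a sketch of the "base `path` plus file name" rule described above, written with plain string handling; the `urls` dictionary is a name introduced for this illustration and is not an attribute of the GENCODE class.
+
+```python
+path = "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/"
+file_resources = {
+    "long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
+    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
+}
+
+# Assumed rule: full URL = base path + "/" + file name.
+urls = {
+    alias: path.rstrip("/") + "/" + filename
+    for alias, filename in file_resources.items()
+}
+
+for alias, url in urls.items():
+    print(f"{alias} -> {url}")
+```
+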
+To access the attributes constructed from the combination of annotations `long_noncoding_RNAs.gtf` and
+`basic.annotation.gtf`, use:
+
+```python
+gencode.data
+```
+
+<div>
+<style scoped>
+    .dataframe tbody tr th:only-of-type {
+        vertical-align: middle;
+    }
+
+    .dataframe tbody tr th {
+        vertical-align: top;
+    }
+
+    .dataframe thead th {
+        text-align: right;
+    }
+
+</style>
+<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>gene_id</th>
+      <th>gene_name</th>
+      <th>index</th>
+      <th>seqname</th>
+      <th>source</th>
+      <th>feature</th>
+      <th>start</th>
+      <th>end</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>0</th>
+      <td>ENSG00000243485</td>
+      <td>MIR1302-2HG</td>
+      <td>0</td>
+      <td>chr1</td>
+      <td>HAVANA</td>
+      <td>gene</td>
+      <td>29554</td>
+      <td>31109</td>
+    </tr>
+    <tr>
+      <th>1</th>
+      <td>ENSG00000243485</td>
+      <td>MIR1302-2HG</td>
+      <td>1</td>
+      <td>chr1</td>
+      <td>HAVANA</td>
+      <td>transcript</td>
+      <td>29554</td>
+      <td>31097</td>
+    </tr>
+  </tbody>
+</table>
+</div>
+
+
+<br/>
+
+## Annotate your expression dataset with attributes
+With the annotation database, you can perform a join operation to add gene attributes to your {class}`openomics.transcriptomics.Expression` dataset. To annotate the genes in `mRNA.expressions`, first select the key column in `gencode.data` (for example `gene_id`) whose values match the gene index of your dataset. The following code snippets cover a variety of database types.
+
+````{tab} Genomics attributes
+```python
+luad_data.MessengerRNA.annotate_attributes(gencode,
+    index="gene_id",
+    columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe
+    )
+```
+
+````
+
+````{tab} Sequences
+```python
+luad_data.MessengerRNA.annotate_sequences(gencode,
+    index="gene_name",
+    agg_sequences="all", # Collect all sequences with the gene_name into a list
+    )
+```
+````
+
+````{tab} Disease Associations
+```python
+from openomics.database.disease import DisGeNet
+disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)
+
+luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")
+```
+````
+
+---
+To view the resulting annotations dataframe, use:
+```python
+luad_data.MessengerRNA.annotations
+```
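+
+Conceptually, the annotation step is a keyed join between the gene index of the dataset and the annotation table. The following minimal pandas sketch illustrates that idea only; it is not how OpenOmics implements `annotate_attributes`. The `annotation_table` rows are copied from the `gencode.data` preview above, and `ENSG00000000000` is a made-up placeholder ID standing in for a gene that has no annotation.
+
+```python
+import pandas as pd
+
+# Two rows copied from the gencode.data preview (one gene, one transcript).
+annotation_table = pd.DataFrame({
+    "gene_id": ["ENSG00000243485", "ENSG00000243485"],
+    "gene_name": ["MIR1302-2HG", "MIR1302-2HG"],
+    "feature": ["gene", "transcript"],
+    "start": [29554, 29554],
+    "end": [31109, 31097],
+})
+
+# Gene index of a hypothetical expression dataset.
+# The second ID is a placeholder with no matching annotation.
+genes = pd.Index(["ENSG00000243485", "ENSG00000000000"], name="gene_id")
+
+# Keep one row per gene, keyed by gene_id, with only the columns to add.
+gene_rows = annotation_table[annotation_table["feature"] == "gene"]
+per_gene = gene_rows.set_index("gene_id")[["gene_name", "start", "end"]]
+
+# Align the annotations to the dataset's genes; unmatched genes get NaN.
+annotations = per_gene.reindex(genes)
+print(annotations)
+```
+
+Genes without a match are kept and filled with missing values, which mirrors the behavior of a left join.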
+
+
+For a more detailed guide, refer to the [annotation interfaces API](../modules/openomics.annotate.md).