--- a
+++ b/docs/usage/import-your-dataset.md
@@ -0,0 +1,334 @@
+# Loading a multi-omics dataset
+
+Suppose you have your own -omics dataset(s) and you'd like to load them. One of OpenOmics's primary goals is to
+encapsulate the data import process in one line of code with a few parameters. Given any processed single-omic
+dataset, the library loads the data as a tabular structure where rows correspond to observation samples and columns
+correspond to measurements of different biomolecules.
+
+Import the TCGA LUAD data included in the tests dataset (preprocessed with TCGA-Assembler). It is located
+at [tests/data/TCGA_LUAD](https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD).
+
+```{code-block} python
+folder_path = "tests/data/TCGA_LUAD/"
+```
+
+Load the multi-omics data: gene expression, microRNA expression, lncRNA expression, somatic mutation, and protein expression.
+
+```{code-block} python
+from openomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
+
+# Load each expression dataframe
+mRNA = MessengerRNA(data=folder_path + "LUAD__geneExp.txt",
+                    transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="gene_name")
+miRNA = MicroRNA(data=folder_path + "LUAD__miRNAExp__RPM.txt",
+                 transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="transcript_name")
+lncRNA = LncRNA(data=folder_path + "TCGA-rnaexpr.tsv",
+                transpose=True, usecols="Gene_ID|TCGA", gene_index="Gene_ID", gene_level="gene_id")
+som = SomaticMutation(data=folder_path + "LUAD__somaticMutation_geneLevel.txt",
+                      transpose=True, usecols="GeneSymbol|TCGA", gene_index="gene_name")
+pro = Protein(data=folder_path + "protein_RPPA.txt",
+              transpose=True, usecols="GeneSymbol|TCGA", gene_index="GeneSymbol", gene_level="protein_name")
+
+# Create an integrated MultiOmics dataset
+luad_data = MultiOmics(cohort_name="LUAD")
+luad_data.add_clinical_data(
+    clinical=folder_path + "nationwidechildrens.org_clinical_patient_luad.txt")
+
+luad_data.add_omic(mRNA)
+luad_data.add_omic(miRNA)
+luad_data.add_omic(lncRNA)
+luad_data.add_omic(som)
+luad_data.add_omic(pro)
+
+luad_data.build_samples()
+```
+
+Each dataset is stored as a pandas DataFrame. Below are all the data imported for TCGA LUAD. For each, the first number is the number of samples and the second is the number of features.
+
+> PATIENTS (522, 5)
+  SAMPLES (1160, 6)
+  DRUGS (461, 4)
+  MessengerRNA (576, 20472)
+  SomaticMutation (587, 21070)
+  MicroRNA (494, 1870)
+  LncRNA (546, 12727)
+  Protein (364, 154)
+
+You may notice that in this dataset the sample indexes (e.g. TCGA-XX-XXXX) do not match across the different omics. It may
+be necessary to truncate them to 12 characters in total.
+
+```python
+# The lncRNA sample IDs keep their last 12 characters; the others keep their first 12
+lncRNA.expressions.index = lncRNA.expressions.index.str.slice(-12)
+miRNA.expressions.index = miRNA.expressions.index.str.slice(0, 12)
+mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12)
+som.expressions.index = som.expressions.index.str.slice(0, 12)
+pro.expressions.index = pro.expressions.index.str.slice(0, 12)
+
+luad_data.build_samples()
+luad_data.samples
+```
+> Index(['TCGA-05-4244', 'TCGA-05-4249', 'TCGA-05-4250', 'TCGA-05-4382',
+  'TCGA-05-4384', 'TCGA-05-4389', 'TCGA-05-4390', 'TCGA-05-4395',
+  'TCGA-05-4396', 'TCGA-05-4397', ...
+  'TCGA-NJ-A4YG', 'TCGA-NJ-A4YI', 'TCGA-NJ-A4YP', 'TCGA-NJ-A4YQ',
+  'TCGA-NJ-A55A', 'TCGA-NJ-A55O', 'TCGA-NJ-A55R', 'TCGA-NJ-A7XG',
+  'TCGA-O1-A52J', 'TCGA-S2-AA1A'], dtype='object', length=952)
+
+## Load single omics expressions for MessengerRNA, MicroRNA, LncRNA
+
+We instantiate the MessengerRNA, MicroRNA and LncRNA -omics expression data from `gtex.data`. Since the gene expression
+values were not separated by RNA type, we use GENCODE and Ensembl gene annotations to filter the list of mRNAs, miRNAs, and
+lncRNAs.
+
+```{code-block} python
+import pandas as pd
+from openomics import MessengerRNA, MicroRNA, LncRNA
+
+# Assumes gtex_transcripts, gtex_transcripts_gene_id, gencode and ensembl are already loaded.
+# Index.intersection replaces the `&` operator, which newer pandas versions no longer treat as a set operation.
+
+# Gene Expression
+messengerRNA_id = pd.Index(gencode.data[gencode.data["gene_type"] == "protein_coding"]["gene_id"].unique()).intersection(gtex_transcripts_gene_id)
+
+messengerRNA = MessengerRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(messengerRNA_id)],
+                            transpose=True, gene_index="gene_name", usecols=None, npartitions=4)
+
+# MicroRNA expression
+microRNA_id = pd.Index(ensembl.data[ensembl.data["gene_biotype"] == "miRNA"]["gene_id"].unique()).intersection(gtex_transcripts_gene_id)
+
+microRNA = MicroRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(microRNA_id)],
+                    gene_index="gene_id", transpose=True, usecols=None)
+
+# LncRNA expression
+lncRNA_id = pd.Index(gencode.data[gencode.data["gene_type"] == "lncRNA"]["gene_id"].unique()).intersection(gtex_transcripts_gene_id)
+lncRNA = LncRNA(gtex_transcripts[gtex_transcripts["gene_id"].isin(lncRNA_id)],
+                gene_index="gene_id", transpose=True, usecols=None)
+```
+
+## Create a MultiOmics dataset
+
+Now, we create a MultiOmics dataset object by combining the messengerRNA, microRNA, and lncRNA.
+
+```{code-block} python
+from openomics import MultiOmics
+
+gtex_data = MultiOmics(cohort_name="GTEx Tissue Avg Expressions")
+
+gtex_data.add_omic(messengerRNA)
+gtex_data.add_omic(microRNA)
+gtex_data.add_omic(lncRNA)
+
+gtex_data.build_samples()
+```
+
+## Accessing clinical data
+
+Each omics and clinical table can be accessed through `luad_data.data[]`, for example:
+
+```{code-block} python
+luad_data.data["PATIENTS"]
+```
+<div>
+<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th></th>
+      <th>gender</th>
+      <th>race</th>
+      <th>histologic_subtype</th>
+      <th>pathologic_stage</th>
+    </tr>
+    <tr>
+      <th>bcr_patient_barcode</th>
+      <th></th>
+      <th></th>
+      <th></th>
+      <th></th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>TCGA-05-4244</th>
+      <td>MALE</td>
+      <td>NaN</td>
+      <td>Lung Adenocarcinoma- Not Otherwise Specified (...</td>
+      <td>Stage IV</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4245</th>
+      <td>MALE</td>
+      <td>NaN</td>
+      <td>Lung Adenocarcinoma- Not Otherwise Specified (...</td>
+      <td>Stage III</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4249</th>
+      <td>MALE</td>
+      <td>NaN</td>
+      <td>Lung Adenocarcinoma- Not Otherwise Specified (...</td>
+      <td>Stage I</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4250</th>
+      <td>FEMALE</td>
+      <td>NaN</td>
+      <td>Lung Adenocarcinoma- Not Otherwise Specified (...</td>
+      <td>Stage III</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4382</th>
+      <td>MALE</td>
+      <td>NaN</td>
+      <td>Lung Adenocarcinoma Mixed Subtype</td>
+      <td>Stage I</td>
+    </tr>
+  </tbody>
+</table>
+<p>522 rows × 4 columns</p>
+</div>
+
+
+```{code-block} python
+luad_data.data["MessengerRNA"]
+```
+<div>
+<table border="1" class="dataframe">
+  <thead>
+    <tr style="text-align: right;">
+      <th>gene_name</th>
+      <th>A1BG</th>
+      <th>A1BG-AS1</th>
+      <th>A1CF</th>
+      <th>A2M</th>
+      <th>A2ML1</th>
+      <th>A4GALT</th>
+      <th>A4GNT</th>
+      <th>AAAS</th>
+      <th>AACS</th>
+      <th>AACSP1</th>
+      <th>...</th>
+      <th>ZXDA</th>
+      <th>ZXDB</th>
+      <th>ZXDC</th>
+      <th>ZYG11A</th>
+      <th>ZYG11B</th>
+      <th>ZYX</th>
+      <th>ZZEF1</th>
+      <th>ZZZ3</th>
+      <th>psiTPTE22</th>
+      <th>tAKR</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>TCGA-05-4244-01A</th>
+      <td>4.756500</td>
+      <td>5.239211</td>
+      <td>0.000000</td>
+      <td>13.265291</td>
+      <td>0.431997</td>
+      <td>7.043317</td>
+      <td>1.033652</td>
+      <td>9.348765</td>
+      <td>9.652057</td>
+      <td>0.763921</td>
+      <td>...</td>
+      <td>5.350285</td>
+      <td>8.197321</td>
+      <td>9.907260</td>
+      <td>0.763921</td>
+      <td>10.088859</td>
+      <td>11.471139</td>
+      <td>9.768648</td>
+      <td>9.170597</td>
+      <td>2.932118</td>
+      <td>0.000000</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4249-01A</th>
+      <td>6.920471</td>
+      <td>7.056843</td>
+      <td>0.402722</td>
+      <td>14.650247</td>
+      <td>1.383939</td>
+      <td>9.178805</td>
+      <td>0.717123</td>
+      <td>9.241537</td>
+      <td>9.967223</td>
+      <td>0.000000</td>
+      <td>...</td>
+      <td>5.980428</td>
+      <td>8.950001</td>
+      <td>10.204971</td>
+      <td>4.411650</td>
+      <td>9.622978</td>
+      <td>11.199826</td>
+      <td>10.153700</td>
+      <td>9.433116</td>
+      <td>7.499637</td>
+      <td>0.000000</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4250-01A</th>
+      <td>5.696542</td>
+      <td>6.136327</td>
+      <td>0.000000</td>
+      <td>14.048541</td>
+      <td>0.000000</td>
+      <td>8.481646</td>
+      <td>0.996244</td>
+      <td>9.203535</td>
+      <td>9.560412</td>
+      <td>0.733962</td>
+      <td>...</td>
+      <td>5.931168</td>
+      <td>8.517334</td>
+      <td>9.722642</td>
+      <td>4.782796</td>
+      <td>8.895339</td>
+      <td>12.408981</td>
+      <td>10.194168</td>
+      <td>9.060342</td>
+      <td>2.867956</td>
+      <td>0.000000</td>
+    </tr>
+    <tr>
+      <th>TCGA-05-4382-01A</th>
+      <td>7.198727</td>
+      <td>6.809804</td>
+      <td>0.000000</td>
+      <td>14.509730</td>
+      <td>2.532591</td>
+      <td>9.117559</td>
+      <td>1.657045</td>
+      <td>9.251035</td>
+      <td>10.078124</td>
+      <td>1.860883</td>
+      <td>...</td>
+      <td>5.373036</td>
+      <td>8.441914</td>
+      <td>9.888267</td>
+      <td>6.041142</td>
+      <td>9.828389</td>
+      <td>12.725186</td>
+      <td>10.192589</td>
+      <td>9.376841</td>
+      <td>5.177029</td>
+      <td>0.000000</td>
+    </tr>
+  </tbody>
+</table>
+<p>576 rows × 20472 columns</p>
+</div>
+
+## Matching samples across different omics
+
+To match samples across different omics, use:
+
+```{code-block} python
+luad_data.match_samples(modalities=["MicroRNA", "MessengerRNA"])
+```
+
+    Index(['TCGA-05-4384-01A', 'TCGA-05-4390-01A', 'TCGA-05-4396-01A',
+           'TCGA-05-4405-01A', 'TCGA-05-4410-01A', 'TCGA-05-4415-01A',
+           'TCGA-05-4417-01A', 'TCGA-05-4424-01A', 'TCGA-05-4425-01A',
+           'TCGA-05-4427-01A',
+           ...
+           'TCGA-NJ-A4YG-01A', 'TCGA-NJ-A4YI-01A', 'TCGA-NJ-A4YP-01A',
+           'TCGA-NJ-A4YQ-01A', 'TCGA-NJ-A55A-01A', 'TCGA-NJ-A55O-01A',
+           'TCGA-NJ-A55R-01A', 'TCGA-NJ-A7XG-01A', 'TCGA-O1-A52J-01A',
+           'TCGA-S2-AA1A-01A'],
+          dtype='object', length=465)
+
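To make the idea of sample matching concrete, here is a minimal sketch in plain pandas of what intersecting sample indexes across omics can look like. It is not OpenOmics's implementation of `match_samples`. The helper `shared_patients` and the toy expression values are hypothetical and exist only for illustration. It assumes each DataFrame is indexed by TCGA sample barcodes whose first 12 characters identify the patient.

```python
import pandas as pd


def shared_patients(frames):
    """Return the sorted 12-character patient barcodes found in every DataFrame.

    Hypothetical helper for illustration; not part of the OpenOmics API.
    """
    shared = None
    for df in frames.values():
        # Truncate each sample barcode (e.g. TCGA-05-4249-01A) to its patient prefix.
        patients = pd.Index(df.index.astype(str).str.slice(0, 12)).unique()
        shared = patients if shared is None else shared.intersection(patients)
    return [] if shared is None else sorted(shared)


# Toy data: the expression values below are made up.
mrna = pd.DataFrame(
    {"A1BG": [1.0, 2.0, 3.0]},
    index=["TCGA-05-4244-01A", "TCGA-05-4249-01A", "TCGA-05-4250-01A"],
)
mirna = pd.DataFrame(
    {"hsa-let-7a-1": [0.5, 0.7]},
    index=["TCGA-05-4249-01A", "TCGA-05-4382-01A"],
)

matched = shared_patients({"MessengerRNA": mrna, "MicroRNA": mirna})
print(matched)  # ['TCGA-05-4249']
```

Truncating before intersecting means that two samples from the same patient count as a match even when their sample-type suffixes differ, which may or may not be what a given analysis needs.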