Switch to unified view

a b/docs/usage/getting-started.md
1
# Getting started
2
3
Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index which has links to use cases, and API reference sections.
4
5
## Loading a single-omics dataframe
6
7
Suppose you have a single-omics dataset and would like to load them as a dataframe.
8
9
As an example, we use the `TGCA` Lung Adenocarcinoma dataset
10
from [tests/data/TCGA_LUAD](https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD). Data tables are
11
tab-delimited and have the following format:
12
13
| GeneSymbol | EntrezID  | TCGA-05-4244-01A-01R-1107-07 | TCGA-05-4249-01A-01R-1107-07 | ...  |
14
| ---------- | --------- | ---------------------------- | ---------------------------- | ---- |
15
| A1BG       | 100133144 | 10.8123                      | 3.7927                       | ...  |
16
| ⋮ | ⋮ | ⋮ | ⋮ |
17
18
Depending on whether your data table is stored locally as a single file, splitted into multiple files, or was already a dataframe, you can load it using the class {class}`openomics.transcriptomics.Expression` or any of its subclasses.
19
20
````{tab} From a single file
21
If the dataset is a local file in a tabular format, OpenOmics can help you load them to Pandas dataframe.
22
23
```{code-block} python
24
from openomics.multiomics import MessengerRNA
25
26
mrna = MessengerRNA(
27
    data="https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt",
28
    transpose=True,
29
    usecols="GeneSymbol|TCGA", # A regex that matches all column name with either "GeneSymbol" or "TCGA substring
30
    gene_index="GeneSymbol", # This column contains the gene index
31
    )
32
```
33
34
One thing to pay attention is that the raw data file given is column-oriented where columns corresponds to samples, so we have use the argument `transpose=True` to convert to row-oriented.
35
> MessengerRNA (576, 20472)
36
````
37
38
````{tab} From multiple files (glob)
39
If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix. Assuming all the files have similar tabular format, OpenOmics can load all files and contruct an integrated data table using the memory-efficient Dask dataframe.
40
41
```python
42
from openomics.multiomics import MessengerRNA
43
44
mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally
45
                    transpose=True,
46
                    usecols="GeneSymbol|TCGA",
47
                    gene_index="GeneSymbol")
48
```
49
50
> INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt']
51
````
52
53
````{tab} From DataFrame
54
If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`.
55
56
```python
57
import pandas as pd
58
import numpy as np
59
from openomics.multiomics import MessengerRNA
60
61
# A random dataframe of microRNA gene_id's.
62
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5),
63
                        "ENSG00000198973": np.random.rand(5),
64
                        "ENSG00000198974": np.random.rand(5),
65
                        "ENSG00000198975": np.random.rand(5),
66
                        "ENSG00000198976": np.random.rand(5),
67
                        "ENSG00000198982": np.random.rand(5),
68
                        "ENSG00000198983": np.random.rand(5)},
69
                  index=range(5))
70
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id")
71
```
72
````
73
---
74
To access the {class}`DataFrame`, simply use {obj}`mrna.expressions`:
75
```python
76
print(mrna.expressions)
77
```
78
<div>
79
<style scoped>
80
    .dataframe tbody tr th:only-of-type {
81
        vertical-align: middle;
82
    }
83
84
    .dataframe tbody tr th {
85
        vertical-align: top;
86
    }
87
88
    .dataframe thead th {
89
        text-align: right;
90
    }
91
92
</style>
93
<table border="1" class="dataframe">
94
  <thead>
95
    <tr style="text-align: right;">
96
      <th>GeneSymbol</th>
97
      <th>A1BG</th>
98
      <th>A1BG-AS1</th>
99
      <th>A1CF</th>
100
      <th>A2M</th>
101
    </tr>
102
    <tr>
103
      <th>sample_index</th>
104
      <th></th>
105
      <th></th>
106
      <th></th>
107
      <th></th>
108
    </tr>
109
  </thead>
110
  <tbody>
111
    <tr>
112
      <th>TCGA-05-4244-01A-01R-1107-07</th>
113
      <td>26.0302</td>
114
      <td>36.7711</td>
115
      <td>0.000</td>
116
      <td>9844.7858</td>
117
    </tr>
118
    <tr>
119
      <th>TCGA-05-4249-01A-01R-1107-07</th>
120
      <td>120.1349</td>
121
      <td>132.1439</td>
122
      <td>0.322</td>
123
      <td>25712.6617</td>
124
    </tr>
125
  </tbody>
126
</table>
127
</div>
128
129
<br/>
130
131
## Creating a multi-omics dataset
132
133
With multiple single-omics, each with different sets of genes and samples, you can use the {class}`openomics.MultiOmics` to integrate them.
134
135
```{code-block} python
136
from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein
137
138
path = "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/"
139
140
# Load each expression dataframe
141
mRNA = MessengerRNA(path+"LUAD__geneExp.txt",
142
    transpose=True,
143
    usecols="GeneSymbol|TCGA",
144
    gene_index="GeneSymbol")
145
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt",
146
    transpose=True,
147
    usecols="GeneSymbol|TCGA",
148
    gene_index="GeneSymbol")
149
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv",
150
    transpose=True,
151
    usecols="Gene_ID|TCGA",
152
    gene_index="Gene_ID")
153
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt",
154
    transpose=True,
155
    usecols="GeneSymbol|TCGA",
156
    gene_index="GeneSymbol")
157
pro = Protein(path+"protein_RPPA.txt",
158
    transpose=True,
159
    usecols="GeneSymbol|TCGA",
160
    gene_index="GeneSymbol")
161
162
# Create an integrated MultiOmics dataset
163
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, mRNA, lncRNA, som, pro])
164
# You can also add individual -omics one at a time `luad_data.add_omic(mRNA)`
165
166
luad_data.build_samples()
167
```
168
The `luad_data` is a {class}`MultiOmics` object builds the samples list from all the samples given in each -omics data.
169
170
> MessengerRNA (576, 20472)
171
> MicroRNA (494, 1870)
172
> LncRNA (546, 12727)
173
> SomaticMutation (587, 21070)
174
> Protein (364, 154)
175
176
To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`:
177
```python
178
luad_data.MessengerRNA
179
# or
180
luad_data.data["MessengerRNA"]
181
```
182
183
<br/>
184
185
## Adding clinical data as sample attributes
186
187
When sample attributes are provided for the study cohort, load it as a data table with the {class}`openomics.clinical.ClinicalData`, then add it to the {class}`openomics.multiomics.MultiOmics` dataset to enable querying for subsets of samples across the multi-omics.
188
189
```python
190
from openomics import ClinicalData
191
192
clinical = ClinicalData(
193
    "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt",
194
    patient_index="bcr_patient_barcode")
195
196
luad_data.add_clinical_data(clinical)
197
198
luad_data.clinical.patient
199
```
200
201
<div>
202
<style scoped>
203
    .dataframe tbody tr th:only-of-type {
204
        vertical-align: middle;
205
    }
206
207
    .dataframe tbody tr th {
208
        vertical-align: top;
209
    }
210
211
    .dataframe thead th {
212
        text-align: right;
213
    }
214
215
</style>
216
<table border="1" class="dataframe">
217
  <thead>
218
    <tr style="text-align: right;">
219
      <th></th>
220
      <th>bcr_patient_uuid</th>
221
      <th>form_completion_date</th>
222
      <th>histologic_diagnosis</th>
223
      <th>prospective_collection</th>
224
      <th>retrospective_collection</th>
225
    </tr>
226
    <tr>
227
      <th>bcr_patient_barcode</th>
228
      <th></th>
229
      <th></th>
230
      <th></th>
231
      <th></th>
232
      <th></th>
233
    </tr>
234
  </thead>
235
  <tbody>
236
    <tr>
237
      <th>TCGA-05-4244</th>
238
      <td>34040b83-7e8a-4264-a551-b16621843e28</td>
239
      <td>2010-7-22</td>
240
      <td>Lung Adenocarcinoma</td>
241
      <td>NO</td>
242
      <td>YES</td>
243
    </tr>
244
    <tr>
245
      <th>TCGA-05-4245</th>
246
      <td>03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2</td>
247
      <td>2010-7-22</td>
248
      <td>Lung Adenocarcinoma</td>
249
      <td>NO</td>
250
      <td>YES</td>
251
    </tr>
252
    <tr>
253
      <th>TCGA-05-4249</th>
254
      <td>4addf05f-3668-4b3f-a17f-c0227329ca52</td>
255
      <td>2010-7-22</td>
256
      <td>Lung Adenocarcinoma</td>
257
      <td>NO</td>
258
      <td>YES</td>
259
    </tr>
260
  </tbody>
261
</table>
262
</div>
263
264
Note that in the clinical data table, `bcr_patient_barcode` is the column with `TCGA-XX-XXXX` patient IDs, which matches
265
that of the `sample_index` index column in the `mrna.expressions` dataframe.
266
267
````{note}
268
In our `TCGA_LUAD` example, mismatches in the `bcr_patient_barcode` sample index of clinical dataframe may happen because the `sample_index` in `mRNA` may have a longer form `TCGA-XX-XXXX-XXX-XXX-XXXX-XX` that contain the samples number and aliquot ID's. To make them match, you can modify the index strings on-the-fly using the [Pandas's extensible API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html):
269
```python
270
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters
271
```
272
````
273
274
<br/>
275
276
## Import an external database
277
278
Next, we may want to annotate the genes list in our RNA-seq expression dataset with genomics annotation. To do so, we'd need to download annotations from the [GENCODE database](https://www.gencodegenes.org/), preprocess annotation files into a dataframe, and then match them with the genes in our dataset.
279
280
OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps:
281
1. First, provide the base `path` of the FTP download server - usually found in the direct download link on GENCODE's website. Most of the time, selecting the right base `path` allows you to specify the specific species, genome assembly, and database version for your study.
282
2. Secondly, use the `file_resources` dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in the `file_resources`, the key is the alias of the file required, and the value is the filename with the FTP base `path`.
283
284
   For example, the entry `{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` indicates the GENCODE class to preprocess a `.gtf` file with the alias `"long_noncoding_RNAs.gtf"`, downloaded from the FTP path `ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz`
285
286
   To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}`openomics.database.sequence.GENCODE`.
287
288
```python
289
from openomics.database.sequence import GENCODE
290
291
gencode = GENCODE(
292
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
293
    file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz",
294
                    "basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz",
295
                    "lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences
296
                    "transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences
297
                    },
298
    blocksize='100MB', # if not null, then use partition the dataframe with Dask to this size and leverage out-of-core multiprocessing
299
)
300
```
301
To access the attributes constructed from the combination of annotations `long_noncoding_RNAs.gtf` and `
302
basic.annotation.gtf`, use:
303
304
```python
305
gencode.data
306
```
307
308
<div>
309
<style scoped>
310
    .dataframe tbody tr th:only-of-type {
311
        vertical-align: middle;
312
    }
313
314
    .dataframe tbody tr th {
315
        vertical-align: top;
316
    }
317
318
    .dataframe thead th {
319
        text-align: right;
320
    }
321
322
</style>
323
<table border="1" class="dataframe">
324
  <thead>
325
    <tr style="text-align: right;">
326
      <th></th>
327
      <th>gene_id</th>
328
      <th>gene_name</th>
329
      <th>index</th>
330
      <th>seqname</th>
331
      <th>source</th>
332
      <th>feature</th>
333
      <th>start</th>
334
      <th>end</th>
335
    </tr>
336
  </thead>
337
  <tbody>
338
    <tr>
339
      <th>0</th>
340
      <td>ENSG00000243485</td>
341
      <td>MIR1302-2HG</td>
342
      <td>0</td>
343
      <td>chr1</td>
344
      <td>HAVANA</td>
345
      <td>gene</td>
346
      <td>29554</td>
347
      <td>31109</td>
348
    </tr>
349
    <tr>
350
      <th>1</th>
351
      <td>ENSG00000243485</td>
352
      <td>MIR1302-2HG</td>
353
      <td>1</td>
354
      <td>chr1</td>
355
      <td>HAVANA</td>
356
      <td>transcript</td>
357
      <td>29554</td>
358
      <td>31097</td>
359
    </tr>
360
  </tbody>
361
</table>
362
</div>
363
364
365
<br/>
366
367
## Annotate your expression dataset with attributes
368
With the annotation database, you can perform a join operation to add gene attributes to your {class}`openomics.transcriptomics.Expression` dataset. To annotate attributes for the `gene_id` list `mRNA.expression`, you must first select the corresponding column in `gencode.data` with matching `gene_id` keys. The following are code snippets for a variety of database types.
369
370
````{tab} Genomics attributes
371
```python
372
luad_data.MessengerRNA.annotate_attributes(gencode,
373
    index="gene_id",
374
    columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe
375
    )
376
```
377
378
````
379
380
````{tab} Sequences
381
```python
382
luad_data.MessengerRNA.annotate_sequences(gencode,
383
    index="gene_name",
384
    agg_sequences="all", # Collect all sequences with the gene_name into a list
385
    )
386
```
387
````
388
389
````{tab} Disease Associations
390
```python
391
from openomics.database.disease import DisGeNet
392
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True)
393
394
luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name")
395
```
396
````
397
398
---
399
To view the resulting annotations dataframe, use:
400
```python
401
luad_data.MessengerRNA.annotations
402
```
403
404
405
For more detailed guide, refer to the [annotation interfaces API](../modules/openomics.annotate.md).