|
a |
|
b/docs/usage/getting-started.md |
|
|
1 |
# Getting started |
|
|
2 |
|
|
|
3 |
Welcome! This tutorial highlights the OpenOmics API’s core features; for in-depth details and conceptual guides, see the links within, or the documentation index which has links to use cases, and API reference sections. |
|
|
4 |
|
|
|
5 |
## Loading a single-omics dataframe |
|
|
6 |
|
|
|
7 |
Suppose you have a single-omics dataset and would like to load them as a dataframe. |
|
|
8 |
|
|
|
9 |
As an example, we use the `TGCA` Lung Adenocarcinoma dataset |
|
|
10 |
from [tests/data/TCGA_LUAD](https://github.com/JonnyTran/OpenOmics/tree/master/tests/data/TCGA_LUAD). Data tables are |
|
|
11 |
tab-delimited and have the following format: |
|
|
12 |
|
|
|
13 |
| GeneSymbol | EntrezID | TCGA-05-4244-01A-01R-1107-07 | TCGA-05-4249-01A-01R-1107-07 | ... | |
|
|
14 |
| ---------- | --------- | ---------------------------- | ---------------------------- | ---- | |
|
|
15 |
| A1BG | 100133144 | 10.8123 | 3.7927 | ... | |
|
|
16 |
| ⋮ | ⋮ | ⋮ | ⋮ | |
|
|
17 |
|
|
|
18 |
Depending on whether your data table is stored locally as a single file, splitted into multiple files, or was already a dataframe, you can load it using the class {class}`openomics.transcriptomics.Expression` or any of its subclasses. |
|
|
19 |
|
|
|
20 |
````{tab} From a single file |
|
|
21 |
If the dataset is a local file in a tabular format, OpenOmics can help you load them to Pandas dataframe. |
|
|
22 |
|
|
|
23 |
```{code-block} python |
|
|
24 |
from openomics.multiomics import MessengerRNA |
|
|
25 |
|
|
|
26 |
mrna = MessengerRNA( |
|
|
27 |
data="https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/LUAD__geneExp.txt", |
|
|
28 |
transpose=True, |
|
|
29 |
usecols="GeneSymbol|TCGA", # A regex that matches all column name with either "GeneSymbol" or "TCGA substring |
|
|
30 |
gene_index="GeneSymbol", # This column contains the gene index |
|
|
31 |
) |
|
|
32 |
``` |
|
|
33 |
|
|
|
34 |
One thing to pay attention is that the raw data file given is column-oriented where columns corresponds to samples, so we have use the argument `transpose=True` to convert to row-oriented. |
|
|
35 |
> MessengerRNA (576, 20472) |
|
|
36 |
```` |
|
|
37 |
|
|
|
38 |
````{tab} From multiple files (glob) |
|
|
39 |
If your dataset is large, it may be broken up into multiple files with a similar file name prefix/suffix. Assuming all the files have similar tabular format, OpenOmics can load all files and contruct an integrated data table using the memory-efficient Dask dataframe. |
|
|
40 |
|
|
|
41 |
```python |
|
|
42 |
from openomics.multiomics import MessengerRNA |
|
|
43 |
|
|
|
44 |
mrna = MessengerRNA("TCGA_LUAD/LUAD__*", # Files must be stored locally |
|
|
45 |
transpose=True, |
|
|
46 |
usecols="GeneSymbol|TCGA", |
|
|
47 |
gene_index="GeneSymbol") |
|
|
48 |
``` |
|
|
49 |
|
|
|
50 |
> INFO: Files matched: ['LUAD__miRNAExp__RPM.txt', 'LUAD__protein_RPPA.txt', 'LUAD__geneExp.txt'] |
|
|
51 |
```` |
|
|
52 |
|
|
|
53 |
````{tab} From DataFrame |
|
|
54 |
If your workflow already produced a dataframe, you can encapsulate it directly with {class}`openomics.transcriptomics.Expression`. |
|
|
55 |
|
|
|
56 |
```python |
|
|
57 |
import pandas as pd |
|
|
58 |
import numpy as np |
|
|
59 |
from openomics.multiomics import MessengerRNA |
|
|
60 |
|
|
|
61 |
# A random dataframe of microRNA gene_id's. |
|
|
62 |
df = pd.DataFrame(data={"ENSG00000194717": np.random.rand(5), |
|
|
63 |
"ENSG00000198973": np.random.rand(5), |
|
|
64 |
"ENSG00000198974": np.random.rand(5), |
|
|
65 |
"ENSG00000198975": np.random.rand(5), |
|
|
66 |
"ENSG00000198976": np.random.rand(5), |
|
|
67 |
"ENSG00000198982": np.random.rand(5), |
|
|
68 |
"ENSG00000198983": np.random.rand(5)}, |
|
|
69 |
index=range(5)) |
|
|
70 |
mrna = MessengerRNA(df, transpose=False, sample_level="sample_id") |
|
|
71 |
``` |
|
|
72 |
```` |
|
|
73 |
--- |
|
|
74 |
To access the {class}`DataFrame`, simply use {obj}`mrna.expressions`: |
|
|
75 |
```python |
|
|
76 |
print(mrna.expressions) |
|
|
77 |
``` |
|
|
78 |
<div> |
|
|
79 |
<style scoped> |
|
|
80 |
.dataframe tbody tr th:only-of-type { |
|
|
81 |
vertical-align: middle; |
|
|
82 |
} |
|
|
83 |
|
|
|
84 |
.dataframe tbody tr th { |
|
|
85 |
vertical-align: top; |
|
|
86 |
} |
|
|
87 |
|
|
|
88 |
.dataframe thead th { |
|
|
89 |
text-align: right; |
|
|
90 |
} |
|
|
91 |
|
|
|
92 |
</style> |
|
|
93 |
<table border="1" class="dataframe"> |
|
|
94 |
<thead> |
|
|
95 |
<tr style="text-align: right;"> |
|
|
96 |
<th>GeneSymbol</th> |
|
|
97 |
<th>A1BG</th> |
|
|
98 |
<th>A1BG-AS1</th> |
|
|
99 |
<th>A1CF</th> |
|
|
100 |
<th>A2M</th> |
|
|
101 |
</tr> |
|
|
102 |
<tr> |
|
|
103 |
<th>sample_index</th> |
|
|
104 |
<th></th> |
|
|
105 |
<th></th> |
|
|
106 |
<th></th> |
|
|
107 |
<th></th> |
|
|
108 |
</tr> |
|
|
109 |
</thead> |
|
|
110 |
<tbody> |
|
|
111 |
<tr> |
|
|
112 |
<th>TCGA-05-4244-01A-01R-1107-07</th> |
|
|
113 |
<td>26.0302</td> |
|
|
114 |
<td>36.7711</td> |
|
|
115 |
<td>0.000</td> |
|
|
116 |
<td>9844.7858</td> |
|
|
117 |
</tr> |
|
|
118 |
<tr> |
|
|
119 |
<th>TCGA-05-4249-01A-01R-1107-07</th> |
|
|
120 |
<td>120.1349</td> |
|
|
121 |
<td>132.1439</td> |
|
|
122 |
<td>0.322</td> |
|
|
123 |
<td>25712.6617</td> |
|
|
124 |
</tr> |
|
|
125 |
</tbody> |
|
|
126 |
</table> |
|
|
127 |
</div> |
|
|
128 |
|
|
|
129 |
<br/> |
|
|
130 |
|
|
|
131 |
## Creating a multi-omics dataset |
|
|
132 |
|
|
|
133 |
With multiple single-omics, each with different sets of genes and samples, you can use the {class}`openomics.MultiOmics` to integrate them. |
|
|
134 |
|
|
|
135 |
```{code-block} python |
|
|
136 |
from openomics.multiomics import MultiOmics, MessengerRNA, MicroRNA, LncRNA, SomaticMutation, Protein |
|
|
137 |
|
|
|
138 |
path = "https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/" |
|
|
139 |
|
|
|
140 |
# Load each expression dataframe |
|
|
141 |
mRNA = MessengerRNA(path+"LUAD__geneExp.txt", |
|
|
142 |
transpose=True, |
|
|
143 |
usecols="GeneSymbol|TCGA", |
|
|
144 |
gene_index="GeneSymbol") |
|
|
145 |
miRNA = MicroRNA(path+"LUAD__miRNAExp__RPM.txt", |
|
|
146 |
transpose=True, |
|
|
147 |
usecols="GeneSymbol|TCGA", |
|
|
148 |
gene_index="GeneSymbol") |
|
|
149 |
lncRNA = LncRNA(path+"TCGA-rnaexpr.tsv", |
|
|
150 |
transpose=True, |
|
|
151 |
usecols="Gene_ID|TCGA", |
|
|
152 |
gene_index="Gene_ID") |
|
|
153 |
som = SomaticMutation(path+"LUAD__somaticMutation_geneLevel.txt", |
|
|
154 |
transpose=True, |
|
|
155 |
usecols="GeneSymbol|TCGA", |
|
|
156 |
gene_index="GeneSymbol") |
|
|
157 |
pro = Protein(path+"protein_RPPA.txt", |
|
|
158 |
transpose=True, |
|
|
159 |
usecols="GeneSymbol|TCGA", |
|
|
160 |
gene_index="GeneSymbol") |
|
|
161 |
|
|
|
162 |
# Create an integrated MultiOmics dataset |
|
|
163 |
luad_data = MultiOmics(cohort_name="LUAD", omics_data=[mRNA, mRNA, lncRNA, som, pro]) |
|
|
164 |
# You can also add individual -omics one at a time `luad_data.add_omic(mRNA)` |
|
|
165 |
|
|
|
166 |
luad_data.build_samples() |
|
|
167 |
``` |
|
|
168 |
The `luad_data` is a {class}`MultiOmics` object builds the samples list from all the samples given in each -omics data. |
|
|
169 |
|
|
|
170 |
> MessengerRNA (576, 20472) |
|
|
171 |
> MicroRNA (494, 1870) |
|
|
172 |
> LncRNA (546, 12727) |
|
|
173 |
> SomaticMutation (587, 21070) |
|
|
174 |
> Protein (364, 154) |
|
|
175 |
|
|
|
176 |
To access individual -omics data within `luad_data`, such as the {obj}`mRNA`, simply use the `.` accessor with the class name {class}`MessengerRNA`: |
|
|
177 |
```python |
|
|
178 |
luad_data.MessengerRNA |
|
|
179 |
# or |
|
|
180 |
luad_data.data["MessengerRNA"] |
|
|
181 |
``` |
|
|
182 |
|
|
|
183 |
<br/> |
|
|
184 |
|
|
|
185 |
## Adding clinical data as sample attributes |
|
|
186 |
|
|
|
187 |
When sample attributes are provided for the study cohort, load it as a data table with the {class}`openomics.clinical.ClinicalData`, then add it to the {class}`openomics.multiomics.MultiOmics` dataset to enable querying for subsets of samples across the multi-omics. |
|
|
188 |
|
|
|
189 |
```python |
|
|
190 |
from openomics import ClinicalData |
|
|
191 |
|
|
|
192 |
clinical = ClinicalData( |
|
|
193 |
"https://raw.githubusercontent.com/JonnyTran/OpenOmics/master/tests/data/TCGA_LUAD/nationwidechildrens.org_clinical_patient_luad.txt", |
|
|
194 |
patient_index="bcr_patient_barcode") |
|
|
195 |
|
|
|
196 |
luad_data.add_clinical_data(clinical) |
|
|
197 |
|
|
|
198 |
luad_data.clinical.patient |
|
|
199 |
``` |
|
|
200 |
|
|
|
201 |
<div> |
|
|
202 |
<style scoped> |
|
|
203 |
.dataframe tbody tr th:only-of-type { |
|
|
204 |
vertical-align: middle; |
|
|
205 |
} |
|
|
206 |
|
|
|
207 |
.dataframe tbody tr th { |
|
|
208 |
vertical-align: top; |
|
|
209 |
} |
|
|
210 |
|
|
|
211 |
.dataframe thead th { |
|
|
212 |
text-align: right; |
|
|
213 |
} |
|
|
214 |
|
|
|
215 |
</style> |
|
|
216 |
<table border="1" class="dataframe"> |
|
|
217 |
<thead> |
|
|
218 |
<tr style="text-align: right;"> |
|
|
219 |
<th></th> |
|
|
220 |
<th>bcr_patient_uuid</th> |
|
|
221 |
<th>form_completion_date</th> |
|
|
222 |
<th>histologic_diagnosis</th> |
|
|
223 |
<th>prospective_collection</th> |
|
|
224 |
<th>retrospective_collection</th> |
|
|
225 |
</tr> |
|
|
226 |
<tr> |
|
|
227 |
<th>bcr_patient_barcode</th> |
|
|
228 |
<th></th> |
|
|
229 |
<th></th> |
|
|
230 |
<th></th> |
|
|
231 |
<th></th> |
|
|
232 |
<th></th> |
|
|
233 |
</tr> |
|
|
234 |
</thead> |
|
|
235 |
<tbody> |
|
|
236 |
<tr> |
|
|
237 |
<th>TCGA-05-4244</th> |
|
|
238 |
<td>34040b83-7e8a-4264-a551-b16621843e28</td> |
|
|
239 |
<td>2010-7-22</td> |
|
|
240 |
<td>Lung Adenocarcinoma</td> |
|
|
241 |
<td>NO</td> |
|
|
242 |
<td>YES</td> |
|
|
243 |
</tr> |
|
|
244 |
<tr> |
|
|
245 |
<th>TCGA-05-4245</th> |
|
|
246 |
<td>03d09c05-49ab-4ba6-a8d7-e7ccf71fafd2</td> |
|
|
247 |
<td>2010-7-22</td> |
|
|
248 |
<td>Lung Adenocarcinoma</td> |
|
|
249 |
<td>NO</td> |
|
|
250 |
<td>YES</td> |
|
|
251 |
</tr> |
|
|
252 |
<tr> |
|
|
253 |
<th>TCGA-05-4249</th> |
|
|
254 |
<td>4addf05f-3668-4b3f-a17f-c0227329ca52</td> |
|
|
255 |
<td>2010-7-22</td> |
|
|
256 |
<td>Lung Adenocarcinoma</td> |
|
|
257 |
<td>NO</td> |
|
|
258 |
<td>YES</td> |
|
|
259 |
</tr> |
|
|
260 |
</tbody> |
|
|
261 |
</table> |
|
|
262 |
</div> |
|
|
263 |
|
|
|
264 |
Note that in the clinical data table, `bcr_patient_barcode` is the column with `TCGA-XX-XXXX` patient IDs, which matches |
|
|
265 |
that of the `sample_index` index column in the `mrna.expressions` dataframe. |
|
|
266 |
|
|
|
267 |
````{note} |
|
|
268 |
In our `TCGA_LUAD` example, mismatches in the `bcr_patient_barcode` sample index of clinical dataframe may happen because the `sample_index` in `mRNA` may have a longer form `TCGA-XX-XXXX-XXX-XXX-XXXX-XX` that contain the samples number and aliquot ID's. To make them match, you can modify the index strings on-the-fly using the [Pandas's extensible API](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html): |
|
|
269 |
```python |
|
|
270 |
mRNA.expressions.index = mRNA.expressions.index.str.slice(0, 12) # Selects only the first 12 characters |
|
|
271 |
``` |
|
|
272 |
```` |
|
|
273 |
|
|
|
274 |
<br/> |
|
|
275 |
|
|
|
276 |
## Import an external database |
|
|
277 |
|
|
|
278 |
Next, we may want to annotate the genes list in our RNA-seq expression dataset with genomics annotation. To do so, we'd need to download annotations from the [GENCODE database](https://www.gencodegenes.org/), preprocess annotation files into a dataframe, and then match them with the genes in our dataset. |
|
|
279 |
|
|
|
280 |
OpenOmics provides a simple, hassle-free API to download the GENCODE annotation files via FTP with these steps: |
|
|
281 |
1. First, provide the base `path` of the FTP download server - usually found in the direct download link on GENCODE's website. Most of the time, selecting the right base `path` allows you to specify the specific species, genome assembly, and database version for your study. |
|
|
282 |
2. Secondly, use the `file_resources` dict parameter to select the data files and the file paths required to construct the annotation dataframe. For each entry in the `file_resources`, the key is the alias of the file required, and the value is the filename with the FTP base `path`. |
|
|
283 |
|
|
|
284 |
For example, the entry `{"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz"}` indicates the GENCODE class to preprocess a `.gtf` file with the alias `"long_noncoding_RNAs.gtf"`, downloaded from the FTP path `ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.long_noncoding_RNAs.gtf.gz` |
|
|
285 |
|
|
|
286 |
To see which file alias keys are required to construct a dataframe, refer to the docstring in {class}`openomics.database.sequence.GENCODE`. |
|
|
287 |
|
|
|
288 |
```python |
|
|
289 |
from openomics.database.sequence import GENCODE |
|
|
290 |
|
|
|
291 |
gencode = GENCODE( |
|
|
292 |
path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/", |
|
|
293 |
file_resources={"long_noncoding_RNAs.gtf": "gencode.v32.long_noncoding_RNAs.gtf.gz", |
|
|
294 |
"basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz", |
|
|
295 |
"lncRNA_transcripts.fa": "gencode.v32.lncRNA_transcripts.fa.gz", # lncRNA sequences |
|
|
296 |
"transcripts.fa": "gencode.v32.transcripts.fa.gz" # mRNA sequences |
|
|
297 |
}, |
|
|
298 |
blocksize='100MB', # if not null, then use partition the dataframe with Dask to this size and leverage out-of-core multiprocessing |
|
|
299 |
) |
|
|
300 |
``` |
|
|
301 |
To access the attributes constructed from the combination of annotations `long_noncoding_RNAs.gtf` and ` |
|
|
302 |
basic.annotation.gtf`, use: |
|
|
303 |
|
|
|
304 |
```python |
|
|
305 |
gencode.data |
|
|
306 |
``` |
|
|
307 |
|
|
|
308 |
<div> |
|
|
309 |
<style scoped> |
|
|
310 |
.dataframe tbody tr th:only-of-type { |
|
|
311 |
vertical-align: middle; |
|
|
312 |
} |
|
|
313 |
|
|
|
314 |
.dataframe tbody tr th { |
|
|
315 |
vertical-align: top; |
|
|
316 |
} |
|
|
317 |
|
|
|
318 |
.dataframe thead th { |
|
|
319 |
text-align: right; |
|
|
320 |
} |
|
|
321 |
|
|
|
322 |
</style> |
|
|
323 |
<table border="1" class="dataframe"> |
|
|
324 |
<thead> |
|
|
325 |
<tr style="text-align: right;"> |
|
|
326 |
<th></th> |
|
|
327 |
<th>gene_id</th> |
|
|
328 |
<th>gene_name</th> |
|
|
329 |
<th>index</th> |
|
|
330 |
<th>seqname</th> |
|
|
331 |
<th>source</th> |
|
|
332 |
<th>feature</th> |
|
|
333 |
<th>start</th> |
|
|
334 |
<th>end</th> |
|
|
335 |
</tr> |
|
|
336 |
</thead> |
|
|
337 |
<tbody> |
|
|
338 |
<tr> |
|
|
339 |
<th>0</th> |
|
|
340 |
<td>ENSG00000243485</td> |
|
|
341 |
<td>MIR1302-2HG</td> |
|
|
342 |
<td>0</td> |
|
|
343 |
<td>chr1</td> |
|
|
344 |
<td>HAVANA</td> |
|
|
345 |
<td>gene</td> |
|
|
346 |
<td>29554</td> |
|
|
347 |
<td>31109</td> |
|
|
348 |
</tr> |
|
|
349 |
<tr> |
|
|
350 |
<th>1</th> |
|
|
351 |
<td>ENSG00000243485</td> |
|
|
352 |
<td>MIR1302-2HG</td> |
|
|
353 |
<td>1</td> |
|
|
354 |
<td>chr1</td> |
|
|
355 |
<td>HAVANA</td> |
|
|
356 |
<td>transcript</td> |
|
|
357 |
<td>29554</td> |
|
|
358 |
<td>31097</td> |
|
|
359 |
</tr> |
|
|
360 |
</tbody> |
|
|
361 |
</table> |
|
|
362 |
</div> |
|
|
363 |
|
|
|
364 |
|
|
|
365 |
<br/> |
|
|
366 |
|
|
|
367 |
## Annotate your expression dataset with attributes |
|
|
368 |
With the annotation database, you can perform a join operation to add gene attributes to your {class}`openomics.transcriptomics.Expression` dataset. To annotate attributes for the `gene_id` list `mRNA.expression`, you must first select the corresponding column in `gencode.data` with matching `gene_id` keys. The following are code snippets for a variety of database types. |
|
|
369 |
|
|
|
370 |
````{tab} Genomics attributes |
|
|
371 |
```python |
|
|
372 |
luad_data.MessengerRNA.annotate_attributes(gencode, |
|
|
373 |
index="gene_id", |
|
|
374 |
columns=['gene_name', 'start', 'end', 'strand'] # Add these columns to the .annotations dataframe |
|
|
375 |
) |
|
|
376 |
``` |
|
|
377 |
|
|
|
378 |
```` |
|
|
379 |
|
|
|
380 |
````{tab} Sequences |
|
|
381 |
```python |
|
|
382 |
luad_data.MessengerRNA.annotate_sequences(gencode, |
|
|
383 |
index="gene_name", |
|
|
384 |
agg_sequences="all", # Collect all sequences with the gene_name into a list |
|
|
385 |
) |
|
|
386 |
``` |
|
|
387 |
```` |
|
|
388 |
|
|
|
389 |
````{tab} Disease Associations |
|
|
390 |
```python |
|
|
391 |
from openomics.database.disease import DisGeNet |
|
|
392 |
disgenet = DisGeNet(path="https://www.disgenet.org/static/disgenet_ap1/files/downloads/", curated=True) |
|
|
393 |
|
|
|
394 |
luad_data.MessengerRNA.annotate_diseases(disgenet, index="gene_name") |
|
|
395 |
``` |
|
|
396 |
```` |
|
|
397 |
|
|
|
398 |
--- |
|
|
399 |
To view the resulting annotations dataframe, use: |
|
|
400 |
```python |
|
|
401 |
luad_data.MessengerRNA.annotations |
|
|
402 |
``` |
|
|
403 |
|
|
|
404 |
|
|
|
405 |
For more detailed guide, refer to the [annotation interfaces API](../modules/openomics.annotate.md). |