---
title: 'OpenOmics: A bioinformatics API to integrate multi-omics datasets and interface with public databases.'
tags:
  - Python
  - bioinformatics
  - multiomics
  - data integration
  - big data
authors:
  - name: Nhat C. Tran^[corresponding author]
    orcid: 0000-0002-2575-9633
    affiliation: 1
  - name: Jean X. Gao
    affiliation: 1
affiliations:
  - name: Department of Computer Science and Engineering, The University of Texas at Arlington
    index: 1
date: 25 January 2021
bibliography: paper.bib
---

# Summary

Leveraging large-scale multi-omics data is emerging as the primary approach for systemic research of human diseases and general biological processes. Although data integration and feature engineering are vital steps in these bioinformatics projects, there is currently no tool for standardized preprocessing of heterogeneous multi-omics and annotation data within the context of a clinical cohort. OpenOmics is a Python library for integrating heterogeneous multi-omics data and interfacing with popular public annotation databases, e.g., GENCODE, Ensembl, and BioGRID. The library is designed to be highly flexible, allowing the user to parameterize the construction of integrated datasets; interactive, to assist complex exploratory data analyses; and scalable, to facilitate working with large datasets on standard machines. In this paper, we demonstrate the software design choices that support the wide-ranging use cases of OpenOmics, with the goal of maximizing the usability and reproducibility of the data integration framework.

# Statement of need

Recent advances in sequencing technology and computational methods have enabled the generation of large-scale, high-throughput multi-omics data [@lappalainen2013transcriptome], providing unprecedented research opportunities for cancer and other diseases. These methods have already been applied to a number of problems within bioinformatics, including several integrative disease studies [@zhang2014proteogenomic; @cancer2014comprehensive; @ren2016integration; @hassan2020integration]. In addition to the genome-wide measurements of different genetic characterizations, the growing public knowledge base of functional annotations [@rnacentral2016rnacentral; @derrien2012gencode], experimentally verified interactions [@chou2015mirtarbase; @yuan2013npinter; @chou2017mirtarbase; @oughtred2019biogrid], and gene-disease associations [@huang2018hmdd; @pinero2016disgenet; @chen2012lncrnadisease] provides the prior knowledge essential for system-level analyses. Leveraging these data sources allows for a systematic investigation of disease mechanisms at multiple molecular and regulatory layers; however, such a task remains nontrivial due to the complexity of multi-omics data.

While researchers have developed several mature tools to access or analyze a particular single-omic data type [@wolf2018scanpy; @stuart2019integrative], the current state of integrative data platforms for multi-omics data is lacking for three reasons. First, pipelines for data integration carry out sequential tasks that do not process multi-omics datasets holistically. Second, the vast size and heterogeneity of the data pose a challenge for the necessary data storage and computational processing. Third, implementations of data pipelines are closed-ended for downstream analysis or not conducive to data exploration use cases. Additionally, there is a need for increased transparency in the process of multi-omics data integration, and a standardized data preprocessing strategy is important for the interpretation and exchange of bioinformatics projects. Currently, very few systems support standardized handling of multi-omics datasets while also allowing the integrated dataset to be queried within the context of a clinical cohort.

# Related works

There are several existing platforms that aid in the integration of multi-omics data, such as Galaxy, Anduril, MixOmics, and O-Miner. First, Galaxy [@boekel2015multi] and Anduril [@cervera2019anduril] are mature platforms with established workflow frameworks for genomic and transcriptomic data analysis. Galaxy contains hundreds of state-of-the-art tools in these core domains for processing and assembling high-throughput sequencing data. Second, MixOmics [@rohart2017mixomics] is an R library dedicated to the multivariate analysis of biological data sets, with a specific focus on data exploration, dimension reduction, and visualisation. Third, O-Miner [@sangaralingam2019multi] is a web tool that provides a pipeline for the analysis of both transcriptomic and genomic data, starting from raw image files through in-depth bioinformatics analysis. However, as large-scale multi-omics analysis demands continue to grow, technologies and data analysis needs must continually adapt to big data. For instance, the data manipulation required for multi-omics integration involves a multitude of complex operations, but the point-and-click interface of existing Galaxy tools can be limiting or computationally inefficient. Although the MixOmics toolkit provides an R programming interface, it does not yet leverage high-performance distributed storage or computing resources. Finally, while O-Miner can perform end-to-end analysis in an integrated platform, its interim analysis results cannot be exported elsewhere for downstream analysis.

# The OpenOmics library

OpenOmics consists of two core modules: multi-omics integration and annotation interface. An overview visualization of the OpenOmics system architecture is provided in \autoref{architecture}.

## Multi-omics integration

Tabular data are everywhere in bioinformatics. To record expression quantifications, annotations, or variant calls, data are typically stored in various tabular-like formats, such as BED, GTF, MAF, and VCF, which can be preprocessed and normalized to row-indexed formats. Given any processed single-omic dataset, the library generalizes the data as a tabular structure where rows correspond to observation samples and columns correspond to measurements of different biomolecules. The core functionality of the Multi-omics Integration module is to integrate the multiple single-omic datasets over their overlapping samples. By generating multi-omics data for the same set of samples, our tool provides the data structure necessary to develop insights into the flow of biological information across the genome, epigenome, transcriptome, proteome, metabolome, and phenome levels. The user can import and integrate the following supported omic types (a minimal sketch of the integration step follows the list):

- Genomics: single nucleotide variants (SNV), copy number variation (CNV)
- Epigenomics: DNA methylation
- Transcriptomics: RNA-Seq, miRNA expression, lncRNA expression, microarrays
- Proteomics: reverse phase protein array (RPPA), iTRAQ
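
As a minimal sketch of the integration step, the following pandas example aligns two single-omic tables on their overlapping samples; the table contents and variable names are illustrative and not part of the OpenOmics API.

```python
import pandas as pd

# Two single-omic tables: rows are observation samples, columns are biomolecules.
mrna = pd.DataFrame({"TP53": [5.1, 3.2, 4.8], "EGFR": [2.0, 2.7, 1.9]},
                    index=["sample_A", "sample_B", "sample_C"])
mirna = pd.DataFrame({"hsa-mir-21": [8.3, 7.1], "hsa-mir-155": [4.4, 5.0]},
                     index=["sample_B", "sample_C"])

# Integration keeps only the samples observed in every omic layer.
shared_samples = mrna.index.intersection(mirna.index)
multiomics = {"MessengerRNA": mrna.loc[shared_samples],
              "MicroRNA": mirna.loc[shared_samples]}
```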

After importing each single-omic dataset, OpenOmics stores it as a Pandas DataFrame [@mckinney-proc-scipy-2010], which is flexible for a wide range of tabular operations. For instance, the user is presented with several functions for preprocessing the expression quantifications to normalize values, filter outliers, or reduce noise.
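
The sketch below illustrates the kind of preprocessing described above; the helper function is hypothetical, written with plain pandas/NumPy rather than the library's own preprocessing functions.

```python
import numpy as np
import pandas as pd

def preprocess_expressions(df, min_mean=0.5):
    """Log-transform an expression table (samples x genes), then drop
    weakly expressed genes as a simple noise filter."""
    df = np.log2(df + 1)                          # variance-stabilizing transform
    return df.loc[:, df.mean(axis=0) >= min_mean]  # keep adequately expressed genes

expr = pd.DataFrame({"TP53": [5.1, 3.2], "LINC01128": [0.1, 0.0]},
                    index=["sample_A", "sample_B"])
filtered = preprocess_expressions(expr)  # drops the low-expression column
```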

Within a study cohort, the clinical characteristics are crucial for the study of a disease or biological phenomenon. The user can characterize the set of samples using the Clinical Data structure, which comprises two levels: Patient and Biospecimen. A Patient can have attribute fields on demographics, clinical diagnosis, disease progression, treatment responses, and survival outcomes. Typically, multi-omics data observations are captured at the Biospecimen level, and each Patient can have multiple Biospecimens. OpenOmics tracks the IDs of biospecimens and the patient each belongs to, so the multi-omics data are organized in a hierarchical order that enables aggregated operations.
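
The two-level structure can be pictured as two linked tables, sketched below with illustrative columns (the field names are examples, not a fixed schema):

```python
import pandas as pd

# Patient level: one row per patient, with demographic/outcome attributes.
patients = pd.DataFrame({"age": [63, 57], "vital_status": ["alive", "deceased"]},
                        index=pd.Index(["P01", "P02"], name="patient_id"))

# Biospecimen level: each biospecimen records the patient it belongs to.
biospecimens = pd.DataFrame(
    {"patient_id": ["P01", "P01", "P02"], "tissue": ["tumor", "normal", "tumor"]},
    index=pd.Index(["P01-S1", "P01-S2", "P02-S1"], name="biospecimen_id"))

# The patient_id mapping enables aggregated, patient-level operations.
specimens_per_patient = biospecimens.groupby("patient_id").size()
```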

## Annotation interface

After importing and integrating the multi-omics data, the user can supplement their dataset with various annotation attributes from public data repositories such as GENCODE, Ensembl, and RNAcentral. With just a few operations, the user can easily download a data repository of choice, select relevant attributes, and efficiently join a variable number of annotation columns to their genomics, transcriptomics, and proteomics data. The full list of databases and the availability of annotation attributes is given in Table 1.
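
The join itself is equivalent to the following pandas sketch, with a toy annotation table standing in for a downloaded database:

```python
import pandas as pd

# Toy annotation table indexed by gene name (a stand-in for, e.g., GENCODE).
annotations = pd.DataFrame(
    {"gene_type": ["protein_coding", "protein_coding"],
     "chromosome": ["chr17", "chr7"]},
    index=pd.Index(["TP53", "EGFR"], name="gene_name"))

# Expression table with genes as rows; join the selected annotation columns.
expression = pd.DataFrame({"sample_A": [5.1, 2.0]},
                          index=pd.Index(["TP53", "EGFR"], name="gene_name"))
annotated = expression.join(annotations[["gene_type"]], how="left")
```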

For each public database, the Annotation Interface module provides a series of interfaces to perform specific importing, preprocessing, and annotation tasks. At the import step, the module can either fetch the database files via a File Transfer Protocol (FTP) URL or load a locally downloaded file. At this step, the user can specify the species, genome build, and version of the database by providing an FTP URL of choice. To streamline this process, the module automatically caches downloaded files to disk, uncompresses them, and handles different file extensions, including FASTA, GTF, VCF, and other tabular formats. Then, at the preprocessing step, the module selects only the relevant attribute fields specified by the user and performs the necessary data cleaning. Finally, the annotation data can be joined to an omics dataset with a SQL-like join operation on a user-specified index of biomolecule names or IDs. If the user wishes to import an annotation database not yet included in OpenOmics, they can extend the Annotation Dataset API to specify their own importing, preprocessing, and annotation tasks in an object-oriented manner.
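
A minimal sketch of such an extension is shown below; the class and method names are hypothetical and only illustrate the import/preprocess/annotate pattern described above, not the library's actual base class.

```python
import pandas as pd

class MyAnnotationDataset:
    """Hypothetical annotation dataset following the three-step pattern."""

    def __init__(self, path, usecols=None):
        self.df = self.load_data(path, usecols)  # import step

    def load_data(self, path, usecols):
        # Read a locally downloaded tab-separated file, keeping only the
        # attribute fields the user asked for (preprocessing step).
        return pd.read_csv(path, sep="\t", usecols=usecols)

    def annotate(self, omics_df, on="gene_name"):
        # Annotation step: SQL-like left join on the chosen biomolecule index.
        return omics_df.join(self.df.set_index(on), how="left")
```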

An innovative feature of our integration module is the ability to cross-reference gene IDs between different annotation systems or data sources. When importing a dataset, the user can specify the level of the genomic index, such as the gene, transcript, protein, or peptide level, and whether it is a gene name or a gene ID. Since multiple single-omics datasets can use different gene nomenclatures, the user can convert between the different gene indexing methods by reindexing the annotation dataframe with an index column of choice. This not only allows the Annotation Interface to select and join the annotation data to the correct index level, but also allows the user to customize the selection and aggregation of biological measurements at different levels.
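
Reindexing reduces to a mapping between nomenclature columns, as in this sketch (the two Ensembl IDs shown are the real accessions for TP53 and EGFR):

```python
import pandas as pd

annotations = pd.DataFrame({"gene_id": ["ENSG00000141510", "ENSG00000146648"],
                            "gene_name": ["TP53", "EGFR"]})

# Reindex the annotation table by gene name to build a conversion map ...
name_to_id = annotations.set_index("gene_name")["gene_id"]

# ... then translate a dataset's gene-name index into Ensembl gene IDs.
gene_names = pd.Index(["TP53", "EGFR"])
ensembl_ids = gene_names.map(name_to_id)
```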

| Data Repository | Annotation Data Available                       | Index     | # entries  |
| --------------- | ----------------------------------------------- | --------- | ---------- |
| GENCODE         | Genomic annotations, primary sequence           | RNAs      | 60,660     |
| Ensembl         | Genomic annotations                             | Genes     | 232,186    |
| miRBase         | MicroRNA sequences and annotations              | MicroRNAs | 38,589     |
| RNAcentral      | ncRNA sequence and annotation collection        | ncRNAs    | 14,784,981 |
| NONCODE         | lncRNA sequences and annotations                | LncRNAs   | 173,112    |
| lncrnadb        | lncRNA functional annotations                   | LncRNAs   | 100        |
| Pfam            | Protein family annotations                      | Proteins  | 18,259     |
| Rfam            | RNA family annotations                          | ncRNAs    | 2,600      |
| Gene Ontology   | Functional, cellular, and molecular annotations | Genes     | 44,117     |
| KEGG            | High-level functional pathways                  | Genes     | 22,409     |
| DisGeNET        | Gene-disease associations                       | Genes     | 1,134,942  |
| HMDD            | MicroRNA-disease associations                   | MicroRNAs | 35,547     |
| lncRNAdisease   | lncRNA-disease associations                     | LncRNAs   | 3,000      |
| OMIM            | Ontology of human diseases                      | Diseases  | 25,670     |

Table 1: Public annotation databases and the availability of data for the human genome.

# System design

This section describes the implementation details behind the scalable processing and efficient data storage, as well as the design choices in our development operations.

While the in-memory Pandas dataframes utilized in our data structures are fast, they have size and speed limitations when the dataset size approaches the system memory limit. When this is an issue, the user can enable out-of-memory distributed data processing for all OpenOmics operations, implemented with the Dask framework [@matthew_rocklin-proc-scipy-2015]. When memory resources are limited, data in a Dask dataframe can be read directly from disk and are only brought into memory when needed during computations. When performing data query operations on Dask dataframes, a task graph containing each operation is built and is only evaluated on command, a process called lazy evaluation.
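
The following standalone Dask sketch shows this behavior: building the query only constructs a task graph, and nothing is computed until `.compute()` is called.

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"gene": ["TP53", "EGFR", "TP53"], "expr": [5.1, 2.0, 4.8]})
ddf = dd.from_pandas(pdf, npartitions=2)  # partitioned, lazily evaluated

mean_expr = ddf.groupby("gene")["expr"].mean()  # builds a task graph only
result = mean_expr.compute()                    # evaluation happens here
```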

Operations on Dask dataframes are the same as on Pandas dataframes, but they can utilize multiple workers and can scale up to clusters by connecting to a cluster client with minimal configuration. To enable this feature in OpenOmics, the user simply needs to explicitly enable an option when importing an omics dataset, importing an annotation/interaction database, or importing a MultiOmics file structure from disk.
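
As a hypothetical sketch of what enabling that option looks like (the import path, class name, and keywords below are assumptions based on the description above; consult the OpenOmics documentation for the exact API):

```python
# Hypothetical sketch -- names and keywords are assumptions, not the
# documented OpenOmics API.
from openomics import MessengerRNA  # assumed import path

mrna = MessengerRNA(
    data="expression_matrix.tsv",  # illustrative local file
    transpose=True,                # orient rows as samples (assumed flag)
    npartitions=8,                 # assumed option: load as a Dask dataframe
)
```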

## Software requirements

OpenOmics is distributed as a readily installable Python package on the Python Package Index (PyPI). When users install OpenOmics in their own Python environment, several software dependencies are automatically downloaded to reproduce the computing environment.
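
Assuming the PyPI package name matches the library name, installation is a one-liner:

```python
# From a shell, assuming the package is published on PyPI as "openomics":
#   pip install openomics
import openomics  # the import succeeds once the dependencies are installed
```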

OpenOmics is compatible with Python 3.6 or higher and is operational on both Linux and Windows operating systems. The software requires as little as 4 GB of RAM and 2 CPU cores, and it can computationally scale up to large-memory, multi-worker distributed systems such as a compute cluster. To take advantage of increased computational resources, OpenOmics requires only one line of code to activate parallel computing functionality.

## Development operations

We developed OpenOmics following modern software best practices and package publishing standards. For version control of our source code, we utilize a public GitHub repository with two branches, master and develop. The master branch contains stable and well-tested releases of the package, while the develop branch is used for building new features or software refactoring. Before each version is released, we utilize GitHub Actions for continuous integration, building, and testing for version and dependency compatibility. Our automated test suite covers essential functions of the package and a reasonable range of inputs and conditions.

# Conclusion

A standardized data preprocessing strategy is essential for the interpretation and exchange of bioinformatics research. OpenOmics provides researchers with the means to consistently describe the processing and analysis of their experimental datasets. It equips the user, a bioinformatician, with the ability to preprocess, query, and analyze data with modern, scalable software technology. As the wide array of tools and methods available in the public domain are largely isolated, OpenOmics aims toward a uniform framework that can effectively process and analyze multi-omics data in an end-to-end manner, along with biologist-friendly visualization and interpretation.

# Acknowledgements

N/A.

# References