---
title: 'OpenOmics: A bioinformatics API to integrate multi-omics datasets and interface with public databases.'
tags:
  - Python
  - bioinformatics
  - multiomics
  - data integration
  - big data
authors:
  - name: Nhat C. Tran^[corresponding author]
    orcid: 0000-0002-2575-9633
    affiliation: 1
  - name: Jean X. Gao
    affiliation: 1
affiliations:
  - name: Department of Computer Science and Engineering, The University of Texas at Arlington
    index: 1
date: 25 January 2021
bibliography: paper.bib
---

# Summary

Leveraging large-scale multi-omics data is emerging as the primary approach for systemic research of human diseases and general biological processes. Although data integration and feature engineering are vital steps in these bioinformatics projects, there is currently a lack of tools for standardized preprocessing of heterogeneous multi-omics and annotation data within the context of a clinical cohort. OpenOmics is a Python library for integrating heterogeneous multi-omics data and interfacing with popular public annotation databases, e.g., GENCODE, Ensembl, and BioGRID. The library is designed to be highly flexible, allowing the user to parameterize the construction of integrated datasets; interactive, to assist complex exploratory data analyses; and scalable, to facilitate working with large datasets on standard machines. In this paper, we demonstrate the software design choices that support the wide-ranging use cases of OpenOmics, with the goal of maximizing the usability and reproducibility of the data integration framework.

# Statement of need

Recent advances in sequencing technology and computational methods have enabled the generation of large-scale, high-throughput multi-omics data [@lappalainen2013transcriptome], providing unprecedented research opportunities for cancer and other diseases. These methods have already been applied to a number of problems within bioinformatics, including several integrative disease studies [@zhang2014proteogenomic; @cancer2014comprehensive; @ren2016integration; @hassan2020integration]. In addition to genome-wide measurements of different genetic characterizations, the growing public knowledge base of functional annotations [@rnacentral2016rnacentral; @derrien2012gencode], experimentally verified interactions [@chou2015mirtarbase; @yuan2013npinter; @chou2017mirtarbase; @oughtred2019biogrid], and gene-disease associations [@huang2018hmdd; @pinero2016disgenet; @chen2012lncrnadisease] provides the prior knowledge essential for system-level analyses. Leveraging these data sources allows for a systematic investigation of disease mechanisms at multiple molecular and regulatory layers; however, this task remains nontrivial due to the complexity of multi-omics data.

While researchers have developed several mature tools to access or analyze particular single-omic data types [@wolf2018scanpy; @stuart2019integrative], the current state of integrative data platforms for multi-omics data is lacking for three reasons. First, pipelines for data integration carry out sequential tasks that do not process multi-omics datasets holistically. Second, the vast size and heterogeneity of the data pose a challenge for the necessary data storage and computational processing. Third, implementations of data pipelines are close-ended for downstream analysis or not conducive to data exploration use cases. Additionally, there is a need for increased transparency in the process of multi-omics data integration, and a standardized data preprocessing strategy is important for the interpretation and exchange of bioinformatics projects. Currently, very few systems support standardized handling of multi-omics datasets while also allowing the user to query the integrated dataset within the context of a clinical cohort.

# Related works

Several existing platforms aid in the integration of multi-omics data, such as Galaxy, Anduril, MixOmics, and O-Miner. First, Galaxy [@boekel2015multi] and Anduril [@cervera2019anduril] are mature platforms with established workflow frameworks for genomic and transcriptomic data analysis. Galaxy contains hundreds of state-of-the-art tools in these core domains for processing and assembling high-throughput sequencing data. Second, MixOmics [@rohart2017mixomics] is an R library dedicated to the multivariate analysis of biological data sets, with a specific focus on data exploration, dimension reduction, and visualization. Third, O-Miner [@sangaralingam2019multi] is a web tool that provides a pipeline for the analysis of both transcriptomic and genomic data, starting from raw image files through in-depth bioinformatics analysis. However, as large-scale multi-omics analysis demands continue to grow, technologies and data analysis needs continually change to adapt to `big data`. For instance, multi-omics integration requires a multitude of complex data manipulation operations, but the point-and-click interface of existing Galaxy tools can be limiting or computationally inefficient. Although the MixOmics toolkit provides an R programming interface, it does not yet leverage high-performance distributed storage or computing resources. Finally, while O-Miner can perform end-to-end analysis in an integrated platform, its interim analysis results cannot be exported elsewhere for downstream analysis.

![Overall OpenOmics System Architecture, Data Flow, and Use Cases.\label{architecture}](figure.pdf)

# The OpenOmics library

OpenOmics consists of two core modules: multi-omics integration and the annotation interface. An overview of the OpenOmics system architecture is provided in \autoref{architecture}.

## Multi-omics integration

Tabular data are everywhere in bioinformatics. To record expression quantifications, annotations, or variant calls, data are typically stored in various tabular-like formats, such as BED, GTF, MAF, and VCF, which can be preprocessed and normalized to row-indexed formats. Given any processed single-omic dataset, the library generalizes the data as a tabular structure where rows correspond to observation samples and columns correspond to measurements of different biomolecules. The core functionality of the Multi-omics Integration module is to integrate multiple single-omic datasets over their overlapping samples. By generating multi-omics data for the same set of samples, our tool provides the necessary data structure to develop insights into the flow of biological information across the genome, epigenome, transcriptome, proteome, metabolome, and phenome levels. The user can import and integrate the following supported omic types (a usage sketch follows the list):

- Genomics: single nucleotide variants (SNV), copy number variation (CNV)
- Epigenomics: DNA methylation
- Transcriptomics: RNA-Seq, miRNA expression, lncRNA expression, microarrays
- Proteomics: reverse phase protein array (RPPA), iTRAQ
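
The snippet below is a minimal sketch of this workflow against a hypothetical TCGA-style cohort. The file paths are placeholders, and the class and parameter names (`MessengerRNA`, `MicroRNA`, `MultiOmics`, `add_omic`) follow the library's documented interface but should be verified against the installed version.

```python
from openomics import MessengerRNA, MicroRNA, MultiOmics

# Load two single-omic expression tables (hypothetical file paths).
# `transpose=True` flips gene-by-sample source files into sample-by-gene tables.
mrna = MessengerRNA(
    data="LUAD__geneExp.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",  # keep the gene index and sample columns
    gene_index="GeneSymbol",
)
mirna = MicroRNA(
    data="LUAD__miRNAExp__RPM.txt",
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="GeneSymbol",
)

# Integrate both omics over their overlapping samples.
luad = MultiOmics(cohort_name="LUAD")
luad.add_omic(mrna)
luad.add_omic(mirna)
```

Each added omic is aligned on the shared sample IDs, so downstream queries operate on a consistent cohort.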

After importing each single-omic dataset, OpenOmics stores it as a Pandas DataFrame [@mckinney-proc-scipy-2010], which is flexible for a wide range of tabular operations. For instance, the user is presented with several functions for preprocessing the expression quantifications to normalize values, filter outliers, or reduce noise.
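
Because each omic is held as a plain DataFrame, such preprocessing steps reduce to ordinary tabular operations. The helpers below are an illustrative pandas-level sketch (not part of the OpenOmics API), assuming a samples-by-genes expression table:

```python
import numpy as np
import pandas as pd

def log2_normalize(expressions: pd.DataFrame) -> pd.DataFrame:
    """Apply a log2(x + 1) transform to stabilize variance."""
    return np.log2(expressions + 1)

def drop_low_variance_genes(expressions: pd.DataFrame,
                            min_std: float = 0.1) -> pd.DataFrame:
    """Filter out near-constant gene columns, a simple form of noise reduction."""
    return expressions.loc[:, expressions.std(axis=0) > min_std]
```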

Within a study cohort, the clinical characteristics are crucial for the study of a disease or biological phenomenon. The user can characterize the set of samples using the Clinical Data structure, which comprises two levels: Patient and Biospecimen. A Patient can have attribute fields on demographics, clinical diagnosis, disease progression, treatment responses, and survival outcomes. Typically, multi-omics data observations are captured at the Biospecimen level, and each Patient can have multiple Biospecimens. OpenOmics tracks the IDs of biospecimens and the patients they belong to, so the multi-omics data are organized in a hierarchical order that enables aggregated operations.
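
A sketch of attaching clinical data to a cohort is shown below; the `ClinicalData` class, the `add_clinical_data` method, the module path, and the column names are assumptions modeled on the library's documented style.

```python
from openomics import MultiOmics
from openomics.clinical import ClinicalData  # assumed module path

# Patient-level table: one row per patient, keyed by a patient ID column.
clinical = ClinicalData(
    file="LUAD__clinical.tsv",            # hypothetical file path
    patient_index="bcr_patient_barcode",  # hypothetical patient ID column
)

luad = MultiOmics(cohort_name="LUAD")
luad.add_clinical_data(clinical)  # assumed method linking biospecimens to patients
```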

## Annotation interface

After importing and integrating the multi-omics data, the user can supplement their dataset with various annotation attributes from public data repositories such as GENCODE, Ensembl, and RNA Central. With just a few operations, the user can download a data repository of choice, select relevant attributes, and efficiently join a variable number of annotation columns to their genomics, transcriptomics, and proteomics data. The full list of databases and the availability of annotation attributes are listed in Table 1.

For each public database, the Annotation Interface module provides a series of interfaces to perform specific importing, preprocessing, and annotation tasks. At the import step, the module can either fetch the database files via a File Transfer Protocol (FTP) URL or load a locally downloaded file. At this step, the user can specify the species, genome build, and version of the database by providing an FTP URL of choice. To streamline this process, the module automatically caches downloaded files to disk, uncompresses them, and handles different file extensions, including FASTA, GTF, VCF, and other tabular formats. Then, at the preprocessing step, the module selects only the relevant attribute fields specified by the user and performs the necessary data cleaning. Finally, the annotation data can be joined to an omics dataset by performing a SQL-like join operation on a user-specified index of the biomolecule name or ID. If the user wishes to import an annotation database not yet included in OpenOmics, they can extend the Annotation Dataset API to specify their own importing, preprocessing, and annotation tasks in an object-oriented manner, as sketched below.
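
Continuing the earlier cohort sketch, the example below illustrates both tasks: annotating an omic with GENCODE and extending the interface with a custom database. The FTP URL, the file resource keys, and the `annotate_attributes` and `Database` names are assumptions written in the spirit of the documented API, not verified signatures.

```python
import pandas as pd
from openomics.database import GENCODE, Database  # names/paths assumed

# Import a specific GENCODE release over FTP; files are cached and uncompressed.
gencode = GENCODE(
    path="ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/",
    file_resources={"basic.annotation.gtf": "gencode.v32.basic.annotation.gtf.gz"},
)

# Join selected annotation columns onto the mRNA data by gene name.
luad.MessengerRNA.annotate_attributes(
    gencode, on="gene_name", columns=["gene_id", "gene_type"]
)

# A user-defined annotation source only needs to implement the loading step.
class MyAnnotationDB(Database):
    def load_dataframe(self, file_resources, **kwargs) -> pd.DataFrame:
        # Parse raw files into a table indexed by the biomolecule name or ID.
        return pd.read_table(file_resources["annotations.tsv"])
```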

An innovative feature of our integration module is the ability to cross-reference gene IDs between different annotation systems or data sources. When importing a dataset, the user can specify the level of the genomic index, such as the gene, transcript, protein, or peptide level, and whether it is a gene name or gene ID. Since multiple single-omics datasets can use different gene nomenclatures, the user is able to convert between the different gene indexing methods by reindexing the annotation dataframe with an index column of choice. This not only allows the Annotation Interface to select and join the annotation data at the correct index level, but also allows the user to customize the selection and aggregation of biological measurements at different levels.
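
Conceptually, the conversion is a reindex-and-rename over the annotation table. The following standalone pandas sketch (illustrative, not the OpenOmics implementation) converts a samples-by-genes table from gene names to gene IDs:

```python
import pandas as pd

def convert_gene_index(expressions: pd.DataFrame,
                       annotation: pd.DataFrame,
                       from_col: str = "gene_name",
                       to_col: str = "gene_id") -> pd.DataFrame:
    """Rename gene columns of a samples-by-genes table using an annotation mapping."""
    mapping = annotation.set_index(from_col)[to_col].to_dict()
    renamed = expressions.rename(columns=mapping)
    # Drop genes that had no match in the target nomenclature.
    return renamed.loc[:, renamed.columns.isin(set(mapping.values()))]
```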

| Data Repository | Annotation Data Available                        | Index     | # Entries  |
| --------------- | ------------------------------------------------ | --------- | ---------- |
| GENCODE         | Genomic annotations, primary sequence            | RNAs      | 60,660     |
| Ensembl         | Genomic annotations                              | Genes     | 232,186    |
| miRBase         | MicroRNA sequences and annotations               | MicroRNAs | 38,589     |
| RNA Central     | ncRNA sequence and annotation collection         | ncRNAs    | 14,784,981 |
| NONCODE         | lncRNA sequences and annotations                 | LncRNAs   | 173,112    |
| lncRNAdb        | lncRNA functional annotations                    | LncRNAs   | 100        |
| Pfam            | Protein family annotations                       | Proteins  | 18,259     |
| Rfam            | RNA family annotations                           | ncRNAs    | 2,600      |
| Gene Ontology   | Functional, cellular, and molecular annotations  | Genes     | 44,117     |
| KEGG            | High-level functional pathways                   | Genes     | 22,409     |
| DisGeNET        | Gene-disease associations                        | Genes     | 1,134,942  |
| HMDD            | MicroRNA-disease associations                    | MicroRNAs | 35,547     |
| LncRNADisease   | lncRNA-disease associations                      | LncRNAs   | 3,000      |
| OMIM            | Ontology of human diseases                       | Diseases  | 25,670     |

Table 1: Public annotation databases and the availability of annotation data for the human genome.

# System design

This section describes the implementation details behind the scalable processing and efficient data storage, as well as the design choices in our development operations.

While the in-memory Pandas dataframes utilized in our data structures are fast, they have size and speed limitations when the dataset size approaches the system memory limit. When this is an issue, the user can enable out-of-memory distributed data processing on all OpenOmics operations, implemented with the Dask framework [@matthew_rocklin-proc-scipy-2015]. When memory resources are limited, data in a Dask dataframe can be read directly from disk and is only brought into memory when needed during computations. When performing data query operations on Dask dataframes, a task graph containing each operation is built and only evaluated on command, a process known as lazy evaluation.
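
As a small illustration of this deferred-execution model (standard `dask.dataframe` usage, not OpenOmics-specific code):

```python
import dask.dataframe as dd

# Reading builds a task graph; no data is loaded into memory yet.
expressions = dd.read_csv("expressions-*.csv")

# Operations are recorded lazily as additional tasks in the graph.
mean_per_gene = expressions.groupby("gene_id")["value"].mean()

# Only .compute() triggers execution, streaming partitions through memory.
result = mean_per_gene.compute()
```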

Operations on Dask dataframes are the same as those on Pandas dataframes, but they can utilize multiple workers and can scale up to clusters by connecting to a cluster client with minimal configuration. To enable this feature in OpenOmics, the user simply needs to enable an option when importing an omics dataset, importing an annotation/interaction database, or importing a MultiOmics file structure on disk, as sketched below.
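
The `npartitions` option shown here follows the pattern used in the library's documentation for switching to Dask-backed dataframes, but its name and placement should be checked against the installed version.

```python
from openomics import MessengerRNA

# Passing a partition count (assumed option) loads the table as a Dask
# dataframe instead of an in-memory Pandas dataframe.
mrna = MessengerRNA(
    data="LUAD__geneExp.txt",   # hypothetical file path
    transpose=True,
    usecols="GeneSymbol|TCGA",
    gene_index="GeneSymbol",
    npartitions=8,              # assumed flag enabling Dask-backed processing
)
```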

## Software requirements

OpenOmics is distributed as a readily installable Python package on the Python Package Index (PyPI). When users install OpenOmics in their own Python environment, several software dependencies are automatically downloaded to reproduce the computing environment.

OpenOmics is compatible with Python 3.6 or higher and is operational on both Linux and Windows operating systems. The software requires as little as 4 GB of RAM and 2 CPU cores, and it can computationally scale up to large-memory, multi-worker distributed systems such as a compute cluster. To take advantage of increased computational resources, OpenOmics requires only one line of code to activate its parallel computing functionalities.

## Development operations

We developed OpenOmics following modern software best practices and package publishing standards. For version control of our source code, we utilize a public GitHub repository with two branches, master and develop. The master branch contains stable and well-tested releases of the package, while the develop branch is used for building new features or refactoring the software. Before each version release, we use GitHub Actions for continuous integration, building, and testing for version and dependency compatibility. Our automated test suite covers the essential functions of the package and a reasonable range of inputs and conditions.

# Conclusion

A standardized data preprocessing strategy is essential for the interpretation and exchange of bioinformatics research. OpenOmics provides researchers with the means to consistently describe the processing and analysis of their experimental datasets. It equips the user, a bioinformatician, with the ability to preprocess, query, and analyze data with modern, scalable software technology. As the wide array of tools and methods available in the public domain are largely isolated, OpenOmics aims toward a uniform framework that can effectively process and analyze multi-omics data in an end-to-end manner, along with biologist-friendly visualization and interpretation.

# Acknowledgements

N/A.

# References