Diff of /inst/paper.md [000000] .. [548210]

--- a
+++ b/inst/paper.md
@@ -0,0 +1,209 @@
+---
+title: 'OpenOmics: A bioinformatics API to integrate multi-omics datasets and interface with public databases.'
+tags:
+  - Python
+  - bioinformatics
+  - multiomics
+  - data integration
+  - big data
+authors:
+  - name: Nhat C. Tran^[corresponding author]
+    orcid: 0000-0002-2575-9633
+    affiliation: 1
+  - name: Jean X. Gao
+    affiliation: 1
+affiliations:
+  - name: Department of Computer Science and Engineering, The University of Texas at Arlington
+    index: 1
+date: 25 January 2021
+bibliography: paper.bib
+---
+
+# Summary
+
+Leveraging large-scale multi-omics data is emerging as the primary approach for the systematic study of human diseases
+and general biological processes. Although data integration and feature engineering are vital steps in these
+bioinformatics projects, there is currently no tool for standardized preprocessing of heterogeneous multi-omics and
+annotation data within the context of a clinical cohort. OpenOmics is a Python library for integrating heterogeneous
+multi-omics data
+and interfacing with popular public annotation databases, e.g., GENCODE, Ensembl, BioGRID. The library is designed to be
+highly flexible to allow the user to parameterize the construction of integrated datasets, interactive to assist complex
+data exploratory analyses, and scalable to facilitate working with large datasets on standard machines. In this paper,
+we demonstrate the software design choices to support the wide-ranging use cases of OpenOmics with the goal of
+maximizing usability and reproducibility of the data integration framework.
+
+# Statement of need
+
+Recent advances in sequencing technology and computational methods have enabled the generation of large-scale,
+high-throughput multi-omics data [@lappalainen2013transcriptome], providing unprecedented research opportunities for
+cancer and other diseases. These methods have already been applied to a number of problems within bioinformatics,
+including several integrative disease
+studies [@zhang2014proteogenomic; @cancer2014comprehensive; @ren2016integration; @hassan2020integration]. In addition to
+the genome-wide measurements of different genetic characterizations, the growing public knowledge-base of functional
+annotations [@rnacentral2016rnacentral; @derrien2012gencode], experimentally-verified
+interactions [@chou2015mirtarbase; @yuan2013npinter; @chou2017mirtarbase; @oughtred2019biogrid], and gene-disease
+associations [@huang2018hmdd; @pinero2016disgenet; @chen2012lncrnadisease] also provides the prior knowledge essential
+for system-level analyses. Leveraging these data sources allows for a systematic investigation of disease mechanisms at
+multiple molecular and regulatory layers; however, such a task remains nontrivial due to the complexity of multi-omics
+data.
+
+While researchers have developed several mature tools to access or analyze a particular single-omic data
+type [@wolf2018scanpy; @stuart2019integrative], the current state of integrative data platforms for multi-omics data is
+lacking for three reasons. First, pipelines for data integration carry out sequential tasks that do not process
+multi-omics datasets holistically. Second, the vast size and heterogeneity of the data pose a challenge for data storage
+and computational processing. And third, implementations of data pipelines are closed-ended for downstream analysis or
+not conducive to data-exploration use cases. Additionally, there is currently a need for increased transparency in the
+process of multi-omics data integration, and a standardized data preprocessing strategy is important for the
+interpretation and exchange of bioinformatics projects. Currently, very few systems both support standardized handling
+of multi-omics datasets and allow the integrated dataset to be queried within the context of a clinical cohort.
+
+# Related work
+
+Several existing platforms aid in the integration of multi-omics data, such as Galaxy, Anduril, MixOmics, and O-Miner.
+First, Galaxy [@boekel2015multi] and Anduril [@cervera2019anduril] are mature platforms with established workflow
+frameworks for genomic and transcriptomic data analysis. Galaxy contains hundreds of state-of-the-art tools in these
+core domains for processing and assembling high-throughput sequencing data. Second, MixOmics [@rohart2017mixomics] is
+an R library dedicated to the multivariate analysis of biological data sets, with a specific focus on data exploration,
+dimension reduction, and visualisation. Third, O-Miner [@sangaralingam2019multi] is a web tool that provides a pipeline
+for the analysis of both transcriptomic and genomic data, starting from raw image files through in-depth bioinformatics
+analysis. However, as large-scale multi-omics analysis demands continue to grow, technologies and data analysis needs
+continually change to adapt to big data. For instance, multi-omics integration requires a multitude of complex data
+manipulation operations, but the point-and-click interface of existing Galaxy tools can be limiting or not
+computationally efficient. Although the MixOmics toolkit provides an R programming interface, it does not yet leverage
+high-performance distributed storage or computing resources. Finally, while O-Miner can perform end-to-end analysis in
+an integrated platform, its interim analysis results cannot be exported elsewhere for downstream analysis.
+
+![Overall OpenOmics System Architecture, Data Flow, and Use Cases.\label{architecture}](figure.pdf)
+
+# The OpenOmics library
+
+OpenOmics consists of two core modules: multi-omics integration and annotation interface. An overview visualization of
+the OpenOmics system architecture is provided in \autoref{architecture}.
+
+## Multi-omics integration
+
+Tabular data are everywhere in bioinformatics. To record expression quantifications, annotations, or variant calls,
+data are typically stored in various tabular-like formats, such as BED, GTF, MAF, and VCF, which can be preprocessed
+and normalized to row-indexed formats. Given any processed single-omics dataset, the library generalizes the data as a
+tabular structure where rows correspond to observation samples and columns correspond to measurements of different
+biomolecules. The core functionality of the Multi-omics Integration module is to integrate the multiple single-omics
+datasets over their overlapping samples; a minimal sketch of this step follows the list below. By generating
+multi-omics data for the same set of samples, our tool provides the necessary data structure to develop insights into
+the flow of biological information across the genome, epigenome, transcriptome, proteome, metabolome, and phenome
+levels. The user can import and integrate the following supported omics types:
+
+- Genomics: single nucleotide variants (SNV), copy number variation (CNV)
+- Epigenomics: DNA methylation
+- Transcriptomics: RNA-Seq, miRNA expression, lncRNA expression, microarrays
+- Proteomics: reverse phase protein array (RPPA), iTRAQ
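+
+To illustrate the integration step, the following minimal sketch uses plain Pandas on two hypothetical expression
+tables; the file names are placeholders, and OpenOmics' own import classes wrap this logic together with the
+preprocessing described below.
+
+```python
+import pandas as pd
+
+# Two single-omics tables, each with rows indexed by biospecimen ID.
+# The file paths are illustrative placeholders.
+mrna = pd.read_csv("LUAD__geneExp.txt", sep="\t", index_col=0)
+mirna = pd.read_csv("LUAD__miRNAExp.txt", sep="\t", index_col=0)
+
+# Integrate on the overlapping samples: keep only the biospecimens
+# measured in both omics layers.
+common = mrna.index.intersection(mirna.index)
+multiomics = {"MessengerRNA": mrna.loc[common], "MicroRNA": mirna.loc[common]}
+```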
+
+After importing each single-omics dataset, OpenOmics stores it as a Pandas DataFrame [@mckinney-proc-scipy-2010] that
+is flexible for a wide range of tabular operations. For instance, the user is presented with several functions for
+preprocessing the expression quantifications, e.g., to normalize values, filter outliers, or reduce noise.
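+
+Continuing the sketch above, a typical preprocessing pass might log-transform the expression values and drop
+near-constant genes; these are generic Pandas/NumPy operations with arbitrary thresholds, not OpenOmics defaults.
+
+```python
+import numpy as np
+
+# Log2-transform expression values and filter out low-variance genes.
+expressions = np.log2(multiomics["MessengerRNA"] + 1)
+expressions = expressions.loc[:, expressions.std() > 0.1]
+```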
+
+Within a study cohort, the clinical characteristics are crucial for the study of a disease or biological phenomenon.
+The user can characterize the set of samples using the Clinical Data structure, which comprises two levels: Patient and
+Biospecimen. A Patient can have attribute fields on demographics, clinical diagnosis, disease progression, treatment
+responses, and survival outcomes. Typically, multi-omics data observations are captured at the Biospecimen level, and
+each Patient can have multiple Biospecimens. OpenOmics tracks the IDs of biospecimens and the patient each belongs to,
+so the multi-omics data are organized hierarchically to enable aggregated operations.
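+
+This two-level hierarchy can be pictured with the self-contained sketch below; all IDs, clinical fields, and
+measurements are fabricated for illustration, and the final line shows one possible aggregated operation, averaging
+biospecimen-level measurements per patient.
+
+```python
+import pandas as pd
+
+# Biospecimen-level clinical table linking each biospecimen to a patient.
+clinical = pd.DataFrame(
+    {"patient_id": ["P01", "P01", "P02"],
+     "vital_status": ["alive", "alive", "deceased"]},
+    index=pd.Index(["P01-B1", "P01-B2", "P02-B1"], name="biospecimen_id"),
+)
+
+# A toy biospecimen-level measurement table on the same index.
+omics = pd.DataFrame({"GENE1": [1.0, 3.0, 2.0], "GENE2": [0.5, 0.7, 0.9]},
+                     index=clinical.index)
+
+# Aggregated operation: average the biospecimen measurements per patient.
+per_patient = omics.join(clinical["patient_id"]).groupby("patient_id").mean()
+```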
+
+## Annotation interface
+
+After importing and integrating the multi-omic data, the user can supplement their dataset with various annotation
+attributes from public data repositories such as GENCODE, Ensembl, and RNA Central. With just a few operations, the user
+can easily download a data repository of choice, select relevant attributes, and efficiently join a variable number of
+annotation columns to their genomics, transcriptomics, and proteomics data. The databases and their available
+annotation attributes are listed in Table 1.
+
+For each public database, the Annotation Interface module provides a series of interfaces to perform specific importing,
+preprocessing, and annotation tasks. At the import step, the module can either fetch the database files via a
+file-transfer-protocol (FTP) URL or load a locally downloaded file. At this step, the user can specify the species,
+genome build, and version of the database by providing an FTP URL of choice. To streamline this process, the module
+automatically caches downloaded files to disk, uncompresses them, and handles different file formats, including FASTA,
+GTF, VCF, and other tabular formats. Then, at the preprocessing step, the module selects only the relevant attribute
+fields specified by the user and performs the necessary data cleaning. Finally, the annotation data can be joined to an
+omics dataset with a SQL-like join operation on a user-specified index of the biomolecule name or ID. If the
+user wishes to import an annotation database not yet included in OpenOmics, they can extend the Annotation Dataset API
+to specify their own importing, preprocessing, and annotation tasks in an object-oriented manner.
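+
+As an illustration of the final join step, the sketch below attaches two annotation columns to the expression table
+from the earlier sketches, assuming its columns are gene names. The GTF-derived file and its column names are
+assumptions made for this example; in OpenOmics, the fetching, caching, and attribute selection described above are
+handled by the database interface classes.
+
+```python
+import pandas as pd
+
+# Assumed: gene records parsed from a cached GENCODE GTF file, with
+# columns such as "gene_id", "gene_name", and "gene_type".
+gtf = pd.read_csv("gencode_genes.tsv", sep="\t")
+annotations = gtf.set_index("gene_name")[["gene_id", "gene_type"]]
+
+# SQL-like left join of the selected annotation columns onto the gene
+# axis of the expression matrix (transposed so that genes are rows).
+annotated = expressions.T.join(annotations, how="left")
+```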
+
+An innovative feature of our integration module is the ability to cross-reference gene IDs between different
+annotation systems or data sources. When importing a dataset, the user can specify the level of the genomic index, such
+as the gene, transcript, protein, or peptide level, and whether it is a gene name or gene ID. Since multiple
+single-omics datasets can use different gene nomenclatures, the user is able to convert between the different gene
+indexing methods by reindexing the annotation dataframe with an index column of choice. This not only allows the
+Annotation Interface to select and join the annotation data at the correct index level, but also allows the user to
+customize the selection and aggregation of biological measurements at different levels.
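+
+A minimal sketch of such a conversion, reusing the assumed `gtf` table from the previous example: build a mapping from
+Ensembl gene IDs to gene names by reindexing the annotation dataframe, assuming a dataset whose columns currently hold
+gene IDs.
+
+```python
+# Build a gene_id -> gene_name mapping from the annotation table, then
+# relabel the expression columns from one nomenclature to the other.
+id_to_name = gtf.set_index("gene_id")["gene_name"].to_dict()
+expressions_by_name = expressions.rename(columns=id_to_name)
+```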
+
+| Data Repository | Annotation Data Available                       | Index     | # entries  |
+| --------------- | ----------------------------------------------- | --------- | ---------- |
+| GENCODE         | Genomic annotations, primary sequence           | RNAs      | 60,660     |
+| Ensembl         | Genomic annotations                             | Genes     | 232,186    |
+| MiRBase         | MicroRNA sequences and annotations              | MicroRNAs | 38,589     |
+| RNA Central     | ncRNA sequence and annotation collection        | ncRNAs    | 14,784,981 |
+| NONCODE         | lncRNA sequences and annotations                | LncRNAs   | 173,112    |
+| lncrnadb        | lncRNA functional annotations                   | LncRNAs   | 100        |
+| Pfam            | Protein family annotation                       | Proteins  | 18,259     |
+| Rfam            | RNA family annotations                          | ncRNAs    | 2,600      |
+| Gene Ontology   | Functional, cellular, and molecular annotations | Genes     | 44,117     |
+| KEGG            | High-level functional pathways                  | Genes     | 22,409     |
+| DisGeNet        | Gene-disease associations                       | Genes     | 1,134,942  |
+| HMDD            | microRNA-disease associations                   | MicroRNAs | 35,547     |
+| lncRNAdisease   | lncRNA-disease associations                     | LncRNAs   | 3,000      |
+| OMIM            | Ontology of human diseases                      | Diseases  | 25,670     |
+
+Table 1: Public annotation databases and availability of data for the human genome.
+
+# System design
+
+This section describes the implementation details behind the scalable processing and efficient data storage, as well
+as the design choices in our development operations.
+
+While the in-memory Pandas dataframes utilized in our data structures are fast, they have size and speed limitations
+when the dataset size approaches the system memory limit. When this is an issue, the user can enable out-of-memory
+distributed data processing on all OpenOmics operations, implemented with the Dask
+framework [@matthew_rocklin-proc-scipy-2015]. When memory resources are limited, data in a Dask dataframe can be read
+directly from disk and is only brought into memory when needed during computations. When performing data query
+operations on Dask dataframes, a task graph containing each operation is built and is only evaluated on command, a
+process known as lazy evaluation.
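+
+The sketch below demonstrates this lazy-evaluation pattern with generic Dask calls (not OpenOmics-specific ones); the
+file path is a placeholder.
+
+```python
+import dask.dataframe as dd
+
+# Lazily point at a large on-disk table; nothing is read into memory yet.
+ddf = dd.read_csv("LUAD__geneExp.txt", sep="\t")
+
+# Queries only extend the task graph; compute() triggers the evaluation.
+mean_per_gene = ddf.mean(numeric_only=True)
+result = mean_per_gene.compute()
+```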
+
+Operations on Dask dataframes mirror those on Pandas dataframes, but can utilize multiple workers and can scale up to
+clusters by connecting to a cluster client with minimal configuration. To enable this feature in OpenOmics, the user
+simply enables an option when importing an omics dataset, importing an annotation/interaction database, or importing a
+MultiOmics file structure on disk.
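+
+For reference, connecting Dask to a cluster is itself a one-liner; the scheduler address below is a placeholder, and
+omitting it starts a local cluster instead.
+
+```python
+from dask.distributed import Client
+
+# Attach to a running Dask scheduler; subsequent Dask computations are
+# executed by that cluster's workers.
+client = Client("tcp://scheduler-host:8786")
+```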
+
+## Software requirements
+
+OpenOmics is distributed as a readily installable Python package on the Python Package Index (PyPI) repository. When
+users install OpenOmics in their own Python environment, its software dependencies are automatically downloaded to
+reproduce the computing environment.
+
+OpenOmics is compatible with Python 3.6 or higher and is operational on both Linux and Windows operating systems. The
+software requires as little as 4 GB of RAM and 2 CPU cores, and can computationally scale up to large-memory,
+multi-worker distributed systems such as a compute cluster. To take advantage of increased computational resources,
+OpenOmics requires only one line of code to activate its parallel computing functionalities.
+
+## Development operations
+
+We developed OpenOmics following modern software best practices and package publishing standards. For the version
+control of our source code, we utilize a public GitHub repository that contains two branches, master and develop. The
+master branch contains stable and well-tested releases of the package, while the develop branch is used for building
+new features or refactoring the software. Before each version is released, we use GitHub Actions for continuous
+integration, building, and testing for version and dependency compatibility. Our automated test suite covers essential
+functions of the package and a reasonable range of inputs and conditions.
+
+# Conclusion
+
+A standardized data preprocessing strategy is essential for the interpretation and exchange of bioinformatics research.
+OpenOmics provides researchers with the means to consistently describe the processing and analysis of their experimental
+datasets. It equips the user, a bioinformatician, with the ability to preprocess, query, and analyze data with modern
+and scalable software technology. As the wide array of tools and methods available in the public domain are largely
+isolated, OpenOmics aims toward a uniform framework that can effectively process and analyze multi-omics data in an
+end-to-end manner along with biologist-friendly visualization and interpretation.
+
+# Acknowledgements
+
+N/A.
+
+# References