Diff of /docs/storage.rst [000000] .. [d7cf27]

--- a
+++ b/docs/storage.rst
@@ -0,0 +1,427 @@
+.. _storage:
+
+================
+Genomic Datasets
+================
+
+One of the central features of Janggu is its genomic datasets :code:`Cover` and
+:code:`Bioseq`. On the one hand, they allow
+quick and flexible access to genomics data, including the **FASTA**,
+**BAM**, **BIGWIG**, **BED** and **GFF** file formats, bridging the gap
+between raw file formats
+and the numpy inputs required for python-based deep learning models (e.g. keras).
+On the other hand, predictions from a deep learning library are again numpy
+arrays. Janggu facilitates the conversion between numpy arrays and :code:`Cover` objects
+in order to associate the predictions with the respective genomic coordinates.
+Finally, coverage information may be exported to BIGWIG files or inspected directly
+via genome browser-like plots.
+
+
+General principle of the genomic datasets
+------------------------------------------
+Internally, the genomic datasets maintain coverage or
+sequence information along with the associated genomic intervals.
+Externally, the datasets behave similarly to a numpy array. This
+makes it possible to directly consume the datasets using keras, for instance.
+
+In this section, we briefly describe the internal organisation of these datasets.
+The classes :code:`Cover` and :code:`Bioseq` each maintain a
+:code:`GenomicArray` and a :code:`GenomicIndexer` object.
+:code:`GenomicArray` is a general data structure that holds numeric
+information about the genome, for instance read count coverage.
+It can be accessed via
+genomic coordinates (e.g. chromosome name, start and end) and returns
+the respective data as a numpy array.
+The :code:`GenomicIndexer` maintains the list of genomic intervals
+that should be traversed during training/evaluation.
+It can be accessed by an integer-valued
+index and returns the associated genomic coordinates.
+
+When querying the i-th region from :code:`Cover` or :code:`Bioseq`, the index is passed
+to the :code:`GenomicIndexer`, which yields the genomic coordinates
+that are passed on to the :code:`GenomicArray`.
+The result is returned in numpy format.
+Similarly, the dataset objects
+also support slicing and indexing via a list of indices, which is usually relevant
+when using mini-batch learning.
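+
+For example, individual regions and mini-batches can be read out like slices of a
+numpy array. A minimal sketch (the variable names follow the examples below):
+
+.. code-block:: python
+
+   cover = Cover.create_from_bam('track', bamfiles=samplefile, roi=roi)
+
+   len(cover)        # number of regions in the GenomicIndexer
+   cover.shape       # overall dataset dimensions
+
+   cover[0]          # numpy array for the first region
+   cover[[0, 1, 2]]  # mini-batch of the first three regions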
+
+
+Normalization
+-------------
+Upon creation of a :code:`Cover` object, normalization of the raw data might be required,
+for instance, to make coverage tracks comparable across replicates or experiments.
+To this end, :code:`create_from_bam`, :code:`create_from_bigwig`
+and :code:`create_from_bed` expose a :code:`normalizer` option.
+Janggu already implements various normalization methods, which can be invoked by name,
+e.g. TPM (transcripts per million) normalization:
+
+.. code-block:: python
+
+   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi, normalizer='tpm')
+
+Other preprocessing and normalization options are :code:`zscore`, :code:`zscorelog`, :code:`binsizenorm` and :code:`perctrim`.
+The latter two normalize for read depth and trim the signal intensities at the 99th percentile, respectively.
+
+Normalizers can also be supplied as callables and/or combined with other transformations.
+For instance, suppose we want to trim the outliers at the 95th percentile instead and
+subsequently apply the z-score transformation. Then we could use
+
+.. code-block:: python
+
+   from janggu.data import PercentileTrimming
+   from janggu.data import ZScore
+
+   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
+                         normalizer=[PercentileTrimming(95), ZScore()])
+
+
+It might be necessary to determine the normalization parameters on one dataset and apply the same
+transformation to other datasets. For instance, in the case of the z-score, we might want to keep
+the mean and standard deviation that were obtained from, say, the training set, and reuse them
+to normalize the test set.
+This is possible by creating a single :code:`ZScore` object that is used multiple times.
+At the first invocation, the mean and standard deviation are determined and the transformation
+is applied. Subsequent invocations compute the z-score using the predetermined mean and standard deviation.
+For example:
+
+.. code-block:: python
+
+   from janggu.data import ZScore
+
+   zscore = ZScore()
+
+   # First, mean and std will be determined.
+   # Then zscore transformation is applied.
+   Cover.create_from_bam('track_train', bamfiles=samplefile, roi=roitrain,
+                         normalizer=[zscore])
+
+   # Subsequently, zscore transformation is applied with
+   # the same mean and std determined from the training set.
+   Cover.create_from_bam('track_test', bamfiles=samplefile, roi=roitest,
+                         normalizer=[zscore])
+
+In case a different normalization procedure is required that is not contained in janggu,
+it is possible to define a custom normalizer as follows:
+
+.. code-block:: python
+
+   def custom_normalizer(genomicarray):
+
+      # perform the normalization on genomicarray
+
+      return genomicarray
+
+The currently implemented normalizers may be a good starting point
+for this purpose.
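+
+As a concrete illustration, a hypothetical log-transform normalizer might look
+as follows. This is only a sketch: it assumes that the genomic array exposes its
+per-chromosome data via a dict-like :code:`handle` attribute, as the built-in
+normalizers do; consult the janggu source for the exact interface.
+
+.. code-block:: python
+
+   import numpy as np
+
+   def log_normalizer(genomicarray):
+
+      # assumption: handle maps chromosome names to numpy arrays
+      for chrom in genomicarray.handle:
+         genomicarray.handle[chrom][:] = np.log1p(genomicarray.handle[chrom][:])
+
+      return genomicarray
+
+   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
+                         normalizer=log_normalizer)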
+
+
+Granularity of the coverage
+----------------------------
+
+Depending on the application, a different granularity of the
+coverage data might be required. For instance, one might be interested in
+nucleotide-resolution coverage for one purpose and 50 base-pair resolution bins for another.
+Furthermore, in some cases the signal of variable-size regions might be of interest, for
+example the read counts across gene bodies, to measure gene expression levels.
+
+These adjustments can be made when invoking :code:`create_from_bam`,
+:code:`create_from_bigwig` and :code:`create_from_bed`
+using an appropriate region of interest (ROI) file in conjunction
+with the :code:`resolution` and :code:`collapser` parameters.
+
+First, the resolution parameter controls the coverage granularity.
+For example, base-pair and 50-base-pair resolution would be obtained using
+
+.. code-block:: python
+
+   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
+                         resolution=1)
+
+   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
+                         resolution=50)
+
+.. sidebar:: janggu-trim
+
+  When using N base-pair resolution with :code:`N>1` in conjunction with the
+  option :code:`store_whole_genome=True`, the start and end coordinates of the
+  region of interest must be divisible by the resolution. Otherwise, undesired
+  rounding effects might occur. Suitable intervals can be produced with
+  :code:`janggu-trim`; see the section on command line tools.
+
+In case the signal intensity should be summarized across the entire interval,
+specify :code:`resolution=None`.
+For instance, if the region of interest contains a set of variable-length
+gene bodies, the total read count per gene can be obtained using
+
+.. code-block:: python
+
+   Cover.create_from_bam('genes',
+                         bamfiles=samplefile,
+                         roi=geneannot,
+                         resolution=None)
+
+It is also possible to use :code:`resolution=None` in conjunction with e.g. :code:`binsize=200`,
+which has the same effect as choosing :code:`binsize=resolution=200`.
+
+Whenever we deal with :code:`resolution>1`, an aggregation operation needs to be performed
+to summarize the signal intensity across each window. For instance, for
+:code:`create_from_bam`, the reads are summed within each interval.
+
+For :code:`create_from_bigwig` and :code:`create_from_bed`,
+it is possible to adjust the collapser. For example, 'mean' or 'sum' aggregation
+can be applied by name or by handing over a callable:
+
+.. code-block:: python
+
+   import numpy as np
+
+   Cover.create_from_bigwig('bwtrack',
+                            bigwigfiles=samplefile,
+                            roi=roi,
+                            resolution=50,
+                            collapser='mean')
+
+   Cover.create_from_bigwig('bwtrack',
+                            bigwigfiles=samplefile,
+                            roi=roi,
+                            resolution=50,
+                            collapser=np.sum)
+
+
+Moreover, more specialized aggregations may
+require a custom collapser function. In that case,
+it is important to note that the function receives a 3D numpy array and
+the aggregation should be performed across the second dimension.
+For example:
+
+.. code-block:: python
+
+   def custom_collapser(numpyarray):
+
+      # Initially, the dimensions of numpyarray correspond to
+      # (intervallength // resolution, resolution, strand)
+
+      numpyarray = numpyarray.sum(axis=1)
+
+      # Subsequently, we return the array of shape
+      # (intervallength // resolution, strand)
+
+      return numpyarray
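+
+The custom collapser can then be handed over like the built-in ones:
+
+.. code-block:: python
+
+   Cover.create_from_bigwig('bwtrack',
+                            bigwigfiles=samplefile,
+                            roi=roi,
+                            resolution=50,
+                            collapser=custom_collapser)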
+
+
+Caching
+--------
+
+The construction of a genomic dataset, including loading and preprocessing,
+might require a significant amount of time.
+In order to avoid having to create the coverage profiles from scratch each time you want
+to use them, they can be cached and quickly reloaded
+later.
+Caching is activated via the option :code:`cache=True`.
+When caching is enabled, janggu checks for changes in the
+file content, the file composition and various dataset-specific arguments
+(e.g. binsize, resolution) by constructing a SHA256 hash. The dataset is
+built from scratch if no cache entry for the determined hash exists; otherwise,
+it is restored from the cache.
+
+Example:
+
+.. code:: python
+
+   # build the dataset for hg19 if no cache file exists yet;
+   # otherwise, reload it from the cache.
+   Bioseq.create_from_refgenome('dna', refgenome, order=1, cache=True)
+
+
+Dataset storage
+---------------
+
+Storage options
+===============
+Depending on the structure of the dataset, the memory required to store the data
+and the available memory on your machine, different storage options are available
+for the genomic datasets: as a **numpy array**, as a **sparse array** or as an **hdf5 dataset**.
+To this end, :code:`create_from_bam`, :code:`create_from_bigwig`,
+:code:`create_from_bed`, :code:`create_from_seq`
+and :code:`create_from_refgenome` expose the `storage` option, which may be set to 'ndarray',
+'sparse' or 'hdf5'.
+
+'ndarray' offers perhaps the fastest access time,
+but is also the most memory-demanding option for storing the data.
+It might be useful for dense datasets and for relatively small datasets that conveniently
+fit into memory.
+
+If the data is sparse, the option `sparse` yields a good compromise between access time
+and memory footprint. In that case, the data is stored in its compressed sparse form and converted
+to a dense representation when querying mini-batches.
+This option may be used to store e.g. genome-wide ChIP-seq peak profiles,
+since peaks occur relatively rarely.
+
+Finally, if the data is too large to be kept in memory, the option
+`hdf5` allows the data to be consumed directly from disk. While
+the access time for processing data from hdf5 files may be higher,
+it allows huge datasets to be processed with a small amount of RAM.
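+
+For example, hdf5 storage can be selected as follows. The sketch assumes that
+caching is enabled alongside, since the hdf5 file is kept on disk:
+
+.. code-block:: python
+
+   Cover.create_from_bam('track',
+                         bamfiles=samplefile,
+                         roi=roi,
+                         storage='hdf5',
+                         cache=True)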
+
+Whole and partial genome storage
+================================
+
+:code:`Cover` and :code:`Bioseq` further allow maintaining coverage and sequence information
+for the entire genome or only for the part that is actively consumed during training.
+This behaviour is configured via :code:`store_whole_genome=True/False`.
+
+In most situations, the user may find it convenient to set `store_whole_genome=False`.
+In that case, when loading :code:`Cover` and :code:`Bioseq`, only information overlapping
+the region of interest will be gathered. The advantage is that no overhead
+of information needs to be stored when only a small part of the genome is of interest
+for consumption.
+
+On the other hand, `store_whole_genome=True` might be advantageous
+for the following purposes (see the example below):
+
+1. If a large part of the genome is consumed for training/evaluation.
+2. If, in addition, the `stepsize` for traversing the genome is smaller than `binsize`, since then mutually overlapping intervals do not have to be stored redundantly.
+3. If the same genomic array is to be shared between different tasks. For example, during training and testing, different parts of the same genomic array may be consumed.
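+
+For example, the whole-genome variant is selected as follows:
+
+.. code-block:: python
+
+   Cover.create_from_bam('track',
+                         bamfiles=samplefile,
+                         roi=roi,
+                         store_whole_genome=True)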
+
+
+
+Converting Numpy to Cover
+-------------------------
+
+When performing predictions, e.g. with a keras model,
+the output corresponds to an ordinary numpy array.
+In order to re-establish the association of the predicted values
+with the genomic coordinates, **Cover** exposes the constructor `create_from_array`.
+Upon invocation, a new :code:`Cover` object is composed that holds the predicted values.
+These predictions may subsequently be illustrated via `plotGenomeTrack` or exported to a BIGWIG file.
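+
+A minimal sketch, assuming :code:`model` is a trained keras model and
+:code:`dna_test` a :code:`Bioseq` dataset (both names are for illustration;
+the :code:`gindexer` attribute holds the genomic coordinates of the consumed regions):
+
+.. code-block:: python
+
+   # ordinary numpy array of predictions
+   pred = model.predict(dna_test)
+
+   # re-attach the genomic coordinates of the test regions
+   pred_cov = Cover.create_from_array('predictions', pred,
+                                      dna_test.gindexer)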
+
+
+Evaluation features
+----------------------------
+
+:code:`Cover` objects may be exported as BIGWIG files. Accordingly,
+for each condition in the :code:`Cover`, a separate file will be created.
+
+It is also possible to illustrate predictions in terms of
+a genome browser-like plot using `plotGenomeTrack`, which allows interactively exploring
+prediction scores (perhaps in comparison with the true labels) or
+feature activities of the internal layers of a neural net.
+`plotGenomeTrack` returns a matplotlib figure that can be saved to a file
+using native matplotlib functionality.
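+
+A sketch of both export paths, reusing the hypothetical :code:`pred_cov` from above
+(see the API reference for the exact import and signature of :code:`plotGenomeTrack`
+and :code:`export_to_bigwig`; the coordinates below are placeholders):
+
+.. code-block:: python
+
+   # one BIGWIG file per condition
+   pred_cov.export_to_bigwig(output_dir='/output/dir')
+
+   # genome browser-like plot of an example region
+   fig = plotGenomeTrack([pred_cov], 'chr1', 10000, 15000)
+   fig.savefig('tracks.png')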
+
+
+Rearranging channel dimensions
+------------------------------
+
+Depending on the deep learning library that is used, the dimensions
+of the tensors need to be set up in a specific order.
+For example, tensorflow expects the channel to be represented by the last
+dimension, while theano or pytorch expect the channel in the first dimension.
+With the option `channel_last=True/False` it is possible to configure the output
+dimensionality of :code:`Cover` and :code:`Bioseq`.
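+
+For example:
+
+.. code-block:: python
+
+   # channel in the last dimension (e.g. for tensorflow/keras)
+   Bioseq.create_from_refgenome('dna', refgenome, roi=roi,
+                                channel_last=True)
+
+   # channel in the first dimension (e.g. for theano or pytorch)
+   Bioseq.create_from_refgenome('dna', refgenome, roi=roi,
+                                channel_last=False)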
+
+Wrapper Datasets
+----------------
+
+A :code:`Cover` object represents a 4D array. However, sometimes one or more
+dimensions of a Cover might be singleton dimensions (i.e. contain only one element).
+These dimensions can be dropped using :code:`ReduceDim`,
+for example :code:`ReduceDim(cover)`.
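+
+A minimal sketch, assuming :code:`ReduceDim` is importable from :code:`janggu.data`:
+
+.. code-block:: python
+
+   from janggu.data import ReduceDim
+
+   # total read counts per region: a 4D Cover with singleton dimensions
+   cover = Cover.create_from_bam('genes', bamfiles=samplefile,
+                                 roi=geneannot, resolution=None)
+
+   # drop the singleton dimensions, e.g. to obtain a 2D array
+   # of shape (regions, conditions)
+   reduced = ReduceDim(cover)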
+
+
+Different views of datasets
+---------------------------
+
+Suppose you have already loaded the DNA sequence of a reference genome
+and you want to use different parts of it
+for training and for validating the model performance.
+This is achieved by the view mechanism, which allows
+reusing the same dataset by instantiating views that read out different subsets.
+
+For example, the following creates views for the training and test set, respectively:
+
+.. code-block:: python
+
+    from pkg_resources import resource_filename
+
+    from janggu.data import Bioseq
+    from janggu.data import view
+
+    # union ROI for the training and test set
+    ROI_FILE = resource_filename('janggu', 'resources/roi.bed')
+    ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
+    ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')
+
+    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
+                                       roi=ROI_FILE,
+                                       binsize=200,
+                                       store_whole_genome=True)
+
+    DNA_TRAIN = view(DNA, ROI_TRAIN_FILE)
+    DNA_TEST = view(DNA, ROI_TEST_FILE)
+
+
+Since the underlying dataset is only referenced rather than copied,
+the memory footprint does not increase. The views merely read out different parts
+of the genome.
+
+An example is illustrated in the `using view notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_view.ipynb>`_.
+
+
+Randomized dataset
+------------------
+
+In order to achieve good predictive performance,
+it is recommended to randomize the mini-batches during model fitting.
+This is usually achieved by specifying `shuffle=True` in the fit method.
+
+However, when using HDF5 datasets, this approach may be prohibitively slow due
+to the limitation that data in HDF5 files needs to be accessed in chunks
+rather than in random access fashion.
+
+In order to overcome this issue, it is possible to randomize the dataset
+already at loading time, such that the data can later be consumed
+in coherent chunks with `shuffle=False`.
+
+For example, randomization is induced by specifying an integer-valued
+:code:`random_state`, as in the example below:
+
+.. code-block:: python
+
+    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
+                                       roi=ROI_TRAIN_FILE,
+                                       binsize=200,
+                                       storage='hdf5',
+                                       cache=True,
+                                       store_whole_genome=False,
+                                       random_state=43)
+
+For this option to be effective and correct, all datasets consumed during
+e.g. training need to be created with the same :code:`random_state` value.
+Furthermore, the HDF5 file needs to be stored with :code:`store_whole_genome=False`,
+since the data layout is not affected by the random_state when the entire genome
+is stored.
+An example is illustrated in the `using hdf5 notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb>`_.
+
+==============================
+Output directory configuration
+==============================
+
+Optionally, janggu produces various kinds of output files, including cache files
+for the datasets, log files for monitoring the training / evaluation procedure,
+stored model parameters and summary output files about the evaluation performance.
+
+The root directory for the janggu output can be configured
+by setting the environment variable :code:`JANGGU_OUTPUT`.
+This might be done in one of the following ways:
+
+Setting the directory globally::
+
+   export JANGGU_OUTPUT='/output/dir'
+
+setting it on startup of the script::
+
+  JANGGU_OUTPUT='/output/dir' python classify.py
+
+or inside your model script using
+
+.. code:: python
+
+   import os
+   os.environ['JANGGU_OUTPUT']='/output/dir'
+
+If :code:`JANGGU_OUTPUT` is not set, the root directory defaults
+to :code:`/home/user/janggu_results`.