.. _storage:

================
Genomic Datasets
================

One of the central features of Janggu are the genomic datasets :code:`Cover` and
:code:`Bioseq`. On the one hand, they allow
quick and flexible access to genomics data, including **FASTA**,
**BAM**, **BIGWIG**, **BED** and **GFF** file formats, which bridges the gap
between the data being present in raw file formats
and the numpy inputs required for python-based deep learning models (e.g. keras).
On the other hand, predictions from a deep learning library are again in numpy
format. Janggu facilitates a conversion between numpy arrays and :code:`Cover` objects
in order to associate the predictions with the respective genomic coordinates.
Finally, coverage information may be exported to BIGWIG or inspected directly
via genome browser-like plots.


General principle of the Genomic Datasets
------------------------------------------
Internally, the genomic datasets maintain coverage or
sequence information along with the associated genomic intervals.
Externally, the datasets behave similarly to a numpy array. This
makes it possible to directly consume the datasets using keras, for instance.

In this section we briefly describe the internal organisation of these datasets.
The classes :code:`Cover` and :code:`Bioseq` maintain a
:code:`GenomicArray` and a :code:`GenomicIndexer` object.
:code:`GenomicArray` is a general data structure that holds numeric
information about the genome, for instance read count coverage.
It can be accessed via
genomic coordinates (e.g. chromosome name, start and end) and returns
the respective data as a numpy array.
The :code:`GenomicIndexer` maintains a list of genomic coordinates,
which should be traversed during training/evaluation.
The :code:`GenomicIndexer` can be accessed by an integer-valued
index which returns the associated genomic coordinates.

When querying the i-th region from :code:`Cover` or :code:`Bioseq`, the index is passed
to the :code:`GenomicIndexer`, which yields a genomic coordinate
that is passed on to the :code:`GenomicArray`.
The result is returned in numpy format.
Similarly, the dataset objects
also support slicing and indexing via a list of indices, which is usually relevant
when using mini-batch learning, as illustrated in the sketch below.
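
For illustration, consider this minimal sketch of the numpy-like access pattern
(:code:`samplefile` and :code:`roi` are placeholder variables, and the exact
shapes depend on how the dataset was created):

.. code-block:: python

   from janggu.data import Cover

   cover = Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                                 binsize=200)

   len(cover)     # number of regions in the GenomicIndexer
   cover.shape    # overall dataset dimensions
   cover[0]       # numpy array for the first region
   cover[[1, 3]]  # mini-batch containing regions 1 and 3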

Normalization
-------------
Upon creation of a :code:`Cover` object, normalization of the raw data might be required,
for instance, to make coverage tracks comparable across replicates or experiments.
To this end, :code:`create_from_bam`, :code:`create_from_bigwig`
and :code:`create_from_bed` expose a :code:`normalizer` option.
Janggu already implements various normalization methods which can be called by name,
e.g. TPM (transcripts per million) normalization:

.. code-block:: python

   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi, normalizer='tpm')

Other preprocessing and normalization options are: :code:`zscore`, :code:`zscorelog`, :code:`binsizenorm` and :code:`perctrim`.
The latter two apply read-depth normalization and trim the signal intensities at the 99th percentile, respectively.

Normalizers can also be applied via callables and/or in combination with other transformations.
For instance, suppose we want to trim the outliers at the 95th percentile instead and
subsequently apply the z-score transformation; then we could use

.. code-block:: python

   from janggu.data import PercentileTrimming
   from janggu.data import ZScore

   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                         normalizer=[PercentileTrimming(95), ZScore()])


It might be necessary to evaluate the normalization parameters on one dataset and apply the same
transformation to other datasets. For instance, in the case of the :code:`ZScore`, we might want to keep
the mean and standard deviation obtained from, say, the training set, and reuse them
to normalize the test set.
This is possible by creating a single :code:`ZScore` object that is used multiple times.
At the first invocation, the mean and standard deviation are determined and the transformation
is applied. Subsequently, the z-score is computed using the previously determined mean and standard deviation.
For example:

.. code-block:: python

   from janggu.data import ZScore

   zscore = ZScore()

   # First, mean and std will be determined.
   # Then the zscore transformation is applied.
   Cover.create_from_bam('track_train', bamfiles=samplefile, roi=roitrain,
                         normalizer=[zscore])

   # Subsequently, the zscore transformation is applied with
   # the same mean and std determined from the training set.
   Cover.create_from_bam('track_test', bamfiles=samplefile, roi=roitest,
                         normalizer=[zscore])

In case a different normalization procedure is required that is not contained in janggu,
it is possible to define a custom normalizer as follows:

.. code-block:: python

   def custom_normalizer(genomicarray):

      # perform the normalization on genomicarray

      return genomicarray

The currently implemented normalizers may be a good starting point
for this purpose.
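
The custom normalizer can then be handed over via the :code:`normalizer` option
just like the built-in ones (a minimal sketch; :code:`samplefile` and
:code:`roi` are placeholders):

.. code-block:: python

   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                         normalizer=[custom_normalizer])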


Granularity of the coverage
----------------------------

Depending on the application, a different granularity of the
coverage data might be required. For instance, one might be interested in reading out
nucleotide-resolution coverage for one purpose or 50 base-pair resolution bins for another.
Furthermore, in some cases the signal of variable-size regions might be of interest, for
example the read counts across gene bodies to measure gene expression levels.

These adjustments can be made when invoking :code:`create_from_bam`,
:code:`create_from_bigwig` and :code:`create_from_bed`
using an appropriate region of interest (ROI) file in conjunction
with specifying the :code:`resolution` and :code:`collapser` parameters.

First, the resolution parameter allows to adjust the coverage granularity.
For example, base-pair and 50-base-pair resolution would be possible using

.. code-block:: python

   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                         resolution=1)

   Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                         resolution=50)

.. sidebar:: janggu-trim

  When using an N base-pair resolution with :code:`N>1` in conjunction with the
  option :code:`store_whole_genome=True`, the region of interest start
  and end coordinates must be divisible by the resolution. Otherwise, undesired
  rounding effects might occur. This can be achieved by using :code:`janggu-trim`.
  See the section on command line tools.

In case the signal intensity should be summarized across the entire interval,
specify :code:`resolution=None`.
For instance, if the region of interest contains a set of variable-length
gene bodies, the total read count per gene can be obtained using

.. code-block:: python

   Cover.create_from_bam('genes',
                         bamfiles=samplefile,
                         roi=geneannot,
                         resolution=None)

It is also possible to use :code:`resolution=None` in conjunction with e.g. :code:`binsize=200`,
which would have the same effect as choosing :code:`binsize=resolution=200`.

Whenever we deal with :code:`resolution>1`, an aggregation operation needs to be performed
to summarize the signal intensity across the region. For instance, for
:code:`create_from_bam` the reads are summed within each interval.

For :code:`create_from_bigwig` and :code:`create_from_bed`,
it is possible to adjust the collapser. For example, 'mean' or 'sum' aggregation
can be applied by name or by handing over a callable, as shown below

.. code-block:: python

   import numpy as np

   Cover.create_from_bigwig('bwtrack',
                            bigwigfiles=samplefile,
                            roi=roi,
                            resolution=50,
                            collapser='mean')

   Cover.create_from_bigwig('bwtrack',
                            bigwigfiles=samplefile,
                            roi=roi,
                            resolution=50,
                            collapser=np.sum)


Moreover, more specialized aggregations may
require a custom collapser function. In that case,
it is important to note that the function expects a 3D numpy array and
the aggregation should be performed across the second dimension.
For example:

.. code-block:: python

   def custom_collapser(numpyarray):

      # Initially, the dimensions of numpyarray correspond to
      # (intervallength // resolution, resolution, strand)

      numpyarray = numpyarray.sum(axis=1)

      # Subsequently, we return the array of shape
      # (intervallength // resolution, strand)

      return numpyarray
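
The custom collapser can then be handed over in the same way as the named
variants (a minimal sketch; :code:`samplefile` and :code:`roi` are placeholders):

.. code-block:: python

   Cover.create_from_bigwig('bwtrack',
                            bigwigfiles=samplefile,
                            roi=roi,
                            resolution=50,
                            collapser=custom_collapser)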


Caching
--------

The construction, including loading and preprocessing,
of a genomic dataset might require a significant amount of time.
In order to avoid having to create the coverage profiles each time you want
to use them, they can be cached and quickly reloaded
later.
Caching can be activated via the option :code:`cache=True`.
When caching is activated, janggu will check for changes in the
file content, the file composition and various dataset-specific arguments
(e.g. binsize, resolution) by constructing a SHA256 hash. The dataset will
be reloaded from scratch if the determined hash does not exist yet; otherwise,
it is restored from the cache.

Example:

.. code:: python

   # load hg19 if the cache file does not exist yet, otherwise
   # reload it.
   Bioseq.create_from_refgenome('dna', refgenome, order=1, cache=True)


Dataset storage
---------------

Storage option
==============
Depending on the structure of the dataset, the required memory to store the data
and the available memory on your machine, different storage options are available
for the genomic datasets: as a **numpy array**, as a **sparse array** or as a **hdf5 dataset**.
To this end, :code:`create_from_bam`, :code:`create_from_bigwig`,
:code:`create_from_bed`, :code:`create_from_seq`
and :code:`create_from_refgenome` expose the `storage` option, which may be 'ndarray',
'sparse' or 'hdf5'.

'ndarray' offers perhaps the fastest access time,
but is also the most memory-demanding option for storing the data.
It might be useful for dense datasets and relatively small datasets that conveniently
fit into memory.

If the data is sparse, the option `sparse` yields a good compromise between access time
and memory consumption. In that case, the data is stored in its compressed sparse form and converted
to a dense representation when querying mini-batches.
This option may be used to store e.g. genome-wide ChIP-seq peak profiles, if peaks
occur relatively rarely.

Finally, if the data is too large to be kept in memory, the option
`hdf5` allows to consume the data directly from disk. While
the access time for processing data from hdf5 files may be higher,
it allows to process huge datasets with a small amount of RAM in your machine.
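
For example, a large coverage track might be consumed directly from disk as
follows (a minimal sketch; :code:`samplefile` and :code:`roi` are placeholders,
and caching is enabled since the hdf5 file is kept on disk):

.. code-block:: python

   Cover.create_from_bam('track',
                         bamfiles=samplefile,
                         roi=roi,
                         storage='hdf5',
                         cache=True)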


Whole and partial genome storage
================================

:code:`Cover` and :code:`Bioseq` further allow to maintain coverage and sequence information
for the entire genome or only for the parts that are actively consumed during training.
This option can be configured via :code:`store_whole_genome=True/False`.

In most situations, the user may find it convenient to set `store_whole_genome=False`.
In that case, when loading :code:`Cover` and :code:`Bioseq`, only information overlapping
the region of interest will be gathered. The advantage is that no
overhead of information needs to be stored when only a small part of the genome is of interest
for consumption.

On the other hand, `store_whole_genome=True` might be advantageous
for the following purposes (see the sketch after this list):

1. If a large part of the genome is consumed for training/evaluation.
2. If, in addition, the `stepsize` for traversing the genome is smaller than `binsize`, in which case mutually overlapping intervals do not have to be stored redundantly.
3. It simplifies sharing of the same genomic array for different tasks. For example, during training and testing, different parts of the same genomic array may be consumed.
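
As a minimal sketch of the second point, overlapping windows can be read out
from one shared genomic array (the parameter values and the :code:`samplefile`
and :code:`roi` variables are illustrative):

.. code-block:: python

   # 200 bp windows shifted by 50 bp overlap each other,
   # but are stored only once in the underlying genomic array.
   Cover.create_from_bam('track',
                         bamfiles=samplefile,
                         roi=roi,
                         binsize=200,
                         stepsize=50,
                         store_whole_genome=True)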


Converting Numpy to Cover
-------------------------

When performing predictions, e.g. with a keras model,
the output corresponds to an ordinary numpy array.
In order to re-establish the association of the predicted values
with the genomic coordinates, **Cover** exposes the constructor `create_from_array`.
Upon invocation, a new :code:`Cover` object is composed that holds the predicted values.
These predictions may subsequently be illustrated via `plotGenomeTrack` or exported to a BIGWIG file.
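
For illustration, a minimal sketch (assuming a trained keras :code:`model` and
a dataset :code:`DNA` whose :code:`gindexer` carries the genomic coordinates):

.. code-block:: python

   predictions = model.predict(DNA)

   # Associate the predicted values with the genomic coordinates.
   pred_cover = Cover.create_from_array('predictions',
                                        predictions,
                                        DNA.gindexer)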


Evaluation features
----------------------------

:code:`Cover` objects may be exported as BIGWIG files. Accordingly,
one file will be created for each condition in the :code:`Cover`.

It is also possible to illustrate predictions in terms of
a genome browser-like plot using `plotGenomeTrack`, which allows to interactively explore
prediction scores (perhaps in comparison with the true labels) or
feature activities of the internal layers of a neural net.
`plotGenomeTrack` returns a matplotlib figure that can be stored into a file
using native matplotlib functionality.
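
A minimal sketch of both export paths (:code:`pred_cover` refers to the object
from the previous section; the output folder, file name and plotted region are
illustrative):

.. code-block:: python

   from janggu import plotGenomeTrack

   # One BIGWIG file is created per condition in the Cover.
   pred_cover.export_to_bigwig(output_dir='bigwigs')

   # Genome browser-like plot of an illustrative region.
   fig = plotGenomeTrack([pred_cover], 'chr1', 10000, 15000)
   fig.savefig('predictions.png')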
309
310
311
Rearranging channel dimensions
312
------------------------------
313
314
Depending on the deep learning library that is used, the dimensionality
315
of the tensors need to be set up in a specific order.
316
For example, tensorflow expects the channel to be represented by the last
317
dimension, while theano or pytorch expect the channel at the first dimension.
318
With the option `channel_last=True/False` it is possible to configure the output
319
dimensionality of :code:`Cover` and :code:`Bioseq`.
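
For example, a channel-first layout for theano- or pytorch-style models might
be requested as follows (a minimal sketch; :code:`REFGENOME` and
:code:`ROI_FILE` are placeholders):

.. code-block:: python

   DNA = Bioseq.create_from_refgenome('dna',
                                      refgenome=REFGENOME,
                                      roi=ROI_FILE,
                                      binsize=200,
                                      channel_last=False)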

Wrapper Datasets
----------------

A Cover object represents a 4D object. However, sometimes one or more
dimensions of Cover might be single-dimensional (e.g. containing only one element).
These dimensions can be dropped using :code:`ReduceDim`,
for example :code:`ReduceDim(cover)`.
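
A minimal sketch (the shapes in the comments are illustrative and depend on how
the cover was created):

.. code-block:: python

   from janggu.data import ReduceDim

   # e.g. cover.shape == (regions, 1, 1, 1)
   cover = ReduceDim(cover)
   # afterwards e.g. cover.shape == (regions, 1)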


Different views of the dataset
------------------------------

Suppose you have already loaded the DNA sequence from a reference genome
and you want to use different parts of it
for training and validating the model performance.
This is achieved by the view mechanism, which allows to
reuse the same dataset by instantiating views that read out different subsets.

For example, the following views constitute the training and test set, respectively.

.. code-block:: python

    # union ROI for training and test set.
    ROI_FILE = resource_filename('janggu', 'resources/roi.bed')
    ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
    ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')

    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                       roi=ROI_FILE,
                                       binsize=200,
                                       store_whole_genome=True)

    DNA_TRAIN = view(DNA, ROI_TRAIN_FILE)
    DNA_TEST = view(DNA, ROI_TEST_FILE)


Since underneath the actual dataset is just referenced rather than copied,
the memory footprint won't increase. The views just allow to read out different parts
of the genome.

An example is illustrated in the `using view notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_view.ipynb>`_.


Randomized dataset
------------------

In order to achieve good predictive performance,
it is recommended to randomize the mini-batches during model fitting.
This is usually achieved by specifying `shuffle=True` in the fit method.

However, when using an HDF5 dataset, this approach may be prohibitively slow due
to the limitation that data from HDF5 files needs to be accessed in chunks
rather than in a random-access fashion.

In order to overcome this issue, it is possible to randomize the dataset
already at loading time such that the data can later be consumed
in coherent chunks by setting `shuffle=False`.

For example, randomization is induced by specifying an integer-valued
:code:`random_state` as in the example below

.. code-block:: python

    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                       roi=ROI_TRAIN_FILE,
                                       binsize=200,
                                       storage='hdf5',
                                       cache=True,
                                       store_whole_genome=False,
                                       random_state=43)

For this option to be effective and correct, all datasets consumed during
e.g. training need to be provided with the same :code:`random_state` value.
Furthermore, the HDF5 file needs to be stored with :code:`store_whole_genome=False`,
since data storage is not affected by the random_state when the entire genome
is stored.
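
For instance, the corresponding labels might be loaded with the same
:code:`random_state` (a minimal sketch; :code:`PEAK_FILE` is a hypothetical
BED file containing the target labels):

.. code-block:: python

    LABELS = Cover.create_from_bed('peaks',
                                   bedfiles=PEAK_FILE,
                                   roi=ROI_TRAIN_FILE,
                                   binsize=200,
                                   storage='hdf5',
                                   cache=True,
                                   store_whole_genome=False,
                                   random_state=43)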

An example is illustrated in the `using hdf5 notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb>`_.

==============================
Output directory configuration
==============================

Optionally, janggu produces various kinds of output files, including cache files
for the datasets, log files for monitoring the training / evaluation procedure,
stored model parameters or summary output files about the evaluation performance.

The root directory specifying the janggu output location can be configured
by setting the environment variable :code:`JANGGU_OUTPUT`.
This might be done in the following ways:

Setting the directory globally::

   export JANGGU_OUTPUT='/output/dir'

on startup of the script::

  JANGGU_OUTPUT='/output/dir' python classify.py

or inside your model script using

.. code:: python

   import os
   os.environ['JANGGU_OUTPUT'] = '/output/dir'

If :code:`JANGGU_OUTPUT` is not set, the root directory will default
to :code:`/home/user/janggu_results`.