.. _storage:

================
Genomic Datasets
================

One of the central features of Janggu is its genomic datasets, :code:`Cover` and
:code:`Bioseq`. On the one hand, they allow
quick and flexible access to genomics data, including the **FASTA**,
**BAM**, **BIGWIG**, **BED** and **GFF** file formats, which bridges the gap
between data in raw file formats
and the numpy inputs required for python-based deep learning models (e.g. keras).
On the other hand, predictions from a deep learning library are again in numpy
format. Janggu facilitates the conversion between numpy arrays and :code:`Cover` objects
in order to associate the predictions with the respective genomic coordinates.
Finally, coverage information may be exported to BIGWIG or inspected directly
via genome browser-like plots.


General principle of the Genomic Datasets
------------------------------------------
Internally, the genomic datasets maintain coverage- or
sequence-type information along with the associated genomic intervals.
Externally, the datasets behave similarly to a numpy array. This
makes it possible to directly consume the datasets using keras, for instance.

In this section we briefly describe the internal organisation of these datasets.
The classes :code:`Cover` and :code:`Bioseq` each maintain a
:code:`GenomicArray` and a :code:`GenomicIndexer` object.
:code:`GenomicArray` is a general data structure that holds numeric
information about the genome, for instance read count coverage.
It can be accessed via
genomic coordinates (e.g. chromosome name, start and end) and returns
the respective data as a numpy array.
The :code:`GenomicIndexer` maintains the list of genomic coordinates
that should be traversed during training/evaluation.
The :code:`GenomicIndexer` can be accessed by an integer-valued
index, which returns the associated genomic coordinates.

When querying the i-th region from :code:`Cover` or :code:`Bioseq`, the index is passed
to the :code:`GenomicIndexer`, which yields the genomic coordinates
that are passed on to the :code:`GenomicArray`.
The result is returned in numpy format.
Similarly, the dataset objects
also support slicing and indexing via a list of indices, which is usually relevant
for mini-batch learning.
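
As a rough sketch of this numpy-like access pattern (assuming a previously
created :code:`Cover` object named :code:`cover`; the exact shape depends on
the binsize, resolution and number of conditions):

.. code-block:: python

    len(cover)    # number of regions in the GenomicIndexer
    cover.shape   # e.g. (regions, binsize // resolution, strand, conditions)

    cover[0]          # coverage of the first region as a numpy array
    cover[[0, 1, 2]]  # mini-batch of the first three regions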

Normalization
-------------
Upon creation of a :code:`Cover` object, normalization of the raw data might be required,
for instance, to make coverage tracks comparable across replicates or experiments.
To this end, :code:`create_from_bam`, :code:`create_from_bigwig`
and :code:`create_from_bed` expose a :code:`normalizer` option.
Janggu already implements various normalization methods that can be invoked by name,
e.g. TPM (transcripts per million) normalization. For instance, using

.. code-block:: python

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi, normalizer='tpm')

Other preprocessing and normalization options are :code:`zscore`, :code:`zscorelog`, :code:`binsizenorm` and :code:`perctrim`.
The latter two normalize for read depth and trim the signal intensities at the 99th percentile, respectively.

Normalizers can also be applied via callables and/or in combination with other transformations.
For instance, suppose we want to trim the outliers at the 95th percentile instead and
subsequently apply the z-score transformation; then we could use

.. code-block:: python

    from janggu.data import PercentileTrimming
    from janggu.data import ZScore

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          normalizer=[PercentileTrimming(95), ZScore()])


It might be necessary to evaluate the normalization parameters on one dataset and apply the same
transformation to other datasets. For instance, in the case of the ZScore, we might want to keep
the mean and standard deviation that were obtained from, say, the training set, and reuse them
to normalize the test set.
This is possible by creating a single ZScore object that is used multiple times.
At the first invocation, the mean and standard deviation are determined and the transformation
is applied. Subsequently, the z-score is computed using the previously determined mean and standard deviation.
For example:

.. code-block:: python

    from janggu.data import ZScore

    zscore = ZScore()

    # First, mean and std will be determined.
    # Then the zscore transformation is applied.
    Cover.create_from_bam('track_train', bamfiles=samplefile, roi=roitrain,
                          normalizer=[zscore])

    # Subsequently, the zscore transformation is applied with
    # the same mean and std determined from the training set.
    Cover.create_from_bam('track_test', bamfiles=samplefile, roi=roitest,
                          normalizer=[zscore])

In case a normalization procedure is required that is not contained in janggu,
it is possible to define a custom normalizer as follows:

.. code-block:: python

    def custom_normalizer(genomicarray):

        # perform the normalization on genomicarray

        return genomicarray

The currently implemented normalizers may be a good starting point
for this purpose.


Granularity of the coverage
----------------------------

Depending on the application, a different granularity of the
coverage data might be required. For instance, one might be interested in reading out
nucleotide-resolution coverage for one purpose, or 50 base-pair resolution bins for another.
Furthermore, in some cases the signal of variable-size regions might be of interest, for
example the read counts across gene bodies, to measure gene expression levels.

These adjustments can be made when invoking :code:`create_from_bam`,
:code:`create_from_bigwig` and :code:`create_from_bed`
using an appropriate region of interest (ROI) file in conjunction
with the :code:`resolution` and :code:`collapser` parameters.

First, the :code:`resolution` parameter controls the coverage granularity.
For example, base-pair and 50 base-pair resolution would be obtained using

.. code-block:: python

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          resolution=1)

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          resolution=50)

.. sidebar:: janggu-trim

    When using an N base-pair resolution with :code:`N>1` in conjunction with the
    option :code:`store_whole_genome=True`, the start and end coordinates of the
    region of interest must be divisible by the resolution. Otherwise, undesired
    rounding effects might occur. This can be achieved by using :code:`janggu-trim`;
    see the section on command line tools.

In case the signal intensity should be summarized across the entire interval,
specify :code:`resolution=None`.
For instance, if the region of interest contains a set of variable-length
gene bodies, the total read count per gene can be obtained using

.. code-block:: python

    Cover.create_from_bam('genes',
                          bamfiles=samplefile,
                          roi=geneannot,
                          resolution=None)

It is also possible to use :code:`resolution=None` in conjunction with e.g. :code:`binsize=200`,
which would have the same effect as choosing :code:`binsize=resolution=200`.
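
For example, the following two invocations should yield equivalent results
(a brief sketch, reusing :code:`samplefile` and :code:`roi` from above):

.. code-block:: python

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          binsize=200, resolution=None)

    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          binsize=200, resolution=200)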

Whenever we deal with :code:`resolution>1`, an aggregation operation needs to be performed
to summarize the signal intensity across the region. For instance, for
:code:`create_from_bam` the reads are summed within each interval.

For :code:`create_from_bigwig` and :code:`create_from_bed`,
it is possible to adjust the collapser. For example, 'mean' or 'sum' aggregation
can be applied by name or by handing over a callable, according to

.. code-block:: python

    import numpy as np

    Cover.create_from_bigwig('bwtrack',
                             bigwigfiles=samplefile,
                             roi=roi,
                             resolution=50,
                             collapser='mean')

    Cover.create_from_bigwig('bwtrack',
                             bigwigfiles=samplefile,
                             roi=roi,
                             resolution=50,
                             collapser=np.sum)


Moreover, more specialized aggregations may
require a custom collapser function. In that case,
it is important to note that the function receives a 3D numpy array and
that the aggregation should be performed across the second dimension.
For example

.. code-block:: python

    def custom_collapser(numpyarray):

        # Initially, the dimensions of numpyarray correspond to
        # (intervallength // resolution, resolution, strand)
        numpyarray = numpyarray.sum(axis=1)

        # Subsequently, we return the array of shape
        # (intervallength // resolution, strand)
        return numpyarray
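
The custom collapser can then be handed over like any built-in aggregator
(a brief sketch, reusing :code:`samplefile` and :code:`roi` from above):

.. code-block:: python

    Cover.create_from_bigwig('bwtrack',
                             bigwigfiles=samplefile,
                             roi=roi,
                             resolution=50,
                             collapser=custom_collapser)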


Caching
--------

The construction of a genomic dataset, including loading and preprocessing,
might require a significant amount of time.
In order to avoid having to create the coverage profiles each time you want
to use them, they can be cached and quickly reloaded
later.
Caching can be activated via the option :code:`cache=True`.
When caching is activated, janggu will check for changes in the
file content, the file composition and various dataset-specific arguments
(e.g. binsize, resolution) by computing a SHA256 hash. The dataset will
be rebuilt from scratch if the determined hash does not exist yet; otherwise
the cached dataset is reloaded.

Example:

.. code:: python

    # load hg19 if the cache file does not exist yet, otherwise
    # reload it.
    Bioseq.create_from_refgenome('dna', refgenome, order=1, cache=True)


Dataset storage
---------------

Storage option
==============
Depending on the structure of the dataset, the memory required to store the data
and the memory available on your machine, different storage options are available
for the genomic datasets: as dense **numpy array**, as **sparse array** or as **hdf5 dataset**.
To this end, :code:`create_from_bam`, :code:`create_from_bigwig`,
:code:`create_from_bed`, :code:`create_from_seq`
and :code:`create_from_refgenome` expose the `storage` option, which may be 'ndarray',
'sparse' or 'hdf5'.

'ndarray' offers perhaps the fastest access time,
but it is also the most memory-demanding option for storing the data.
It is useful for dense datasets and for relatively small datasets that conveniently
fit into memory.

If the data is sparse, the option 'sparse' yields a good compromise between access time
and memory consumption. In that case, the data is stored in its compressed sparse form and converted
to a dense representation when querying mini-batches.
This option may be used to store e.g. genome-wide ChIP-seq peak profiles, since peaks
occur relatively rarely.

Finally, if the data is too large to be kept in memory, the option
'hdf5' allows the data to be consumed directly from disk. While
the access time for processing data from hdf5 files may be higher,
it allows huge datasets to be processed with a small amount of RAM in your machine.
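
A brief sketch of these options (reusing :code:`samplefile` and :code:`roi` from
above; the parameter name :code:`bedfiles` is assumed by analogy with the other
constructors, and 'hdf5' is combined with caching so the data persists on disk):

.. code-block:: python

    # sparse storage, e.g. for rarely occurring peaks
    Cover.create_from_bed('peaks', bedfiles=samplefile, roi=roi,
                          storage='sparse')

    # hdf5 storage for datasets that do not fit into memory
    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          storage='hdf5', cache=True)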

Whole and partial genome storage
================================

:code:`Cover` and :code:`Bioseq` further allow maintaining coverage and sequence information
for the entire genome or only for the part that is actively consumed during training.
This option can be configured via :code:`store_whole_genome=True/False`.

In most situations, the user may find it convenient to set `store_whole_genome=False`.
In that case, when loading :code:`Cover` and :code:`Bioseq`, only information overlapping
the region of interest will be gathered. The advantage is that no overhead
of information needs to be stored when only a small part of the genome is of interest
for consumption.

On the other hand, `store_whole_genome=True` might be an advantage
for the following purposes:

1. If a large part of the genome is consumed for training/evaluation.
2. If, in addition, the `stepsize` for traversing the genome is smaller than `binsize`, in which case mutually overlapping intervals do not have to be stored redundantly.
3. It simplifies sharing the same genomic array between different tasks; for example, during training and testing, different parts of the same genomic array may be consumed (see the sketch below).
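
Both flavors are set through the same loading call (a brief sketch;
:code:`REFGENOME` and :code:`ROI_FILE` are defined as in the views example below):

.. code-block:: python

    # store only the regions overlapping the ROI
    Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                 roi=ROI_FILE, store_whole_genome=False)

    # store the entire genome; the ROI only determines
    # which regions are traversed
    Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                 roi=ROI_FILE, store_whole_genome=True)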


Converting Numpy to Cover
-------------------------

When performing predictions, e.g. with a keras model,
the output corresponds to an ordinary numpy array.
In order to re-establish the association of the predicted values
with the genomic coordinates, **Cover** exposes the constructor `create_from_array`.
Upon invocation, a new :code:`Cover` object is composed that holds the predicted values.
These predictions may subsequently be illustrated via `plotGenomeTrack` or exported to a BIGWIG file.
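
A brief sketch (assuming a trained keras :code:`model`, an input dataset
:code:`dna` and a :code:`Cover` object :code:`labels` whose :code:`gindexer`
describes the predicted regions; these names are illustrative):

.. code-block:: python

    predictions = model.predict(dna)

    # re-attach the genomic coordinates to the numpy predictions
    pred_cover = Cover.create_from_array('predictions', predictions,
                                         labels.gindexer)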


Evaluation features
-------------------

:code:`Cover` objects may be exported as BIGWIG files. Accordingly,
one file will be created for each condition in the :code:`Cover`.

It is also possible to illustrate predictions in terms of
a genome browser-like plot using `plotGenomeTrack`, which allows interactively exploring
prediction scores (perhaps in comparison with the true labels) or
feature activities of the internal layers of a neural net.
`plotGenomeTrack` returns a matplotlib figure that can be stored in a file
using native matplotlib functionality.
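
A brief sketch (assuming two :code:`Cover` objects, :code:`predicted` and
:code:`actual`, an arbitrarily chosen region, and that :code:`plotGenomeTrack`
is importable from the top-level package):

.. code-block:: python

    from janggu import plotGenomeTrack

    fig = plotGenomeTrack([predicted, actual], 'chr1', 16000, 18000)
    fig.savefig('tracks.png')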


Rearranging channel dimensions
------------------------------

Depending on the deep learning library that is used, the dimensions
of the tensors need to be set up in a specific order.
For example, tensorflow expects the channel to be represented by the last
dimension, while theano or pytorch expect the channel at the first dimension.
With the option `channel_last=True/False` it is possible to configure the output
dimensionality of :code:`Cover` and :code:`Bioseq`.
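
A brief sketch (reusing :code:`samplefile` and :code:`roi` from above):

.. code-block:: python

    # channel as last dimension, e.g. for tensorflow
    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          channel_last=True)

    # channel as first dimension, e.g. for pytorch
    Cover.create_from_bam('track', bamfiles=samplefile, roi=roi,
                          channel_last=False)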

Wrapper Datasets
----------------

A :code:`Cover` object represents a 4D dataset. However, sometimes one or more
dimensions of :code:`Cover` might be single-dimensional (i.e. contain only one element).
These dimensions can be dropped using :code:`ReduceDim`,
for example :code:`ReduceDim(cover)`.
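
A brief sketch (assuming a :code:`cover` whose length and strand dimensions
each contain a single element; the shapes in the comments are illustrative):

.. code-block:: python

    from janggu.data import ReduceDim

    # e.g. cover.shape == (regions, 1, 1, conditions)
    reduced = ReduceDim(cover)
    # reduced.shape == (regions, conditions)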


Different views of a dataset
----------------------------

Suppose you have already loaded the DNA sequence from a reference genome
and you want to use different parts of it
for training and for validating the model performance.
This is achieved by the view mechanism, which allows
the same dataset to be reused by instantiating views that read out different subsets.

For example, views constituting the training and test set, respectively:

.. code-block:: python

    # union ROI for training and test set.
    ROI_FILE = resource_filename('janggu', 'resources/roi.bed')
    ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')
    ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')

    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                       roi=ROI_FILE,
                                       binsize=200,
                                       store_whole_genome=True)

    DNA_TRAIN = view(DNA, ROI_TRAIN_FILE)
    DNA_TEST = view(DNA, ROI_TEST_FILE)


Since, underneath, the actual dataset is referenced rather than copied,
the memory footprint does not increase; the views merely read out different parts
of the genome.

An example is illustrated in the `using view notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_view.ipynb>`_.


Randomized dataset
------------------

In order to achieve good predictive performance,
it is recommended to randomize the mini-batches during model fitting.
This is usually achieved by specifying `shuffle=True` in the fit method.

However, when using an HDF5 dataset, this approach may be prohibitively slow due
to the limitation that data from HDF5 files needs to be accessed in chunks
rather than in random-access fashion.

In order to overcome this issue, it is possible to randomize the dataset
already at loading time, such that the data can later be consumed
in coherent chunks with `shuffle=False`.

For example, randomization is induced by specifying an integer-valued
:code:`random_state`, as in the example below

.. code-block:: python

    DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                       roi=ROI_TRAIN_FILE,
                                       binsize=200,
                                       storage='hdf5',
                                       cache=True,
                                       store_whole_genome=False,
                                       random_state=43)

For this option to be effective and correct, all datasets consumed during
e.g. training need to be provided with the same :code:`random_state` value.
Furthermore, the HDF5 file needs to be stored with :code:`store_whole_genome=False`,
since data storage is not affected by the random_state when the entire genome
is stored.
An example is illustrated in the `using hdf5 notebook <https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb>`_.

==============================
Output directory configuration
==============================

Optionally, janggu produces various kinds of output files, including cache files
for the datasets, log files for monitoring the training / evaluation procedure,
stored model parameters and summary files about the evaluation performance.

The root directory specifying the janggu output location can be configured
by setting the environment variable :code:`JANGGU_OUTPUT`.
This might be done in the following ways:

Setting the directory globally::

    export JANGGU_OUTPUT='/output/dir'

on startup of the script::

    JANGGU_OUTPUT='/output/dir' python classify.py

or inside your model script using

.. code:: python

    import os
    os.environ['JANGGU_OUTPUT'] = '/output/dir'

If :code:`JANGGU_OUTPUT` is not set, the root directory will be set
to :code:`/home/user/janggu_results`.