=========
Tutorial
=========

This tutorial is split into three parts.

Part I covers the genomic datasets
available through Janggu, which can be directly consumed
by your keras model.
It illustrates how to access genomics
data from various widely used file formats, including FASTA, BAM, BIGWIG, BED
and GFF, for the purpose of using them as input to a deep learning application.
It illustrates a range of parameters to adapt the read-out of genomics data,
and it shows how the predictions or feature activities of a neural network
in numpy format can be converted to a genomic coverage representation
that can in turn be exported to BIGWIG file format
or visualized directly via a genome browser-like plot.

Part II covers utilities for defining neural networks based on keras.

Part III illustrates Janggu's evaluation utilities.

Complementary to this tutorial, the janggu repository contains
a number of jupyter notebooks that illustrate example workflows with keras or
sklearn:

.. _notebook_tutorials:

+----------------------------------------------------+
| Example notebooks                                  |
+====================================================+
| `keras cnn example`_                               |
+----------------------------------------------------+
| `keras cnn example with JangguSequence`_           |
+----------------------------------------------------+
| `sklearn example`_                                 |
+----------------------------------------------------+
| `pytorch example`_                                 |
+----------------------------------------------------+
| `janggu example I`_                                |
+----------------------------------------------------+
| `janggu example II`_                               |
+----------------------------------------------------+
| `reusing datasets with view`_                      |
+----------------------------------------------------+
| `hyperparameter optimization`_                     |
+----------------------------------------------------+
| `randomizing HDF5 data`_                           |
+----------------------------------------------------+
| `variant effect prediction`_                       |
+----------------------------------------------------+
| `variant effect prediction - part II`_             |
+----------------------------------------------------+
| `plotting genome coverage`_                        |
+----------------------------------------------------+

.. _`keras cnn example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/keras_convnet_example.ipynb
.. _`keras cnn example with JangguSequence`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/keras_convnet_example_w_sequence.ipynb
.. _`sklearn example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/sklearn_example_with_kmers.ipynb
.. _`pytorch example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/pytorch_convnet_example.ipynb
.. _`janggu example I`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/classify_nucleotide_sequences.ipynb
.. _`janggu example II`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples.ipynb
.. _`reusing datasets with view`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb
.. _`randomizing HDF5 data`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb
.. _`variant effect prediction`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/variant_effect_prediction.ipynb
.. _`variant effect prediction - part II`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/variant_effect_prediction-part2.ipynb
.. _`plotting genome coverage`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/plot_coverage.ipynb
.. _`hyperparameter optimization`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/hyperparameter_optimization_w_janggu.ipynb


Furthermore, use cases
for predicting JunD binding, training and adapting published genomics models,
as well as a regression model example, are demonstrated in
the supplementary repository: `Janggu use cases`_

.. _`Janggu use cases`: https://github.com/wkopp/janggu_usecases


Part I) Introduction to Genomic Datasets
-----------------------------------------

.. sidebar:: Genomic Datasets

   Most of the parameters are consistent across
   :class:`Bioseq` and :class:`Cover`.

:mod:`janggu.data` provides Dataset classes
that can be used for
training and evaluating neural networks.
Of particular importance are the genomics-specific datasets
:class:`Bioseq` and :class:`Cover`, which
allow easy access to genomics data,
including DNA sequences and coverage information.
Apart from accessing raw genomics data, :code:`Janggu`
also provides a method for converting an ordinary
numpy array (e.g. predictions obtained from a neural net)
to a :code:`Cover` object. This enables the user to export predictions
in BIGWIG format or to interactively plot genome browser tracks.
In this tutorial, we demonstrate some of the key functionality of
Janggu. Further details are available in :ref:`storage`
and :ref:`reference-label`.

Bioseq
^^^^^^^^^^
The :class:`Bioseq` class can be used to load nucleotide
or protein sequence data from
fasta files or from a reference genome
along with a set of genomic coordinates defining the region of interest (ROI).
The class facilitates access to the
*one-hot encoding* representation of the sequences.
Specifically,
the *one-hot encoding* is represented as a
4D array with dimensions corresponding
to :code:`(region, region_length, 1, alphabet_size)`.
Bioseq offers a number of features:

1. Strand-specific sequence extraction (if DNA sequences are extracted from the reference genome)
2. Higher-order one-hot encoding, e.g. di-nucleotide based

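
Conceptually, the one-hot encoding can be sketched in a few lines of plain numpy. The snippet below is a simplified illustration of the idea, not Janggu's actual implementation:

```python
import numpy as np

def one_hot(seq, alphabet='ACGT'):
    """Encode a DNA string as a (1, len(seq), 1, len(alphabet)) one-hot array."""
    lookup = {char: i for i, char in enumerate(alphabet)}
    arr = np.zeros((1, len(seq), 1, len(alphabet)), dtype='int8')
    for pos, char in enumerate(seq):
        arr[0, pos, 0, lookup[char]] = 1
    return arr

encoding = one_hot('CACAG')
# dimensions: (region, region_length, 1, alphabet_size)
print(encoding.shape)  # (1, 5, 1, 4)
```

The leading region axis lets many such encodings be stacked into a single batch-ready array.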
Sequences can be loaded in two ways: using
:code:`Bioseq.create_from_seq` or
:code:`Bioseq.create_from_refgenome`.
The former constructor method can be used to load
DNA or protein sequences from fasta files directly
or from a list of :code:`Bio.SeqRecord.SeqRecord` entries.
An example is shown below:

.. code-block:: python

   from pkg_resources import resource_filename
   from janggu.data import Bioseq

   fasta_file = resource_filename('janggu',
                                  'resources/sample.fa')

   dna = Bioseq.create_from_seq(name='dna',
                                fastafile=fasta_file)

   # there are 3897 sequences in sample.fa
   len(dna)

   # each sequence is 200 bp long
   dna.shape  # is (3897, 200, 1, 4)

   # One-hot encoding for the first 10 bases of the first region
   dna[0][0, :10, 0, :]

   #array([[0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0],
   #       [0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0]], dtype=int8)

Furthermore, it is possible to trim sequences of variable length using
the :code:`fixedlen` option. If specified, all sequences will be truncated
or zero-padded to length `fixedlen`. For example,

.. code-block:: python

   dna = Bioseq.create_from_seq(name='dna',
                                fastafile=fasta_file,
                                fixedlen=205)

   # each sequence is 205 bp long
   dna.shape  # is (3897, 205, 1, 4)

   # the last 5 positions were zero-padded
   dna[0][0, -6:, 0, :]

   #array([[1, 0, 0, 0],
   #       [0, 0, 0, 0],
   #       [0, 0, 0, 0],
   #       [0, 0, 0, 0],
   #       [0, 0, 0, 0],
   #       [0, 0, 0, 0]], dtype=int8)


Alternatively, nucleotide sequences can be
obtained directly from a reference genome along with
a BED or GFF file that indicates the region of interest (ROI).

If each interval in the BED file already corresponds
to a 'datapoint' that shall be consumed during training, as
is the case for 'sample_equalsize.bed', the associated DNA sequences
can be loaded according to

.. code-block:: python

   roi = resource_filename('janggu',
                           'resources/sample_equalsize.bed')
   refgenome = resource_filename('janggu',
                                 'resources/sample_genome.fa')

   dna = Bioseq.create_from_refgenome(name='dna',
                                      refgenome=refgenome,
                                      roi=roi)

   dna.shape  # is (4, 200, 1, 4)
   # One-hot encoding of the first 10 nucleotides in region 0
   dna[0][0, :10, 0, :]

   #array([[0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0],
   #       [0, 1, 0, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0],
   #       [1, 0, 0, 0],
   #       [0, 0, 1, 0]], dtype=int8)


Sometimes it is more convenient to provide the ROI
as a set of variable-sized broad intervals
(e.g. chr1:10000-50000 and chr3:4000-8000)
which should be divided into sub-intervals
of equal length (e.g. of length 200 bp).
This can be achieved
by explicitly specifying a desired :code:`binsize`
and :code:`stepsize` as shown below:

.. code-block:: python

   roi = resource_filename('janggu',
                           'resources/sample.bed')

   # loading non-overlapping intervals
   dna = Bioseq.create_from_refgenome(name='dna',
                                      refgenome=refgenome,
                                      roi=roi,
                                      binsize=200,
                                      stepsize=200)

   dna.shape  # is (100, 200, 1, 4)

   # loading mutually overlapping intervals
   dna = Bioseq.create_from_refgenome(name='dna',
                                      refgenome=refgenome,
                                      roi=roi,
                                      binsize=200,
                                      stepsize=50)

   dna.shape  # is (394, 200, 1, 4)

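
The number of sub-intervals produced per ROI interval follows the usual sliding-window formula. A quick sanity check of that arithmetic (an illustration, not Janggu code, with a hypothetical interval length):

```python
def n_windows(interval_length, binsize, stepsize):
    """Number of bins obtained by sliding a window of `binsize`
    over an interval with a step of `stepsize`."""
    return (interval_length - binsize) // stepsize + 1

print(n_windows(1000, 200, 200))  # 5 non-overlapping bins
print(n_windows(1000, 200, 50))   # 17 mutually overlapping bins
```

This explains why the first dimension grows from 100 to 394 above when the stepsize is reduced from 200 to 50 over the same ROI.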
The argument :code:`flank` can be used to extend
the intervals upstream and downstream by a given length

.. code-block:: python

   dna = Bioseq.create_from_refgenome(name='dna',
                                      refgenome=refgenome,
                                      roi=roi,
                                      binsize=200,
                                      stepsize=200,
                                      flank=100)

   dna.shape  # is (100, 400, 1, 4)

Finally, sequences can be represented using a **higher-order**
one-hot representation via the :code:`order` argument. An example
of a di-nucleotide-based one-hot representation is shown below

.. code-block:: python

   dna = Bioseq.create_from_refgenome(name='dna',
                                      refgenome=refgenome,
                                      roi=roi,
                                      binsize=200,
                                      stepsize=200,
                                      order=2)

   # the last dimension now represents di-nucleotides
   dna.shape  # is (100, 199, 1, 16)

   dna.conditions
   # ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT']

   dna[0][0, :5, 0, :]

   #array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
   #       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
   #       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
   #       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
   #       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int8)

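
The mapping from a k-mer to one of the :code:`4**order` channels can be thought of as a base-4 positional code, with each nucleotide acting as one digit. A sketch of that idea (our own illustration, not Janggu's internal code):

```python
from itertools import product

ALPHABET = 'ACGT'

def kmer_index(kmer):
    """Map a k-mer to its channel index, treating each base as a base-4 digit."""
    idx = 0
    for char in kmer:
        idx = idx * 4 + ALPHABET.index(char)
    return idx

# enumerating all di-nucleotides in this order reproduces the 16 channels
channels = [''.join(p) for p in product(ALPHABET, repeat=2)]
print(channels[:4])      # ['AA', 'AC', 'AG', 'AT']
print(kmer_index('AT'))  # 3
print(kmer_index('TT'))  # 15
```

Note also that a sequence of length 200 yields 199 overlapping di-nucleotides, which matches the second dimension above.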

Cover
^^^^^
:class:`Cover` can be utilized to fetch different kinds of
coverage data from commonly used data formats,
including BAM, BIGWIG, BED and GFF.
Coverage data is stored as a 4D array with dimensions corresponding
to :code:`(region, region_length, strand, condition)`.

The following examples illustrate some use cases for :class:`Cover`,
including loading and normalizing coverage data.
Additional features are described in :ref:`reference-label`.

**Loading read count coverage from BAM files** is supported for
single-end and paired-end alignments. In the single-end case,
reads are counted at their 5' ends, and for paired-end alignments,
reads are optionally counted at the mid-points or the 5' ends of the first mate.
The following example illustrates how to extract base-pair resolution coverage
with and without strandedness.

.. code:: python

   from janggu.data import Cover

   bam_file = resource_filename('janggu',
                                'resources/sample.bam')

   roi = resource_filename('janggu',
                           'resources/sample.bed')

   cover = Cover.create_from_bam('read_count_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 stepsize=200,
                                 roi=roi)

   cover.shape  # is (100, 200, 2, 1)
   cover[0]  # coverage of the first region

   # Coverage regardless of read strandedness
   # sums reads from both strands.
   cover = Cover.create_from_bam('read_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 stepsize=200,
                                 stranded=False,
                                 roi=roi)

   cover.shape  # is (100, 200, 1, 1)

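
Under the hood, 5'-end counting amounts to incrementing the position that contains each read start. A simplified numpy sketch of that accumulation, using hypothetical read positions (not Janggu internals):

```python
import numpy as np

# hypothetical 5'-end positions of reads within a 200 bp region
read_starts = np.array([3, 3, 17, 102, 102, 102, 199])

coverage = np.zeros(200, dtype=int)
np.add.at(coverage, read_starts, 1)  # unbuffered increment, once per read

print(coverage[3], coverage[102], coverage[0])  # 2 3 0
```

`np.add.at` is used instead of `coverage[read_starts] += 1` because it correctly accumulates repeated indices.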
Sometimes it is desirable to determine the read
count coverage in, say, 50 bp bins, which can be
controlled by the :code:`resolution` argument.
Consequently, note that the second dimension amounts
to length 4 when using `binsize=200` and `resolution=50` in the following example

.. code:: python

   # example with resolution=50 bp
   cover = Cover.create_from_bam('read_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 resolution=50,
                                 roi=roi)

   cover.shape  # is (100, 4, 2, 1)

It might be desirable to aggregate reads across the entire interval
rather than binning the genome into equally sized bins of
length :code:`resolution`. An example application would
be counting reads per possibly variable-sized region (e.g. genes).
This can be achieved with :code:`resolution=None`, which results
in the second dimension being collapsed to a length of one.

.. code:: python

   # example with resolution=None
   cover = Cover.create_from_bam('read_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 resolution=None,
                                 roi=roi)

   cover.shape  # is (100, 1, 2, 1)

Similarly, if strandedness is not relevant, we may use

.. code:: python

   # example with resolution=None and without strandedness
   cover = Cover.create_from_bam('read_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 resolution=None,
                                 stranded=False,
                                 roi=roi)

   cover.shape  # is (100, 1, 1, 1)

Finally, it is possible to normalize the coverage profile, e.g.
to account for differences in sequencing depth across experiments,
using the :code:`normalizer` argument

.. code:: python

   # example with TPM normalization
   cover = Cover.create_from_bam('read_coverage',
                                 bamfiles=bam_file,
                                 binsize=200,
                                 resolution=None,
                                 stranded=False,
                                 normalizer='tpm',
                                 roi=roi)

   cover.shape  # is (100, 1, 1, 1)

More details on alternative normalization
options are discussed in :ref:`storage`.

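
As a reminder of what TPM-style normalization computes, here is a minimal numpy sketch of the generic formula (length-normalize first, then scale the library to one million), not Janggu's exact implementation:

```python
import numpy as np

def tpm(counts, lengths_bp):
    """Tags per million: reads per kilobase, rescaled to sum to 1e6."""
    rate = counts / (lengths_bp / 1000.0)  # reads per kilobase
    return rate / rate.sum() * 1e6

# hypothetical counts over three regions of different sizes
counts = np.array([100.0, 300.0, 600.0])
lengths = np.array([1000.0, 1000.0, 2000.0])
print(tpm(counts, lengths))
```

Because the values sum to one million per sample, TPM values are comparable across experiments with different sequencing depths.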
**Loading signal coverage from BIGWIG files**
can be achieved analogously:

.. code-block:: python

   roi = resource_filename('janggu',
                           'resources/sample.bed')
   bw_file = resource_filename('janggu',
                               'resources/sample.bw')

   cover = Cover.create_from_bigwig('bigwig_coverage',
                                    bigwigfiles=bw_file,
                                    roi=roi,
                                    binsize=200,
                                    stepsize=200)

   cover.shape  # is (100, 200, 1, 1)

When applying signal aggregation using e.g. :code:`resolution=50` or :code:`resolution=None`,
the aggregation method can additionally be specified via
the :code:`collapser` argument.
For example, in order to represent each resolution-sized
bin by its mean signal, the following snippet may be used:

.. code-block:: python

   cover = Cover.create_from_bigwig('bigwig_coverage',
                                    bigwigfiles=bw_file,
                                    roi=roi,
                                    binsize=200,
                                    resolution=None,
                                    collapser='mean')

   cover.shape  # is (100, 1, 1, 1)

More details on alternative collapser
options are discussed in :ref:`storage`.

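
Collapsing base-pair signal into resolution-sized bins is essentially a reshape-and-aggregate operation. A plain numpy sketch of that step (illustrative only, with a synthetic signal):

```python
import numpy as np

binsize, resolution = 200, 50

# hypothetical base-pair signal for one region
signal = np.arange(binsize, dtype=float)

# group into resolution-sized bins and collapse each bin by its mean
binned = signal.reshape(binsize // resolution, resolution).mean(axis=1)
print(binned.shape)  # (4,)
```

Swapping `mean` for `max` or `sum` corresponds to choosing a different collapser.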
**Loading coverage from BED files** is largely analogous to extracting coverage
information from BAM or BIGWIG files, but in addition the :code:`mode` option
makes it possible to interpret BED-like files in various ways:

1. :code:`mode='binary'` Presence/absence mode interprets the ROI as the union of positive and negative cases in a binary classification setting and regions contained in :code:`bedfiles` as positive examples.
2. :code:`mode='score'` reads out the real-valued score field value from the associated regions.
3. :code:`mode='score_category'` transforms integer-valued scores into a categorical one-hot representation.
4. :code:`mode='name_category'` transforms the name field into a categorical one-hot representation.
5. :code:`mode='bedgraph'` reads the score from a file in bedgraph format.

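
The binary mode boils down to an interval-overlap test between each ROI bin and the positive regions. A toy sketch of that labeling logic (hypothetical coordinates, not Janggu code):

```python
def overlaps(iv_a, iv_b):
    """Overlap test for half-open (start, end) intervals."""
    return iv_a[0] < iv_b[1] and iv_b[0] < iv_a[1]

# hypothetical ROI bins and positive regions from `bedfiles`
roi_bins = [(0, 200), (200, 400), (400, 600)]
peaks = [(250, 320)]

labels = [int(any(overlaps(b, p) for p in peaks)) for b in roi_bins]
print(labels)  # [0, 1, 0]
```

Each bin that intersects at least one positive region receives label 1; all remaining ROI bins serve as negatives.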
Examples of loading data from a BED file are shown below

.. code-block:: python

   roi = resource_filename('janggu',
                           'resources/sample.bed')
   score_file = resource_filename('janggu',
                                  'resources/scored_sample.bed')

   # binary mode (default)
   cover = Cover.create_from_bed('binary_coverage',
                                 bedfiles=score_file,
                                 roi=roi,
                                 binsize=200,
                                 stepsize=200,
                                 collapser='max',
                                 resolution=None)

   cover.shape
   # (100, 1, 1, 1)

   cover[4]
   # array([[[[1.]]]])

   # score mode
   cover = Cover.create_from_bed('score_coverage',
                                 bedfiles=score_file,
                                 roi=roi,
                                 binsize=200,
                                 stepsize=200,
                                 resolution=None,
                                 collapser='max',
                                 mode='score')

   cover.shape
   # (100, 1, 1, 1)

   cover[4]
   # array([[[[5.]]]])


   # score_category (categorical) mode
   # interprets the integer-valued score as a class label,
   # which will then be one-hot encoded.
   cover = Cover.create_from_bed('cat_coverage',
                                 bedfiles=score_file,
                                 roi=roi,
                                 binsize=200,
                                 stepsize=200,
                                 resolution=None,
                                 collapser='max',
                                 mode='score_category')

   # there are 4 categories
   cover.shape
   # (100, 1, 1, 4)

   # category names
   cover.conditions
   # ['1', '2', '4', '5']

   cover[4]
   # array([[[[0., 0., 0., 1.]]]])

   # name_category mode
   # interprets the name field as a class label,
   # which will be one-hot encoded.
   cover = Cover.create_from_bed('cat_coverage',
                                 bedfiles=score_file,
                                 roi=roi,
                                 binsize=200,
                                 stepsize=200,
                                 resolution=None,
                                 collapser='max',
                                 mode='name_category')

   cover.shape
   # (100, 1, 1, 2)

   # category names
   cover.conditions
   # ['state1', 'state2']

   cover[4]
   # array([[[[0., 1.]]]])

   # bedgraph-format mode
   bedgraph_file = resource_filename('janggu',
                                     'resources/sample.bedgraph')

   cover = Cover.create_from_bed('bedgraph_coverage',
                                 bedfiles=bedgraph_file,
                                 roi=roi,
                                 binsize=200,
                                 stepsize=200,
                                 resolution=None,
                                 collapser='max',
                                 mode='bedgraph')

   cover.shape
   # (100, 1, 1, 1)

   cover[4]
   # array([[[[0.5]]]])

Dataset wrappers
^^^^^^^^^^^^^^^^^

In addition to the core datasets :code:`Bioseq` and :code:`Cover`, Janggu offers convenience wrappers
that transform them in various ways.
For instance, :code:`ReduceDim` can be used to convert a 4D coverage dataset into a 2D table-like object.
That is, it may be used to transform the dimensions
:code:`(region, region_length, strand, condition)` to :code:`(region, condition)` by
aggregating over the middle two dimensions.

.. code:: python

   from janggu.data import ReduceDim

   cover.shape
   # (100, 1, 1, 1)

   data = ReduceDim(cover, aggregator='sum')

   data.shape
   # (100, 1)

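
Conceptually, the aggregation corresponds to summing out the length and strand axes of the 4D array. A numpy sketch of the equivalent operation (not the wrapper's actual code):

```python
import numpy as np

# (region, region_length, strand, condition)
arr = np.random.rand(100, 1, 1, 1)

# aggregate over the length and strand axes
reduced = arr.sum(axis=(1, 2))
print(reduced.shape)  # (100, 1)
```

The `aggregator` argument corresponds to the reduction used here; e.g. `'mean'` would replace the sum.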
Other dataset wrappers can be used to perform data augmentation, including
:code:`RandomSignalScale` and :code:`RandomOrientation`, which can be used
to randomly alter the signal intensity during model fitting and to randomly flip
the 5' to 3' orientation of the coverage signal.

For more specialized cases, these wrappers may also be a good starting point
to derive from or adapt.

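
A rough numpy sketch of both augmentation ideas, applied to a synthetic batch (our own illustration of the concept, not the wrappers' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic batch: (region, region_length, strand, condition)
batch = rng.random((8, 200, 2, 1))

# random orientation: reverse the length axis and swap the strands
if rng.random() < 0.5:
    batch = batch[:, ::-1, ::-1, :]

# random signal scale: jitter the intensity of each region slightly
batch = batch * (1 + rng.normal(0, 0.1, size=(8, 1, 1, 1)))

print(batch.shape)  # (8, 200, 2, 1)
```

Both transformations preserve the array shape, so the augmented batch can be fed to the model unchanged.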

Using the Genomic Datasets with keras or sklearn
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The above-mentioned datasets :code:`Bioseq` and :code:`Cover`
are directly compatible with keras and sklearn models. An illustration of a
simple convolutional neural network with keras is shown in
`keras cnn example`_.
Moreover, an example of a logistic regression model from sklearn used with Janggu
is shown in
`sklearn example`_.


Converting a Numpy array to :code:`Cover`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After having trained a model and performed predictions with it, the data
is represented as a numpy array. A convenient way to reassociate the
predictions with the genomic coordinates they correspond to is
:code:`create_from_array`.

.. code:: python

   import numpy as np

   # True labels may be obtained from a BIGWIG file
   cover = Cover.create_from_bigwig('cov',
                                    bigwigfiles=bw_file,
                                    roi=roi,
                                    binsize=200,
                                    resolution=50)

   # Let's pretend to have derived predictions from a NN
   # of the same shape
   predictions = np.random.randn(*cover.shape)*.1 + cover[:]

   # We can reassociate the predictions with the genomic coordinates
   # of a :code:`GenomicIndexer` (in this case, cover.gindexer).
   predictions = Cover.create_from_array('predictions',
                                         predictions, cover.gindexer)

Exporting and visualizing :code:`Cover`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After having converted the predictions or feature activities of a neural
network to a :code:`Cover` object, it is possible to export the results
in BIGWIG format for further investigation in a genome browser of your choice

.. code:: python

   # writes the predictions to the specified folder
   predictions.export_to_bigwig(output_dir='./')

which should result in a file 'predictions.Cond_0.bigwig'.


Furthermore, it is possible to visualize the tracks interactively

.. code:: python

   from janggu.data import LineTrack
   from janggu.data import plotGenomeTrack

   fig = plotGenomeTrack([LineTrack(cover), LineTrack(predictions)],
                         'chr1', 16000, 18000)
   fig.savefig('coverage.png')


.. image:: coverage.png
   :width: 70%
   :alt: Coverage tracks
   :align: center


Part II) Building a neural network with Janggu
-----------------------------------------------

While the genomic datasets may be used directly with keras,
this part of the tutorial discusses the :class:`Janggu` wrapper class
for a keras model.
It offers the following features:

1. Building models using automatic input and output layer shape inference
2. Built-in logging functionality
3. Automatic evaluation through the attachment of Scorer callbacks

A list of examples can be found in the :ref:`Table <notebook_tutorials>` at the beginning.

.. sidebar:: Datasets are named

   Dataset names must match the Input and Output layers of the neural
   network.

A neural network can be created by
instantiating a :class:`Janggu` object.
There are two ways of achieving this:

1. Similar to `keras.models.Model`, a :class:`Janggu` object can be created from a set of native keras Input and Output layers.
2. Janggu offers the `Janggu.create` constructor method, which helps to reduce redundant code when defining many rather similar models.


Example 1: Instantiate Janggu similar to keras.models.Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. sidebar:: **Model name**

   Model results,
   e.g. trained parameters, are automatically stored with the associated
   model name. To simplify the determination of a unique name for the model,
   Janggu automatically derives the model name from an md5-hash of the
   network configuration. However, you can also specify a name yourself.

.. code-block:: python

  from keras.layers import Input
  from keras.layers import Dense

  from janggu import Janggu

  # Define neural network layers using keras
  in_ = Input(shape=(10,), name='ip')
  layer = Dense(3)(in_)
  output = Dense(1, activation='sigmoid',
                 name='out')(layer)

  # Instantiate the model
  model = Janggu(inputs=in_, outputs=output)
  model.summary()

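
The idea of deriving a stable model name from a hash of the configuration can be sketched with the standard library. This is a simplified illustration of the concept, not Janggu's exact naming scheme, and the config dict is hypothetical:

```python
import hashlib
import json

# hypothetical serialized network configuration
config = {'layers': [['Dense', 3], ['Dense', 1]],
          'inputs': 'ip', 'outputs': 'out'}

# a deterministic identifier derived from the md5 digest of the config
name = hashlib.md5(json.dumps(config, sort_keys=True).encode()).hexdigest()
print(name[:8])  # short, reproducible model identifier
```

Because the digest depends only on the configuration, retraining the same architecture reuses the same name and hence the same result files.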

Example 2: Specify a model using a model template function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As an alternative to the variant above, it is also possible to specify
a network via a python function, as in the following example

.. code-block:: python

   def model_template(inputs, inp, oup, params):
       inputs = Input(shape=(10,), name='ip')
       layer = Dense(params)(inputs)
       output = Dense(1, activation='sigmoid',
                      name='out')(layer)
       return inputs, output

   # Defines the same model by invoking the template function
   # via the create constructor.
   model = Janggu.create(template=model_template,
                         modelparams=3)

The model template function must adhere to the
signature :code:`template(inputs, inp, oup, params)`.
Notice that :code:`modelparams=3` gets passed on to :code:`params`
upon model creation. This allows one to parametrize the network
and reduces code redundancy.

From the model template it is also possible to obtain
a keras model directly, rather than the Janggu model wrapper, if this is preferred

.. code-block:: python

   from janggu import create_model

   # This will construct a keras model directly
   model = create_model(template=model_template,
                        modelparams=3)

Example 3: Automatic Input and Output layer extension
773
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
774
A second benefit to invoke :code:`Janggu.create` is that it can automatically
775
determine and append appropriate Input and Output layers to the network.
776
This means, only the network body remains to be defined.
777
778
.. code-block:: python
779
780
    import numpy as np
781
    from janggu import inputlayer, outputdense
782
    from janggu.data import Array
783
784
    # Some random data
785
    DATA = Array('ip', np.random.random((1000, 10)))
786
    LABELS = Array('out', np.random.randint(2, size=(1000, 1)))
787
788
    # inputlayer and outputdense automatically
789
    # extract dataset shapes and extend the
790
    # Input and Output layers appropriately.
791
    # That is, only the model body needs to be specified.
792
    @inputlayer
793
    @outputdense('sigmoid')
794
    def model_body_template(inputs, inp, oup, params):
795
        with inputs.use('ip') as layer:
796
            # the with block allows
797
            # for easy access of a specific named input.
798
            output = Dense(params)(layer)
799
        return inputs, output
800
801
    # create the model.
802
    model = Janggu.create(template=model_body_template,
803
                          modelparams=3,
804
                          inputs=DATA, outputs=LABELS)
805
    model.summary()
806
807
As is illustrated by the example, automatic Input and Output layer determination
808
can be achieved by using the decorators :code:`inputlayer` and/or
809
:code:`outputdense` which extract the layer dimensions from the
810
provided input and output Datasets in the create constructor.
811
812
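The decorator mechanism itself is plain Python: a decorator wraps the template
function and post-processes its return value. The following toy sketch
illustrates the pattern (:code:`add_output` and :code:`body` are made-up names,
not the janggu implementation):

.. code-block:: python

    import functools

    def add_output(activation):
        # Toy decorator in the spirit of outputdense: it wraps a
        # model-body function and post-processes the returned output.
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                inputs, output = func(*args, **kwargs)
                # janggu would append a Dense output layer here; the
                # toy version merely tags the output with the activation.
                return inputs, (output, activation)
            return wrapper
        return decorator

    @add_output('sigmoid')
    def body(inputs):
        # stand-in for the network body
        return inputs, 'body-output'

    print(body('ip'))  # ('ip', ('body-output', 'sigmoid'))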
Fit a neural network on DNA sequences
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the previous sections, we learned how to acquire data and
how to instantiate neural networks. Now let's
create and fit a simple convolutional neural network that learns
to discriminate between two classes of sequences. In the following example
the sample sequences are of length 200 bp each. `sample.fa` contains Oct4 ChIP-seq
peaks and `sample2.fa` contains Mafk ChIP-seq peaks. We shall use a simple
convolutional neural network with 30 filters of length 15 bp to learn
the sequence features that discriminate the two sets of sequences.

The example makes use of two more janggu utilities: First,
:code:`DnaConv2D` is a keras layer wrapper that facilitates scanning
both DNA strands with the same kernels. That is, it simultaneously applies
a convolution and a cross-correlation and aggregates the resulting activities.
Second, the example illustrates the dataset wrapper :code:`ReduceDim`, which
collapses the 4D signal contained in the Cover object
across the sequence length and strand dimensions. The result is a 2D
table-like dataset, which is used in the subsequent model fitting example.
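Conceptually, :code:`ReduceDim` with :code:`aggregator='mean'` averages the
coverage signal over the length and strand dimensions. A pure-Python sketch of
that reduction (toy code operating on nested lists, not the janggu
implementation):

.. code-block:: python

   def reduce_mean(signal):
       # signal: nested lists shaped [region][position][strand][condition];
       # returns a 2D table shaped [region][condition] by averaging the
       # values across the position and strand dimensions.
       reduced = []
       for region in signal:
           n_conditions = len(region[0][0])
           totals = [0.0] * n_conditions
           count = 0
           for position in region:
               for strand in position:
                   for cond, value in enumerate(strand):
                       totals[cond] += value
                   count += 1
           reduced.append([total / count for total in totals])
       return reduced

   # one region, two positions, two strands, one condition
   print(reduce_mean([[[[1.0], [3.0]], [[5.0], [7.0]]]]))  # [[4.0]]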

.. code:: python

   from pkg_resources import resource_filename

   from keras.layers import Conv2D
   from keras.layers import GlobalAveragePooling2D

   from janggu import Janggu
   from janggu import inputlayer
   from janggu import outputdense
   from janggu import DnaConv2D
   from janggu.data import Bioseq
   from janggu.data import Cover
   from janggu.data import ReduceDim


   # load the dataset which consists of
   # 1) a reference genome
   REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa')
   # 2) ROI contains regions spanning positive and negative examples
   ROI_FILE = resource_filename('janggu', 'resources/roi_train.bed')
   # 3) PEAK_FILE only contains positive examples
   PEAK_FILE = resource_filename('janggu', 'resources/scores.bed')

   # DNA sequences are loaded directly from the reference genome
   DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,
                                      roi=ROI_FILE,
                                      binsize=200)

   # Classification labels over the same regions are loaded
   # into the Coverage dataset.
   # It is important that both DNA and LABELS are loaded with the same
   # binsize and stepsize to ensure
   # the correct correspondence between both datasets.
   # Finally, the ReduceDim dataset wrapper transforms the 4D Coverage
   # object into a 2D table-like object (regions by conditions).
   LABELS = ReduceDim(Cover.create_from_bed('peaks', roi=ROI_FILE,
                                  bedfiles=PEAK_FILE,
                                  binsize=200,
                                  resolution=None), aggregator='mean')


   # 2. define a simple conv net with 30 filters of length 15 bp
   # and relu activation.
   @inputlayer
   @outputdense('sigmoid')
   def double_stranded_model(inputs, inp, oup, params):
       with inputs.use('dna') as layer:
           # The DnaConv2D wrapper can be used with Conv2D
           # to scan both DNA strands with the weight matrices.
           layer = DnaConv2D(Conv2D(params[0], (params[1], 1),
                                    activation=params[2]))(layer)

       output = GlobalAveragePooling2D(name='motif')(layer)
       return inputs, output


   # 3. instantiate and compile the model
   model = Janggu.create(template=double_stranded_model,
                         modelparams=(30, 15, 'relu'),
                         inputs=DNA, outputs=LABELS)
   model.compile(optimizer='adadelta', loss='binary_crossentropy',
                 metrics=['acc'])

   # 4. fit the model
   model.fit(DNA, LABELS, epochs=100)

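The double-stranded scanning performed by :code:`DnaConv2D` is equivalent to
applying each filter to the sequence and to its reverse complement. For
reference, reverse complementing can be sketched as follows (a generic helper,
not part of the example above):

.. code:: python

   COMPLEMENT = str.maketrans('ACGT', 'TGCA')

   def reverse_complement(seq):
       # complement each base, then reverse the sequence
       return seq.translate(COMPLEMENT)[::-1]

   print(reverse_complement('AACGTT'))  # 'AACGTT' is its own reverse complement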

An illustration of the network architecture is depicted below.
Upon creation of the model, such a depiction is
automatically produced in :code:`<results_root>/models`.

.. image:: dna_peak.png
   :width: 70%
   :alt: Prediction from DNA to peaks
   :align: center

After the model has been trained, the model parameters and the
illustration of the architecture are stored in :code:`<results_root>/models`.
Furthermore, information about the model fitting, and the model and dataset
dimensions, is written to :code:`<results_root>/logs`.

Note that in the example above the output dimensionality of the network is 4D.
However, it might at times be more convenient to remove the single-dimensional
elements of the array.
This can be achieved by wrapping the LABELS dataset using :code:`ReduceDim`.
In this case the example becomes

.. code:: python

   @inputlayer
   @outputdense('sigmoid')
   def double_stranded_model(inputs, inp, oup, params):
       with inputs.use('dna') as layer:
           # The DnaConv2D wrapper can be used with Conv2D
           # to scan both DNA strands with the weight matrices.
           layer = DnaConv2D(Conv2D(params[0], (params[1], 1),
                                    activation=params[2]))(layer)

       output = GlobalAveragePooling2D(name='motif')(layer)
       return inputs, output


   # 3. instantiate and compile the model
   model = Janggu.create(template=double_stranded_model,
                         modelparams=(30, 15, 'relu'),
                         inputs=DNA, outputs=LABELS)
   model.compile(optimizer='adadelta', loss='binary_crossentropy',
                 metrics=['acc'])

   # 4. fit the model
   model.fit(DNA, LABELS, epochs=100)

Part III) Evaluation and interpretation of the model
-----------------------------------------------------

Janggu supports various methods to evaluate and interpret a trained model,
including evaluating summary scores, inspecting the results in
the built-in genome browser (see Part I), evaluating integrated gradients,
which allow visualizing input feature importance, and
support for variant effect prediction.
In this last part we will illustrate these aspects.

Evaluation of summary scores
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After the model has been trained, the quality of the predictions
is usually summarized by its agreement with the ground truth, e.g. by
evaluating the area under the ROC curve in a binary classification application
or by computing the correlation between predictions and targets in a regression
setting.

For some commonly used evaluation criteria, the evaluate method can directly
determine and export the given metric results.
For example, for a classification task the following line
evaluates the ROC and PRC and exports a figure and a tsv file, respectively,
for each measure.

.. code-block:: python

   model.evaluate(DNA_TEST, LABELS_TEST, callbacks=['roc', 'prc', 'auprc', 'auroc'])

The results are stored in :code:`<results_root>/evaluation/{roc,prc}.png`
as well as :code:`<results_root>/evaluation/{auroc,auprc}.tsv`.

Furthermore, for a regression setting it is possible to invoke

.. code-block:: python

   model.evaluate(DNA_TEST, LABELS_TEST, callbacks=['cor', 'mae', 'mse', 'var_explained'])

which evaluates Pearson's correlation, the mean absolute error, the mean
squared error and the explained variance, and exports them to tsv files.
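For reference, the Pearson correlation reported by the :code:`'cor'` callback
follows the standard formula; a plain-Python sketch (not janggu's internal
code):

.. code-block:: python

   from math import sqrt

   def pearson(x, y):
       # Pearson correlation: covariance divided by the
       # product of the standard deviations.
       n = len(x)
       mean_x, mean_y = sum(x) / n, sum(y) / n
       cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
       sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
       sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
       return cov / (sd_x * sd_y)

   print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0 for perfectly correlated data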


It is also possible to customize the scoring callbacks by instantiating
:code:`Scorer` objects, which can be passed to
:code:`model.evaluate` and :code:`model.predict`. Further details about
customizing the scoring callbacks are given in :doc:`custom_scorer`.

Input feature importance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to inspect what the model has learned,
it is possible to identify the most important features in the input space
using the integrated gradients method.

This is illustrated on a toy example for discriminating Oct4 and Mafk binding sites (see
`variant effect prediction`_).
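To make the method concrete: integrated gradients attribute a prediction to its
input features by accumulating gradients along a straight path from a baseline
to the input. A minimal model-agnostic sketch using numeric gradients (toy code;
janggu ships its own implementation for keras models):

.. code-block:: python

   def integrated_gradients(f, x, baseline, steps=100):
       # Accumulate central-difference gradients of the scalar function f
       # at midpoints along the straight path from baseline to x, then
       # scale by (x - baseline) per feature.
       eps = 1e-5
       n = len(x)
       grad_sum = [0.0] * n
       for s in range(steps):
           alpha = (s + 0.5) / steps  # midpoint rule
           point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
           for i in range(n):
               plus, minus = list(point), list(point)
               plus[i] += eps
               minus[i] -= eps
               grad_sum[i] += (f(plus) - f(minus)) / (2 * eps)
       return [(xi - b) * g / steps for xi, b, g in zip(x, baseline, grad_sum)]

   # for f(x) = x0**2 the attributions sum to f(x) - f(baseline) = 4
   print(integrated_gradients(lambda v: v[0] ** 2, [2.0], [0.0]))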

Variant effect prediction
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The effect of single nucleotide variants on the network
output can be tested via :code:`Janggu.predict_variant_effect`,
based on a Bioseq object and single nucleotide variants in VCF format.
This method evaluates the network for each variant (using its sequence context)
as well as the respective reference sequence.
As a result, an HDF5 file and a BED file are produced which
contain the network predictions for each variant and the associated genomic
loci.
An illustration of variant effect prediction is given in the notebook (see
`variant effect prediction`_).
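Conceptually, variant effect prediction compares the model output on the
reference sequence context with the output on the same context carrying the
alternative allele. A toy sketch of this ref-vs-alt comparison (hypothetical
helper names, not the :code:`predict_variant_effect` API):

.. code-block:: python

   def apply_variant(sequence, position, alt_base):
       # substitute the alternative allele at the given 0-based position
       return sequence[:position] + alt_base + sequence[position + 1:]

   def variant_effect(predict, sequence, position, alt_base):
       # difference between the score of the alternative and
       # the reference sequence context
       return predict(apply_variant(sequence, position, alt_base)) - predict(sequence)

   # toy "model": GC content of the sequence
   gc_score = lambda seq: (seq.count('G') + seq.count('C')) / len(seq)
   print(variant_effect(gc_score, 'AATA', 2, 'G'))  # 0.25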

Browse through the results
^^^^^^^^^^^^^^^^^^^^^^^^^^
Finally, after you have fitted and evaluated your model,
you can browse through the results in a web browser of your choice.

To this end, first start the web application server

::

   janggu -path <results-root>

Then you can inspect the outputs in a browser of your choice
(default: localhost:8050)

.. image:: janggu_example.png
   :width: 70%
   :alt: Prediction from DNA to peaks
   :align: center