--- a +++ b/docs/tutorial.rst @@ -0,0 +1,1031 @@ +========= +Tutorial +========= + +This tutorial is split in three parts. + +Part I treats the Genomic Dataset +that are available through Janggu which can be directly consumed +by your keras model. +The tutorial illustates how to access genomics +data from various widely used file formats, including FASTA, BAM, BIGWIG, BED +and GFF for the purpose of using them as input to a deep learning application. +It illustrates a range of parameters to adapt the read out of genomics data +and it shows how the predictions or feature activities of a neural network +in numpy format can be converted to a genomic coverage representation +that can in turn be exported to BIGWIG file format +or visualized directly via a genome browser-like plot. + +Part II treats utilities for defining a neural networks based on keras. + +Part III illustrates Janggu's evaluation utilities. + +Complementary to this tutorial, the janggu repository contains +a number of jupyter notebooks that illustrate for example with keras or +sklearn: + +.. _notebook_tutorials: + ++----------------------------------------------------+ +| Example notebooks | ++====================================================+ +| `keras cnn example`_ | ++----------------------------------------------------+ +| `keras cnn example with JangguSequence`_ | ++----------------------------------------------------+ +| `sklearn example`_ | ++----------------------------------------------------+ +| `pytorch example`_ | ++----------------------------------------------------+ +| `janggu example I`_ | ++----------------------------------------------------+ +| `janggu example II`_ | ++----------------------------------------------------+ +| `reusing datasets with view`_ | ++----------------------------------------------------+ +| `hyperparameter optimization`_ | ++----------------------------------------------------+ +| `randomizing HDF5 data`_ | ++----------------------------------------------------+ +| `variant effect prediction`_ | ++----------------------------------------------------+ +| `variant effect prediction - part II`_ | ++----------------------------------------------------+ +| `plotting genome coverage`_ | ++----------------------------------------------------+ + +.. _`keras cnn example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/keras_convnet_example.ipynb +.. _`keras cnn example with JangguSequence`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/keras_convnet_example_w_sequence.ipynb +.. _`sklearn example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/sklearn_example_with_kmers.ipynb +.. _`pytorch example`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/pytorch_convnet_example.ipynb +.. _`janggu example I`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/classify_nucleotide_sequences.ipynb +.. _`janggu example II`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples.ipynb +.. _`reusing datasets with view`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb +.. _`randomizing HDF5 data`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/janggu_convnet_examples_with_hdf5.ipynb +.. _`variant effect prediction`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/variant_effect_prediction.ipynb +.. _`variant effect prediction - part II`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/variant_effect_prediction-part2.ipynb +.. _`plotting genome coverage`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/plot_coverage.ipynb +.. _`hyperparameter optimization`: https://nbviewer.jupyter.org/github/BIMSBbioinfo/janggu/blob/master/src/examples/hyperparameter_optimization_w_janggu.ipynb + + +Furthermore, use cases +for predicting JunD binding, training and adapting published genomics models +as well as a regression model example are demonstrated in +the supplementary repository: `Janggu use cases`_ + + +.. _`Janggu use cases`: https://github.com/wkopp/janggu_usecases + + + +Part I) Introduction to Genomic Datasets +----------------------------------------- + +.. sidebar:: Genomic Datasets + + Most of the parameters are consistent across + :class:`Bioseq` and :class:`Cover`. + +:mod:`janggu.data` provides Dataset classes +that can be used for +training and evaluating neural networks. +Of particular importance are the Genomics-specific dataset, +:class:`Bioseq` and :class:`Cover` which +allow easy access to genomics data, +including DNA sequences or coverage information. +Apart from accessing raw genomics data, :code:`Janggu` +also facilitates a method for converting an ordinary +numpy array (e.g. predictions obtained from a neural net) +to a :code:`Cover` object. This enables the user to export the predictions +as BIGWIG format or interactively plot genome browser tracks. +In this tutorial, we demonstrate some of the key functionality of +Janggu. Further details are available in :ref:`storage` +and :ref:`reference-label`. + +Bioseq +^^^^^^^^^^ +The :class:`Bioseq` can be used to load nucleotide +or protein sequence data from +fasta files or from a reference genome +along with a set of genomic coordinates defining the region of interest (ROI). +The class facilitates access the +*one-hot encoding* representation of the sequences. +Specifically, +the *one-hot encoding* is represented as a +4D array with dimensions corresponding +to :code:`(region, region_length, 1, alphabet_size)`. +The Bioseq offers a number of features: + +1. Strand-specific sequence extraction (if DNA sequences are extracted from the reference genome) +2. Higher-order one-hot encoding, e.g. di-nucleotide based + +Sequences can be loaded in two ways: using +:code:`Bioseq.create_from_seq` or +:code:`Bioseq.create_from_refgenome`. +The former constructor method can be used to load +DNA or protein sequences from fasta files directly +or from as list of :code:`Bio.SeqRecord.SeqRecord` entries. +An example is shown below: + +.. code-block:: python + + from pkg_resources import resource_filename + from janggu.data import Bioseq + + fasta_file = resource_filename('janggu', + 'resources/sample.fa') + + dna = Bioseq.create_from_seq(name='dna', + fastafile=fasta_file) + + # there are 3897 sequences in the in sample.fa + len(dna) + + # Each sequence is 200 bp of length + dna.shape # is (3897, 200, 1, 4) + + # One-hot encoding for the first 10 bases of the first region + dna[0][0, :10, 0, :] + + #array([[0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0], + # [0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0]], dtype=int8) + +Furthermore, it is possible to trim variable sequence length using +the :code:`fixedlen` option. If specfied, all sequences will be truncated +or zero-padded to length `fixedlen`. For example, + +.. code-block:: python + + dna = Bioseq.create_from_seq(name='dna', + fastafile=fasta_file, + fixedlen=205) + + # Each sequence is 205 bp of length + dna.shape # is (3897, 205, 1, 4) + + # the last 5 position were zero padded + dna[0][0, -6:, 0, :] + + #array([[1, 0, 0, 0], + # [0, 0, 0, 0], + # [0, 0, 0, 0], + # [0, 0, 0, 0], + # [0, 0, 0, 0], + # [0, 0, 0, 0]], dtype=int8) + + +Alternatively, nucleotide sequences can be +obtained from a reference genome directly along with +a BED or GFF file that indicates the region of interest (ROI). + +If each interval in the BED-file already corresponds +to a 'datapoint' that shall be consumed during training, like it +is the case for 'sample_equalsize.bed', the associated DNA sequences +can be loaded according to + +.. code-block:: python + + roi = resource_filename('janggu', + 'resources/sample_equalsize.bed') + refgenome = resource_filename('janggu', + 'resources/sample_genome.fa') + + dna = Bioseq.create_from_refgenome(name='dna', + refgenome=refgenome, + roi=roi) + + dna.shape # is (4, 200, 1, 4) + # One-hot encoding of the first 10 nucleotides in region 0 + dna[0][0, :10, 0, :] + + #array([[0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0], + # [0, 1, 0, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0], + # [1, 0, 0, 0], + # [0, 0, 1, 0]], dtype=int8) + + + +Sometimes it is more convenient to provide the ROI +as a set of variable-sized broad intervals +(e.g. chr1:10000-50000 and chr3:4000-8000) +which should be divided into sub-intervals +of equal length (e.g. of length 200 bp). +This can be achieved +by explicitly specifying a desired :code:`binsize` +and :code:`stepsize` as shown below: + +.. code-block:: python + + roi = resource_filename('janggu', + 'resources/sample.bed') + + # loading non-overlapping intervals + dna = Bioseq.create_from_refgenome(name='dna', + refgenome=refgenome, + roi=roi, + binsize=200, + stepsize=200) + + dna.shape # is (100, 200, 1, 4) + + # loading mutually overlapping intervals + dna = Bioseq.create_from_refgenome(name='dna', + refgenome=refgenome, + roi=roi, + binsize=200, + stepsize=50) + + dna.shape # is (394, 200, 1, 4) + + +The argument :code:`flank` can be used to extend +the intervals up and downstream by a given length + +.. code-block:: python + + dna = Bioseq.create_from_refgenome(name='dna', + refgenome=refgenome, + roi=roi, + binsize=200, + stepsize=200, + flank=100) + + dna.shape # is (100, 400, 1, 4) + + +Finally, sequences can be represented using **higher-order** +one-hot representation using the :code:`order` argument. An example +of a di-nucleotide-based one-hot representation is shown below + + +.. code-block:: python + + dna = Bioseq.create_from_refgenome(name='dna', + refgenome=refgenome, + roi=roi, + binsize=200, + stepsize=200, + order=2) + + # is (100, 199, 1, 16) + # that is the last dimension represents di-nucleotides + dna.shape + + dna.conditions + # ['AA', 'AC', 'AG', 'AT', 'CA', 'CC', 'CG', 'CT', 'GA', 'GC', 'GG', 'GT', 'TA', 'TC', 'TG', 'TT'] + + dna[0][0, :5, 0, :] + + #array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], + # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], + # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0], + # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0], + # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int8) + + + +Cover +^^^^^ +:class:`Cover` can be utilized to fetch different kinds of +coverage data from commonly used data formats, +including BAM, BIGWIG, BED and GFF. +Coverage data is stored as a 4D array with dimensions corresponding +to :code:`(region, region_length, strand, condition)`. + +The following examples illustrate some use cases for :class:`Cover`, +including loading, normalizing coverage data. +Additional features are described in the :ref:`reference-label`. + +**Loading read count coverage from BAM files** is supported for +single-end and paired-end alignments. For the single-end case +reads are counted on the 5'-end and and for paired-end alignments, +reads are optionally counted at the mid-points or 5' ends of the first mate. +The following example illustrate how to extract base-pair resolution coverage +with and without strandedness. + +.. code:: python + + from janggu.data import Cover + + bam_file = resource_filename('janggu', + 'resources/sample.bam') + + roi = resource_filename('janggu', + 'resources/sample.bed') + + cover = Cover.create_from_bam('read_count_coverage', + bamfiles=bam_file, + binsize=200, + stepsize=200, + roi=roi) + + cover.shape # is (100, 200, 2, 1) + cover[0] # coverage of the first region + + # Coverage regardless of read strandedness + # sums reads from both strand. + cover = Cover.create_from_bam('read_coverage', + bamfiles=bam_file, + binsize=200, + stepsize=200, + stranded=False, + roi=roi) + + cover.shape # is (100, 200, 1, 1) + + +Sometimes it is desirable to determine the read +count coverage in say 50 bp bins which can be +controlled by the :code:`resolution` argument. +Consequently, note that the second dimension amounts +to length 4 using `binsize=200` and `resolution=50` in the following example + +.. code:: python + + # example with resolution=200 bp + cover = Cover.create_from_bam('read_coverage', + bamfiles=bam_file, + binsize=200, + resolution=50, + roi=roi) + + cover.shape # is (100, 4, 2, 1) + + +It might be desired to aggregate reads across entire interval +rather than binning the genome to equally sized bins of +length :code:`resolution`. An example application for this would +be to count reads per possibly variable-size regions (e.g. genes). +This can be achived by :code:`resolution=None` which results +in the second dimension being collapsed to a length of one. + +.. code:: python + + # example with resolution=None + cover = Cover.create_from_bam('read_coverage', + bamfiles=bam_file, + binsize=200, + resolution=None, + roi=roi) + + cover.shape # is (100, 1, 2, 1) + +Similarly, if strandedness is not relevant we may use + +.. code:: python + + # example with resolution=None without strandedness + cover = Cover.create_from_bam('read_coverage', + bamfiles=bam_file, + binsize=200, + resolution=None, + stranded=False, + roi=roi) + + cover.shape # is (100, 1, 1, 1) + +Finally, it is possible to normalize the coverage profile, e.g. +to account for differences in sequencing depth across experiments +using the :code:`normalizer` argument + +.. code:: python + + # example with resolution=None without strandedness + cover = Cover.create_from_bam('read_coverage', + bamfiles=bam_file, + binsize=200, + resolution=None, + stranded=False, + normalizer='tpm', + roi=roi) + + cover.shape # is (100, 1, 1, 1) + +More details on alternative normalization +options are discussed in :ref:`storage`. + +**Loading signal coverage from BIGWIG files** +can be achieved analogously: + +.. code-block:: python + + roi = resource_filename('janggu', + 'resources/sample.bed') + bw_file = resource_filename('janggu', + 'resources/sample.bw') + + cover = Cover.create_from_bigwig('bigwig_coverage', + bigwigfiles=bw_file, + roi=roi, + binsize=200, + stepsize=200) + + cover.shape # is (100, 200, 1, 1) + + +When applying signal aggregation using e.g :code:`resolution=50` or :code:`resolution=None`, +additionally, the aggregation method can be specified using +the :code:`collapser` argument. +For example, in order to represent the resolution sized +bin by its mean signal the following snippet may be used: + +.. code-block:: python + + cover = Cover.create_from_bigwig('bigwig_coverage', + bigwigfiles=bw_file, + roi=roi, + binsize=200, + resolution=None, + collapser='mean') + + cover.shape # is (100, 1, 1, 1) + + +More details on alternative collapse +options are discussed in :ref:`storage`. + + +**Coverage from a BED files** is largely analogous to extracting coverage +information from BAM or BIGWIG files, but in addition the :code:`mode` option +enables to interpret BED-like files in various ways: + +1. :code:`mode='binary'` Presence/Absence mode interprets the ROI as the union of positive and negative cases in a binary classification setting and regions contained in :code:`bedfiles` as positive examples. +2. :code:`mode='score'` reads out the real-valued score field value from the associated regions. +3. :code:`mode='score_category'` transforms integer-valued scores into a categorical one-hot representation. +4. :code:`mode='name_category'` transforms the name field into a categorical one-hot representation. +5. :code:`mode='bedgraph'` reads in the score from a file in bedgraph format. + +Examples of loading data from a BED file are shown below + +.. code-block:: python + + roi = resource_filename('janggu', + 'resources/sample.bed') + score_file = resource_filename('janggu', + 'resources/scored_sample.bed') + + # binary mode (default) + cover = Cover.create_from_bed('binary_coverage', + bedfiles=score_file, + roi=roi, + binsize=200, + stepsize=200, + collapser='max', + resolution=None) + + cover.shape + # (100, 1, 1, 1) + + cover[4] + # array([[[[1.]]]]) + + # score mode + cover = Cover.create_from_bed('score_coverage', + bedfiles=score_file, + roi=roi, + binsize=200, + stepsize=200, + resolution=None, + collapser='max', + mode='score') + + cover.shape + # (100, 1, 1, 1) + + cover[4] + # array([[[[5.]]]]) + + + # scoreclass (or categorical) mode + # Interprets the integer-valued score as class-label, + # which will then be one-hot encoded. + cover = Cover.create_from_bed('cat_coverage', + bedfiles=score_file, + roi=roi, + binsize=200, + stepsize=200, + resolution=None, + collapser='max', + mode='score_category') + + # there are 4 categories + cover.shape + # (100, 1, 1, 4) + + # category names + cover.conditions + # ['1', '2', '4', '5'] + + cover[4] + # array([[[[0., 0., 0., 1.]]]]) + + # nameclass mode + # Interprets the name field as class-label. + # The class labels will be one hot encoded. + cover = Cover.create_from_bed('cat_coverage', + bedfiles=score_file, + roi=roi, + binsize=200, + stepsize=200, + resolution=None, + collapser='max', + mode='name_category') + + cover.shape + # (100, 1, 1, 2) + + # category names + cover.conditions + # ['state1', 'state2'] + + cover[4] + # array([[[[0., 1.]]]]) + + # bedgraph-format mode + bedgraph_file = resource_filename('janggu', + 'resources/sample.bedgraph') + + cover = Cover.create_from_bed('bedgraph_coverage', + bedfiles=bedgraph_file, + roi=roi, + binsize=200, + stepsize=200, + resolution=None, + collapser='max', + mode='bedgraph') + + cover.shape + # (100, 1, 1, 1) + + cover[4] + # array([[[[0.5]]]]) + +Dataset wrappers +^^^^^^^^^^^^^^^^^ + +In addition to the core datset :code:`Bioseq` and :code:`Cover`, Janggu offers convenience wrappers +to transform them in various ways. +For instance, :code:`ReduceDim` can be used to convert a 4D coverage dataset into 2D table like object. +That is it may be used to transform the dimensions +:code:`(region, region_length, strand, condition)` to :code:`(region, condition)` by +aggregating over the middle two dimensions. + +.. code:: python + + from janggu.data import ReduceDim + + cover.shape + # (100, 1, 1, 1) + + data = ReduceDim(cover, aggregator='sum') + + data.shape + # (100, 1) + +Other dataset wrappers can be used in order to perform data augmentation, including +:code:`RandomSignalScale` and :code:`RandomOrientation` which can be used +to randomly alter the signal intensity during model fitting and randomly flipping +the 5' to 3' orientations of the coverage signal. + +For more specialized cases, these wrappers might also be a good starting point +to derive or adapt from. + +Using the Genomic Datasets with keras or sklearn +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The above mentioned datasets :code:`Bioseq` and :code:`Cover` +are directly compatible with keras and sklearn models. An illustration of a +simple convolutional neural network with keras is shown in +`keras cnn example`_. +Moreover, an example of a logistic regression model from sklearn used with Janggu +is shown in +`sklearn example`_. + + +Converting a Numpy array to :code:`Cover` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +After having trained and performed predictions with a model, the data +is represented as numpy array. A convenient way to reassociate the +predictions with the genomic coordinates they correspond to is achieved +using :code:`create_from_array`. + +.. code:: python + + import numpy as np + + # True labels may be obtained from a BED file + cover = Cover.create_from_bigwig('cov', + bigwigfiles=bw_file, + roi=roi, + binsize=200, + resolution=50) + + + # Let's pretend to have derived predictions from a NN + # of the same shape + predictions = np.random.randn(*cover.shape)*.1 + cover[:] + + # We can reassociate the predictions with the genomic coordinates + # of a :code:`GenomicIndexer` (in this case, cover.gindexer). + predictions = Cover.create_from_array('predictions', + predictions, cover.gindexer) + +Exporting and visualizing :code:`Cover` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +After having converted the predictions or feature activities of a neural +network to a :code:`Cover` object, it is possible to export the results +as BIGWIG format for further investigation in a genome browser of your choice + +.. code:: python + + # writes the predictions to a specified folder + predictions.export_to_bigwig(output_dir = './') + + +which should result in a file 'predictions.Cond_0.bigwig'. + + +Furthermore, it is possible to visualize the tracks interactively + +.. code:: python + + from janggu.data import LineTrack + from janggu.data import plotGenomeTrack + + fig = plotGenomeTrack([LineTrack(cover), LineTrack(predictions)], 'chr1', 16000, 18000).figsave('coverage.png') + + +.. image:: coverage.png + :width: 70% + :alt: Coverage tracks + :align: center + + +Part II) Building a neural network with Janggu +----------------------------------------------- + +While the Genomic Dataset may be used directly with keras, +this part of the tutorial discusses the :class:`Janggu` wrapper class +for a keras model. +It offers the following features: + +1. Building models using automatic input and output layer shape inference +2. Built-in logging functionality +3. Automatic evaluation through the attachment of Scorer callbacks + +A list of examples can be found in the :ref:`Table <notebook_tutorials>` at the beginning. + +.. sidebar:: Datasets are named + + Dataset names must match with the Input and Output layers of the neural + network. + +A neural network can be created by +instantiating a :class:`Janggu` object. +There are two ways of achieving this: + +1. Similar as with `keras.models.Model`, a :class:`Janggu` object can be created from a set of native keras Input and Output layers, respectively. +2. Janggu offers a `Janggu.create` constructor method which helps to reduce redundant code when defining many rather similar models. + + +Example 1: Instantiate Janggu similar to keras.models.Model +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. sidebar:: **Model name** + + Model results, + e.g. trained parameters, are automatically stored with the associated model name. To simplify the determination of a unique name for the model, Janggu automatically derives the model name based on a md5-hash of the network configuration. However, you can also specify a name yourself. + + +.. code-block:: python + + from keras.layers import Input + from keras.layers import Dense + + from janggu import Janggu + + # Define neural network layers using keras + in_ = Input(shape=(10,), name='ip') + layer = Dense(3)(in_) + output = Dense(1, activation='sigmoid', + name='out')(layer) + + # Instantiate model name. + model = Janggu(inputs=in_, outputs=output) + model.summary() + + + +Example 2: Specify a model using a model template function +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +As an alternative to the above stated variant, it is also possible to specify +a network via a python function as in the following example + +.. code-block:: python + + def model_template(inputs, inp, oup, params): + inputs = Input(shape=(10,), name='ip') + layer = Dense(params)(inputs) + output = Dense(1, activation='sigmoid', + name='out')(layer) + return inputs, output + + # Defines the same model by invoking the definition function + # and the create constructor. + model = Janggu.create(template=model_template, + modelparams=3) + +The model template function must adhere to the +signature :code:`template(inputs, inp, oup, params)`. +Notice, that :code:`modelparams=3` gets passed on to :code:`params` +upon model creation. This allows to parametrize the network +and reduces code redundancy. + +From the model template it is also possible to obtain +a keras model directly, rather than the Janggu model wrapper if this is preferred + +.. code-block:: python + + from janggu import create_model + + # This will construct a keras model directly + model = create_model(template=model_template, + modelparams=3) + + +Example 3: Automatic Input and Output layer extension +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +A second benefit to invoke :code:`Janggu.create` is that it can automatically +determine and append appropriate Input and Output layers to the network. +This means, only the network body remains to be defined. + +.. code-block:: python + + import numpy as np + from janggu import inputlayer, outputdense + from janggu.data import Array + + # Some random data + DATA = Array('ip', np.random.random((1000, 10))) + LABELS = Array('out', np.random.randint(2, size=(1000, 1))) + + # inputlayer and outputdense automatically + # extract dataset shapes and extend the + # Input and Output layers appropriately. + # That is, only the model body needs to be specified. + @inputlayer + @outputdense('sigmoid') + def model_body_template(inputs, inp, oup, params): + with inputs.use('ip') as layer: + # the with block allows + # for easy access of a specific named input. + output = Dense(params)(layer) + return inputs, output + + # create the model. + model = Janggu.create(template=model_body_template, + modelparams=3, + inputs=DATA, outputs=LABELS) + model.summary() + +As is illustrated by the example, automatic Input and Output layer determination +can be achieved by using the decorators :code:`inputlayer` and/or +:code:`outputdense` which extract the layer dimensions from the +provided input and output Datasets in the create constructor. + + +Fit a neural network on DNA sequences +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +In the previous sections, we learned how to acquire data and +how to instantiate neural networks. Now let's +create and fit a simple convolutional neural network that learns +to discriminate between two classes of sequences. In the following example +the sample sequences are of length 200 bp each. `sample.fa` contains Oct4 CHip-seq +peaks and sample2.fa contains Mafk CHip-seq peaks. We shall use a simple +convolutional neural network with 30 filters of length 21 bp to learn +the sequence features that discriminate the two sets of sequences. + +The example makes use of two more janggu utilities: First, +:code:`DnaConv2D` constitutes a keras layer wrapper that facilitates scanning +of both DNA strands with the same kernels. That is it simulataneously applies +a convolution and a cross-correlation and aggregates the resulting activities. +Second, the example illustrates the dataset wrapper :code:`ReduceDim` which +allows to collapse 4D the signal contained in the Cover object +across the sequence length and strand dimension. The result is yields a 2D +table-like dataset which is used in the subsequent model fitting example. + +.. code:: python + + from keras.layers import Conv2D + from keras.layers import GlobalAveragePooling2D + from janggu import inputlayer + from janggu import outputconv + from janggu import DnaConv2D + from janggu.data import ReduceDim + + + # load the dataset which consists of + # 1) a reference genome + REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa') + # 2) ROI contains regions spanning positive and negative examples + ROI_FILE = resource_filename('janggu', 'resources/roi_train.bed') + # 3) PEAK_FILE only contains positive examples + PEAK_FILE = resource_filename('janggu', 'resources/scores.bed') + + # DNA sequences are loaded directly from the reference genome + DNA = Bioseq.create_from_refgenome('dna', refgenome=REFGENOME, + roi=ROI_FILE, + binsize=200) + + # Classification labels over the same regions are loaded + # into the Coverage dataset. + # It is important that both DNA and LABELS load with the same + # binsize, stepsize to ensure + # the correct correspondence between both datasets. + # Finally, the ReduceDim dataset wrapper transforms the 4D Coverage + # object into a 2D table like object (regions by conditions) + LABELS = ReduceDim(Cover.create_from_bed('peaks', roi=ROI_FILE, + bedfiles=PEAK_FILE, + binsize=200, + resolution=None), aggregator='mean') + + + # 2. define a simple conv net with 30 filters of length 15 bp + # and relu activation. + # outputconv as opposed to outputdense will put a conv layer as output + @inputlayer + @outputdense('sigmoid') + def double_stranded_model(inputs, inp, oup, params): + with inputs.use('dna') as layer: + # The DnaConv2D wrapper can be used with Conv2D + # to scan both DNA strands with the weight matrices. + layer = DnaConv2D(Conv2D(params[0], (params[1], 1), + activation=params[2]))(layer) + + output = GlobalAveragePooling2D(name='motif')(layer) + return inputs, output + + + # 3. instantiate and compile the model + model = Janggu.create(template=double_stranded_model, + modelparams=(30, 15, 'relu'), + inputs=DNA, outputs=LABELS) + model.compile(optimizer='adadelta', loss='binary_crossentropy', + metrics=['acc']) + + # 4. fit the model + model.fit(DNA,LABELS,epochs=100) + + +An illustration of the network architecture is depicted below. +Upon creation of the model a network depiction is +automatically produced in :code:`<results_root>/models` which is illustrated +below + +.. image:: dna_peak.png + :width: 70% + :alt: Prediction from DNA to peaks + :align: center + +After the model has been trained, the model parameters and the +illustration of the architecture are stored in :code:`<results_root>/models`. +Furthermore, information about the model fitting, model and dataset dimensions +are written to :code:`<results_root>/logs`. + +Note that in the example above the output dimensionality of the network is 4D. +However, it might be more convenient at times to remove the single dimensional +elements of the array. +This can be achieved by wrapping the LABELS dataset using :code:`ReduceDim`. +In this case the example becomes + +.. code:: python + + @inputlayer + @outputdense('sigmoid') + def double_stranded_model(inputs, inp, oup, params): + with inputs.use('dna') as layer: + # The DnaConv2D wrapper can be used with Conv2D + # to scan both DNA strands with the weight matrices. + layer = DnaConv2D(Conv2D(params[0], (params[1], 1), + activation=params[2]))(layer) + + output = GlobalAveragePooling2D(name='motif')(layer) + return inputs, output + + + # 3. instantiate and compile the model + model = Janggu.create(template=double_stranded_model, + modelparams=(30, 15, 'relu'), + inputs=DNA, outputs=LABELS) + model.compile(optimizer='adadelta', loss='binary_crossentropy', + metrics=['acc']) + + # 4. fit the model + model.fit(DNA, LABELS, epochs=100) + + +Part III) Evaluation and interpretation of the model +----------------------------------------------------- + +Janggu supports various methods to evaluate and interprete a trained model, +including evaluating summary scores, inspecting the results in +the built-in genome browser (see Part I), evaluating the integrated gradients +which allows to visualized input feature importance and by +offering support for variant effect predictions. +In this last part we will illustrate these aspects. + +Evaluation of summary scores +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +After the model has been trained, the quality of the predictions +is usually summarized by its agreement with the ground truth, e.g. by +evaluating the area under the ROC curve in a binary classification application +or by computing the correlation between predictions and targets in a regression +setting. + +For some commonly used evaluation criteria, the evaluate method directly allows +to determine and export the given metric results. +For example, for a classification task the following line +evaluates the ROC and PRC and exports a figure and a tsv file, respectively, +for each measure. + +.. code-block:: python + + model.evaluate(DNA_TEST, LABELS_TEST, callbacks=['roc', 'prc', 'auprc', 'auroc']) + +The results are stored in :code:`<results_root>/evaluation/{roc,prc}.png` +as well as :code:`<results_root>/evaluation/{auroc,auprc}.tsv`. + +Furthermore, for a regression setting it is possible to invoke + +.. code-block:: python + + model.evaluate(DNA_TEST, LABELS_TEST, callbacks=['cor', 'mae', 'mse', 'var_explained']) + +which evaluates the Pearson's correlation, the mean absolute error, the mean squared error +and the explained variance into tsv files. + + +It is also possible to customize the scoring callbacks by instantiating a +:code:`Scorer` objects which can be passed to +:code:`model.evaluate` and :code:`model.predict`. Further details about +customizing the scoring callbacks are given in :doc:`custom_scorer`. + +Input feature importance +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In order to inspect what the model has learned, +it is possible to identify the most important features in the input space +using the integrated gradients method. + +This is illustrated on a toy example for discriminating Oct4 and Mafk binding sites (see +`variant effect prediction`_). + +Variant effect prediction +^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +In order to measure the effect of single nucleotide variant on the predict +network output can be tested via the :code:`Janggu.predict_variant_effect` +based on a Bioseq object and single nucleotide variants in VCF format. +This method evaluates the network for each variant (using its sequence context) +as well as its respective reference sequence. +As a result, an hdf5 file and a bed file will be produced which +contain the network predictions for each variant and the associated genomic +loci. +An illustration of the variant effect prediction in the notebook (see +`variant effect prediction`_). + +Browse through the results +^^^^^^^^^^^^^^^^^^^^^^^^^^ +Finally, after you have fitted and evaluated your results +you can browse through the results in an web browser of your choice. + +To this end, first start the web application server + +:: + + janggu -path <results-root> + +Then you can inspect the outputs in a browser of your choice +(default: localhost:8050) + +.. image:: janggu_example.png + :width: 70% + :alt: Prediction from DNA to peaks + :align: center