=====================================
Janggu - Deep learning for Genomics
=====================================

.. start-badges

.. image:: https://readthedocs.org/projects/janggu/badge/?style=flat
    :target: https://janggu.readthedocs.io/en/latest
    :alt: Documentation Status

.. image:: https://travis-ci.org/BIMSBbioinfo/janggu.svg?branch=master
    :alt: Travis-CI Build Status
    :target: https://travis-ci.org/BIMSBbioinfo/janggu

.. image:: https://codecov.io/github/BIMSBbioinfo/janggu/coverage.svg?branch=master
    :alt: Coverage Status
    :target: https://codecov.io/github/BIMSBbioinfo/janggu

.. image:: https://badge.fury.io/py/janggu.svg
    :alt: PyPI Package latest release
    :target: https://pypi.org/project/janggu

.. image:: https://img.shields.io/pypi/l/janggu.svg?color=green
    :alt: License
    :target: https://pypi.org/project/janggu

.. image:: https://img.shields.io/pypi/pyversions/janggu.svg
    :alt: Supported Python Versions
    :target: https://pypi.org/project/janggu/

.. image:: https://pepy.tech/badge/janggu
    :alt: Downloads
    :target: https://pepy.tech/project/janggu

.. end-badges

.. image:: jangguhex.png
   :width: 40%
   :alt: Janggu logo
   :align: center

Janggu is a Python package that facilitates deep learning in the context of
genomics. The package is freely available under a GPL-3.0 license.

.. image:: Janggu-visAbstract.png
   :width: 50%
   :alt: Janggu visual abstract
   :align: center

In particular, the package provides easy access to
typical **genomics data formats**
and **out-of-the-box evaluation** (for Keras models specifically), so that you can concentrate
on designing the neural network architecture
and quickly test biological hypotheses.
Comprehensive documentation is available `here <https://janggu.readthedocs.io/en/latest>`_.

Hallmarks of Janggu:
---------------------

1. Janggu provides special **Genomics datasets** that allow you to access raw data in FASTA, BAM, BIGWIG, BED and GFF file format.
2. Various **normalization** procedures are supported for the genomics datasets, including 'TPM', 'zscore' or custom normalizers.
3. Biological features can be represented in terms of higher-order sequence features, e.g. di-nucleotide based features.
4. The dataset objects can be consumed directly by neural networks implemented with, for example, `Keras <https://keras.io>`_ or by `scikit-learn <https://scikit-learn.org/stable/index.html>`_ models (see src/examples in this repository).
5. NumPy-format output of a Keras model can be converted to genomic coverage tracks, which allows exporting the predictions as BIGWIG files and visualizing them in genome browser-like plots.
6. Genomic datasets can be stored in various ways, including as NumPy arrays, as sparse datasets or in HDF5 format.
7. Caching of genomic datasets avoids time-consuming preprocessing steps and facilitates fast reloading.
8. Janggu provides a wrapper for `Keras <https://keras.io>`_ models with built-in logging functionality and automated result evaluation.
9. Janggu supports input feature importance attribution using the integrated gradients method and variant effect prediction assessment.
10. Janggu provides utilities such as a Keras layer for scanning both DNA strands for motif occurrences.
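
To illustrate what a higher-order sequence representation (hallmark 3) means,
here is a minimal sketch of a mono- and di-nucleotide one-hot encoding in plain Python.
It does not use Janggu's API: the helper ``one_hot_encode`` and its ``order``
parameter are hypothetical names chosen for this illustration, and the sketch is
not claimed to match Janggu's internal implementation.

.. code-block:: python

   from itertools import product

   ALPHABET = "ACGT"


   def one_hot_encode(seq, order=1):
       """Return a one-hot matrix with one row per k-mer start position.

       Hypothetical helper for illustration only (not part of Janggu).
       Each row has len(ALPHABET) ** order columns; k-mers containing
       characters outside the alphabet (e.g. N) yield an all-zero row.
       """
       # Map every possible k-mer of length `order` to a column index.
       kmers = ["".join(p) for p in product(ALPHABET, repeat=order)]
       column = {kmer: i for i, kmer in enumerate(kmers)}

       rows = []
       for start in range(len(seq) - order + 1):
           row = [0] * len(kmers)
           kmer = seq[start:start + order].upper()
           if kmer in column:
               row[column[kmer]] = 1
           rows.append(row)
       return rows


   mono = one_hot_encode("ACGT", order=1)  # 4 positions x 4 columns
   di = one_hot_encode("ACGT", order=2)    # 3 positions x 16 columns

With ``order=2``, a sequence of length L yields L - 1 rows with 16 columns each,
one column per possible di-nucleotide.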

Getting started
----------------

Janggu makes it easy to access data from genomic file formats and utilize it for
machine learning purposes.

.. code-block:: python

  from janggu.data import Bioseq, Cover

  # replace the placeholder file names with your own files
  dna = Bioseq.create_from_refgenome('dna', refgenome='refgenome.fa', roi='roi.bed')
  labels = Cover.create_from_bed('labels', bedfiles='labels.bed', roi='roi.bed')

  # kerasmodel is a compiled Keras model that is defined elsewhere
  kerasmodel.fit(dna, labels)

A range of examples can be found in './src/examples' of this repository,
which includes Jupyter notebooks that illustrate Janggu's functionality
and how it can be used with popular deep learning frameworks, including
Keras, scikit-learn or PyTorch.

Why the name Janggu?
---------------------

`Janggu <https://en.wikipedia.org/wiki/Janggu>`_ is a Korean percussion
instrument that looks like an hourglass.

Like the two ends of the instrument, the philosophy of the
Janggu package is to help with the two ends of a
deep learning application in genomics,
namely data acquisition and evaluation.


Installation
============

A list of Python dependencies is defined in ``setup.py``.
Additionally, `bedtools <https://bedtools.readthedocs.io/>`_ is required by ``pybedtools``, on which ``janggu`` depends.

Janggu depends on TensorFlow and Keras.
To install Janggu with TensorFlow version 1 or version 2, use

::

   # to install with tensorflow==1.14 and keras==2.2
   pip install janggu[tf] # or janggu[tf_gpu]

   # to install with tensorflow==2.2 and keras==2.4.3
   pip install janggu[tf2] # or janggu[tf2_gpu]

Depending on the pip version (e.g. 20.2.2),
some package dependencies may not be resolved
correctly, so that incompatible package versions are installed.
If this is the case, you could try using
``pip install ... --use-feature=2020-resolver``
or install the required package versions manually.
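
To see which versions pip actually resolved, the installed distributions can be
inspected from Python. The following is a small sketch that uses only the standard
library (``importlib.metadata``, available from Python 3.8); the helper name
``installed_versions`` is hypothetical, and the package names are simply the ones
mentioned above.

.. code-block:: python

   from importlib import metadata


   def installed_versions(packages):
       """Map each distribution name to its installed version, or None if absent."""
       versions = {}
       for name in packages:
           try:
               versions[name] = metadata.version(name)
           except metadata.PackageNotFoundError:
               versions[name] = None
       return versions


   # Report the versions of the packages discussed in this section.
   for name, version in installed_versions(["janggu", "tensorflow", "keras"]).items():
       print(f"{name}: {version or 'not installed'}")

If the GPU variant was installed, the distribution may be registered under a
different name (for example ``tensorflow-gpu``), so that name would have to be
added to the list.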

Alternatively, you can install TensorFlow and Keras in
a conda environment using

::

   # tensorflow v1
   conda install tensorflow==1.14 keras==2.2  # or tensorflow-gpu

   # tensorflow v2
   conda install tensorflow==2.2 keras==2.4.3  # or tensorflow-gpu

Further information regarding the installation of TensorFlow can be found on
the official `tensorflow webpage <https://www.tensorflow.org>`_.

To verify that the installation works, try to run the example contained in the
janggu package as follows

::

   git clone https://github.com/BIMSBbioinfo/janggu
   cd janggu
   python ./src/examples/classify_fasta.py single

A model is then trained to predict the class labels of two sets of toy sequences by scanning the forward strand for sequence patterns and using an ordinary mono-nucleotide one-hot sequence encoding.
The entire training process takes a few minutes on a CPU backend.
Finally, some example prediction scores are shown for Oct4 and Mafk sequences. The accuracy should be around 85% and individual example prediction scores should tend to be higher for Oct4 than for Mafk.

You may also rerun the training so that sequence features are evaluated on both
strands and a higher-order sequence encoding is used, e.g. with the command-line arguments ``dnaconv -order 2``.
Accuracies and prediction scores for the individual example sequences should improve compared to the previous example.
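
Evaluating sequence features on both strands means that a pattern is searched for
in the sequence as given and in its reverse complement. The plain-Python sketch
below illustrates this idea only; ``reverse_complement`` and
``count_motif_both_strands`` are hypothetical helper names, and this is not how
the Janggu layer is implemented.

.. code-block:: python

   # Translation table that maps each nucleotide to its complement.
   COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


   def reverse_complement(seq):
       """Return the reverse complement of a DNA sequence."""
       return seq.translate(COMPLEMENT)[::-1]


   def count_motif_both_strands(seq, motif):
       """Count (possibly overlapping) exact motif matches on both strands."""
       def count(s):
           return sum(1 for i in range(len(s) - len(motif) + 1)
                      if s[i:i + len(motif)] == motif)
       return count(seq), count(reverse_complement(seq))


   # The motif is absent from the forward strand but present on the reverse strand.
   forward, reverse = count_motif_both_strands("ATGCAAAT", "ATTT")

A model that only scans the forward strand would miss the occurrence in this
example, which is one reason why scanning both strands can help.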

Citation
========

| Kopp, W., Monti, R., Tamburrini, A., Ohler, U., Akalin, A. Deep learning for genomics using Janggu. Nat Commun 11, 3488 (2020). https://doi.org/10.1038/s41467-020-17155-y