|
a/README.md |
|
b/README.md |
1 |
# BPNet |
1 |
# BPNet |
2 |
[](https://circleci.com/gh/kundajelab/bpnet) |
2 |
|
3 |
|
|
|
4 |
BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. It addresses the problem of pinpointing the regulatory elements in the genome: |
3 |
BPNet is a python package with a CLI to train and interpret base-resolution deep neural networks trained on functional genomics data such as ChIP-nexus or ChIP-seq. It addresses the problem of pinpointing the regulatory elements in the genome: |
5 |
|
4 |
|
6 |
<img src="./docs/theme_dir/bpnet/dna-words.png" alt="BPNet" style="width: 600px;"/> |
5 |
|
7 |
|
|
|
8 |
Specifically, it aims to answer the following questions: |
6 |
Specifically, it aims to answer the following questions:
|
9 |
- What are the sequence motifs? |
7 |
- What are the sequence motifs?
|
10 |
- Where are they located in the genome? |
8 |
- Where are they located in the genome?
|
11 |
- How do they interact? |
9 |
- How do they interact? |
12 |
|
10 |
|
13 |
For more information, see the BPNet manuscript: |
11 |
For more information, see the BPNet manuscript: |
14 |
|
12 |
|
15 |
*Deep learning at base-resolution reveals motif syntax of the cis-regulatory code* (http://dx.doi.org/10.1101/737981). |
13 |
*Deep learning at base-resolution reveals motif syntax of the cis-regulatory code* (http://dx.doi.org/10.1101/737981). |
16 |
|
14 |
|
17 |
## Overview |
15 |
|
18 |
|
|
|
19 |
<img src="./docs/theme_dir/bpnet/overview.png" alt="BPNet" style="width: 400px;"/> |
|
|
20 |
|
|
|
21 |
## Getting started |
16 |
## Getting started |
22 |
|
17 |
|
23 |
The main documentation of the bpnet package and an end-to-end example highlighting the main features are contained in the following colab notebook **<https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD>**. You can run this notebook yourself by clicking on '**Open in playground**'. Individual cells of this notebook can be executed by pressing the Shift+Enter keyboard shortcut. |
18 |
The main documentation of the bpnet package and an end-to-end example highlighting the main features are contained in the following colab notebook **<https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD>**. You can run this notebook yourself by clicking on '**Open in playground**'. Individual cells of this notebook can be executed by pressing the Shift+Enter keyboard shortcut. |
24 |
|
19 |
|
25 |
<img src="./docs/theme_dir/bpnet/colab-header.png" alt="BPNet" style="width: 300px;"/> |
20 |
|
26 |
|
|
|
27 |
To learn more about colab, visit <https://colab.research.google.com> and follow the 'Welcome To Colaboratory' notebook. |
21 |
To learn more about colab, visit <https://colab.research.google.com> and follow the 'Welcome To Colaboratory' notebook. |
28 |
|
22 |
|
29 |
## Main commands |
23 |
## Main commands |
30 |
|
24 |
|
31 |
Compute data statistics to inform hyper-parameter selection, such as the trade-off between profile and total count loss (`lambda` hyper-parameter): |
25 |
Compute data statistics to inform hyper-parameter selection, such as the trade-off between profile and total count loss (`lambda` hyper-parameter): |
32 |
|
26 |
|
33 |
```bash |
27 |
```bash
|
34 |
bpnet dataspec-stats dataspec.yml |
28 |
bpnet dataspec-stats dataspec.yml
|
35 |
``` |
29 |
``` |
36 |
|
30 |
|
37 |
Train a model on BigWig tracks specified in [dataspec.yml](examples/chip-nexus/dataspec.yml) using an existing architecture [bpnet9](bpnet/premade/bpnet9-pyspec.gin) on 200 bp sequences with 6 dilated convolutional layers: |
31 |
Train a model on BigWig tracks specified in [dataspec.yml](examples/chip-nexus/dataspec.yml) using an existing architecture [bpnet9](bpnet/premade/bpnet9-pyspec.gin) on 200 bp sequences with 6 dilated convolutional layers: |
38 |
|
32 |
|
39 |
```bash |
33 |
```bash
|
40 |
bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' . |
34 |
bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' .
|
41 |
``` |
35 |
``` |
42 |
|
36 |
|
43 |
Compute contribution scores for regions specified in the `dataspec.yml` file and store them in `contrib.scores.h5`: |
37 |
Compute contribution scores for regions specified in the `dataspec.yml` file and store them in `contrib.scores.h5`: |
44 |
|
38 |
|
45 |
```bash |
39 |
```bash
|
46 |
bpnet contrib . --method=deeplift contrib.scores.h5 |
40 |
bpnet contrib . --method=deeplift contrib.scores.h5
|
47 |
``` |
41 |
``` |
48 |
|
42 |
|
49 |
Export BigWig tracks containing model predictions and contribution scores: |
43 |
Export BigWig tracks containing model predictions and contribution scores: |
50 |
|
44 |
|
51 |
```bash |
45 |
```bash
|
52 |
bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/ |
46 |
bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/
|
53 |
``` |
47 |
``` |
54 |
|
48 |
|
55 |
Discover motifs with TF-MoDISco using contribution scores stored in `contrib.scores.h5`, premade configuration [modisco-50k](bpnet/premade/modisco-50k.gin) and restricting the number of seqlets per metacluster to 20k: |
49 |
Discover motifs with TF-MoDISco using contribution scores stored in `contrib.scores.h5`, premade configuration [modisco-50k](bpnet/premade/modisco-50k.gin) and restricting the number of seqlets per metacluster to 20k: |
56 |
|
50 |
|
57 |
```bash |
51 |
```bash
|
58 |
bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/ |
52 |
bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/
|
59 |
``` |
53 |
``` |
60 |
|
54 |
|
61 |
Determine motif instances with CWM scanning and store them in `motif-instances.tsv.gz`: |
55 |
Determine motif instances with CWM scanning and store them in `motif-instances.tsv.gz`: |
62 |
|
56 |
|
63 |
```bash |
57 |
```bash
|
64 |
bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz |
58 |
bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz
|
65 |
``` |
59 |
``` |
66 |
|
60 |
|
67 |
Generate additional reports suitable for ChIP-nexus or ChIP-seq data: |
61 |
Generate additional reports suitable for ChIP-nexus or ChIP-seq data: |
68 |
|
62 |
|
69 |
```bash |
63 |
```bash
|
70 |
bpnet chip-nexus-analysis modisco/ |
64 |
bpnet chip-nexus-analysis modisco/
|
71 |
``` |
65 |
``` |
72 |
|
66 |
|
73 |
Note: these commands are also accessible as python functions: |
67 |
Note: these commands are also accessible as python functions:
|
74 |
- `bpnet.cli.train.bpnet_train` |
68 |
- `bpnet.cli.train.bpnet_train`
|
75 |
- `bpnet.cli.train.dataspec_stats` |
69 |
- `bpnet.cli.train.dataspec_stats`
|
76 |
- `bpnet.cli.contrib.bpnet_contrib` |
70 |
- `bpnet.cli.contrib.bpnet_contrib`
|
77 |
- `bpnet.cli.export_bw.bpnet_export_bw` |
71 |
- `bpnet.cli.export_bw.bpnet_export_bw`
|
78 |
- `bpnet.cli.modisco.bpnet_modisco_run` |
72 |
- `bpnet.cli.modisco.bpnet_modisco_run`
|
79 |
- `bpnet.cli.modisco.cwm_scan` |
73 |
- `bpnet.cli.modisco.cwm_scan`
|
80 |
- `bpnet.cli.modisco.chip_nexus_analysis` |
74 |
- `bpnet.cli.modisco.chip_nexus_analysis` |
81 |
|
75 |
|
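The commands above appear to be listed in the order in which they consume each other's outputs (the trained model directory, then `contrib.scores.h5`, then `modisco/`). As a rough sketch, not something bpnet itself provides, they could be chained so that a failing step stops the run. The names `run`, `bpnet_pipeline`, and `DRY_RUN` are hypothetical and introduced only for illustration; the arguments are copied from the examples above.

```bash
# Hypothetical wrapper: with DRY_RUN=1 the command is printed instead of executed.
# Note that printing drops the shell quoting around the --override values.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical function chaining the README commands; '&&' stops at the first failure.
bpnet_pipeline() {
    run bpnet dataspec-stats dataspec.yml &&
    run bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' . &&
    run bpnet contrib . --method=deeplift contrib.scores.h5 &&
    run bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/ &&
    run bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/ &&
    run bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz &&
    run bpnet chip-nexus-analysis modisco/
}

# Preview the steps without running anything.
DRY_RUN=1 bpnet_pipeline
```

Calling `bpnet_pipeline` without `DRY_RUN=1` would execute the steps for real, which assumes `dataspec.yml` and `intervals.bed` exist in the current directory and the `bpnet` conda environment is active.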
82 |
## Main python classes |
76 |
## Main python classes |
83 |
|
77 |
|
84 |
- `bpnet.seqmodel.SeqModel` - Keras model container specified by implementing output 'heads' and a common 'body'. It contains methods to compute the contribution scores of the input sequence w.r.t. different output heads. |
78 |
- `bpnet.seqmodel.SeqModel` - Keras model container specified by implementing output 'heads' and a common 'body'. It contains methods to compute the contribution scores of the input sequence w.r.t. different output heads.
|
85 |
- `bpnet.BPNet.BPNetSeqModel` - Wrapper around `SeqModel` consolidating profile and total count predictions into a single output per task. It provides methods to export predictions and contribution scores to BigWig files as well as methods to simulate the spacing between two motifs. |
79 |
- `bpnet.BPNet.BPNetSeqModel` - Wrapper around `SeqModel` consolidating profile and total count predictions into a single output per task. It provides methods to export predictions and contribution scores to BigWig files as well as methods to simulate the spacing between two motifs.
|
86 |
- `bpnet.cli.contrib.ContribFile` - File handle to the HDF5 containing the contribution scores |
80 |
- `bpnet.cli.contrib.ContribFile` - File handle to the HDF5 containing the contribution scores
|
87 |
- `bpnet.modisco.files.ModiscoFile` - File handle to the HDF5 file produced by TF-MoDISco. |
81 |
- `bpnet.modisco.files.ModiscoFile` - File handle to the HDF5 file produced by TF-MoDISco.
|
88 |
- `bpnet.modisco.core.Pattern` - Object containing the PFM, CWM and optionally the signal footprint |
82 |
- `bpnet.modisco.core.Pattern` - Object containing the PFM, CWM and optionally the signal footprint
|
89 |
- `bpnet.modisco.core.Seqlet` - Object containing the seqlet coordinates. |
83 |
- `bpnet.modisco.core.Seqlet` - Object containing the seqlet coordinates.
|
90 |
- `bpnet.modisco.core.StackedSeqletContrib` - Object containing the sequence, contribution scores and raw data at seqlet locations. |
84 |
- `bpnet.modisco.core.StackedSeqletContrib` - Object containing the sequence, contribution scores and raw data at seqlet locations.
|
91 |
- `bpnet.dataspecs.DataSpec` - File handle to the `dataspec.yml` file |
85 |
- `bpnet.dataspecs.DataSpec` - File handle to the `dataspec.yml` file
|
92 |
- `dfi` - Frequently used alias for a pandas `DataFrame` containing motif instance coordinates produced by `bpnet cwm-scan`. See the [colab notebook](https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD) for the column description. |
86 |
- `dfi` - Frequently used alias for a pandas `DataFrame` containing motif instance coordinates produced by `bpnet cwm-scan`. See the [colab notebook](https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD) for the column description. |
93 |
|
87 |
|
94 |
## Installation |
88 |
## Installation |
95 |
|
89 |
|
96 |
The supported Python version is 3.6. After installing anaconda ([download page](https://www.anaconda.com/download/)) or miniconda ([download page](https://conda.io/miniconda.html)), create a new bpnet environment by executing the following code: |
90 |
The supported Python version is 3.6. After installing anaconda ([download page](https://www.anaconda.com/download/)) or miniconda ([download page](https://conda.io/miniconda.html)), create a new bpnet environment by executing the following code: |
97 |
|
91 |
|
98 |
```bash |
92 |
```bash
|
99 |
# Clone this repository |
93 |
# Clone this repository
|
100 |
git clone git@github.com:kundajelab/bpnet.git |
94 |
git clone git@github.com:kundajelab/bpnet.git
|
101 |
cd bpnet |
95 |
cd bpnet |
102 |
|
96 |
|
103 |
# create 'bpnet' conda environment |
97 |
# create 'bpnet' conda environment
|
104 |
conda env create -f conda-env.yml |
98 |
conda env create -f conda-env.yml |
105 |
|
99 |
|
106 |
# Disable HDF5 file locking to prevent issues with Keras (https://github.com/h5py/h5py/issues/1082) |
100 |
# Disable HDF5 file locking to prevent issues with Keras (https://github.com/h5py/h5py/issues/1082)
|
107 |
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc |
101 |
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc |
108 |
|
102 |
|
109 |
# Activate the conda environment |
103 |
# Activate the conda environment
|
110 |
source activate bpnet |
104 |
source activate bpnet
|
111 |
``` |
105 |
``` |
112 |
|
106 |
|
113 |
Alternatively, you can start from a fresh conda environment by running the following: |
107 |
Alternatively, you can start from a fresh conda environment by running the following: |
114 |
|
108 |
|
115 |
```bash |
109 |
```bash
|
116 |
conda create -n bpnet python=3.6 |
110 |
conda create -n bpnet python=3.6
|
117 |
source activate bpnet |
111 |
source activate bpnet
|
118 |
conda install -c bioconda pybedtools bedtools pybigwig pysam genomelake |
112 |
conda install -c bioconda pybedtools bedtools pybigwig pysam genomelake
|
119 |
pip install git+https://github.com/kundajelab/DeepExplain.git |
113 |
pip install git+https://github.com/kundajelab/DeepExplain.git
|
120 |
pip install tensorflow~=1.0 # or tensorflow-gpu if you are using a GPU |
114 |
pip install tensorflow~=1.0 # or tensorflow-gpu if you are using a GPU
|
121 |
pip install bpnet |
115 |
pip install bpnet
|
122 |
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc |
116 |
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc
|
123 |
``` |
117 |
``` |
124 |
|
118 |
|
125 |
When using bpnet from the command line, remember to activate the `bpnet` conda environment first: |
119 |
When using bpnet from the command line, remember to activate the `bpnet` conda environment first: |
126 |
|
120 |
|
127 |
```bash |
121 |
```bash
|
128 |
# activate the bpnet conda environment |
122 |
# activate the bpnet conda environment
|
129 |
source activate bpnet |
123 |
source activate bpnet |
130 |
|
124 |
|
131 |
# run bpnet |
125 |
# run bpnet
|
132 |
bpnet <command> ... |
126 |
bpnet <command> ...
|
133 |
``` |
127 |
``` |
134 |
|
128 |
|
135 |
### (Optional) Install `vmtouch` to use `bpnet train --vmtouch` |
129 |
### (Optional) Install `vmtouch` to use `bpnet train --vmtouch` |
136 |
|
130 |
|
137 |
To use the `--vmtouch` flag of the `bpnet train` command and thereby speed up data loading, install [vmtouch](https://hoytech.com/vmtouch/). vmtouch loads the bigWig files into the system memory cache, which allows multiple processes to access |
131 |
To use the `--vmtouch` flag of the `bpnet train` command and thereby speed up data loading, install [vmtouch](https://hoytech.com/vmtouch/). vmtouch loads the bigWig files into the system memory cache, which allows multiple processes to access
|
138 |
the bigWigs loaded into memory. |
132 |
the bigWigs loaded into memory. |
139 |
|
133 |
|
140 |
Here's how to build and install vmtouch: |
134 |
Here's how to build and install vmtouch: |
141 |
|
135 |
|
142 |
```bash |
136 |
```bash
|
143 |
# ~/bin = directory for locally compiled binaries |
137 |
# ~/bin = directory for locally compiled binaries
|
144 |
mkdir -p ~/bin |
138 |
mkdir -p ~/bin
|
145 |
cd ~/bin |
139 |
cd ~/bin
|
146 |
# Clone and build |
140 |
# Clone and build
|
147 |
git clone https://github.com/hoytech/vmtouch.git vmtouch_src |
141 |
git clone https://github.com/hoytech/vmtouch.git vmtouch_src
|
148 |
cd vmtouch_src |
142 |
cd vmtouch_src
|
149 |
make |
143 |
make
|
150 |
# Copy the binary to ~/bin |
144 |
# Copy the binary to ~/bin
|
151 |
cp vmtouch ../ |
145 |
cp vmtouch ../
|
152 |
# Add ~/bin to $PATH |
146 |
# Add ~/bin to $PATH
|
153 |
echo 'export PATH=$PATH:~/bin' >> ~/.bashrc |
147 |
echo 'export PATH=$PATH:~/bin' >> ~/.bashrc
|
154 |
``` |
148 |
```
|
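
After the installation steps, it can help to confirm that the settings written to `~/.bashrc` are visible in the current shell. The function below is a minimal sketch; `check_bpnet_setup` is a hypothetical helper introduced here, not part of bpnet or vmtouch.

```bash
# Hypothetical helper: report whether the installation steps took effect in this shell.
check_bpnet_setup() {
    # vmtouch is optional, so its absence is reported but not treated as a failure.
    if command -v vmtouch >/dev/null 2>&1; then
        echo "vmtouch: found at $(command -v vmtouch)"
    else
        echo "vmtouch: not on PATH (only needed for 'bpnet train --vmtouch')"
    fi
    # The README sets this variable to FALSE to avoid HDF5 locking issues with Keras.
    if [ "${HDF5_USE_FILE_LOCKING:-}" = "FALSE" ]; then
        echo "HDF5 file locking: disabled"
        return 0
    fi
    echo "HDF5 file locking: not disabled; open a new shell or run 'source ~/.bashrc'"
    return 1
}

check_bpnet_setup || echo "Setup incomplete, see the messages above."
```

The function returns a non-zero status only when `HDF5_USE_FILE_LOCKING` is not `FALSE`, so it can also be used as a guard before starting a long training run.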