a/README.md b/README.md
# BPNet
[![CircleCI](https://circleci.com/gh/kundajelab/bpnet.svg?style=svg&circle-token=f55c1cf580b05df76e260993f7645e35d5302e76)](https://circleci.com/gh/kundajelab/bpnet)

BPNet is a Python package with a CLI to train and interpret base-resolution deep neural networks on functional genomics data such as ChIP-nexus or ChIP-seq. It addresses the problem of pinpointing the regulatory elements in the genome:

<img src="./docs/theme_dir/bpnet/dna-words.png" alt="BPNet" style="width: 600px;"/>

Specifically, it aims to answer the following questions:
- What are the sequence motifs?
- Where are they located in the genome?
- How do they interact?

For more information, see the BPNet manuscript:

*Deep learning at base-resolution reveals motif syntax of the cis-regulatory code* (http://dx.doi.org/10.1101/737981).

## Overview

<img src="./docs/theme_dir/bpnet/overview.png" alt="BPNet" style="width: 400px;"/>

## Getting started

The main documentation of the bpnet package and an end-to-end example highlighting its main features are contained in the following colab notebook: **<https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD>**. You can run this notebook yourself by clicking on '**Open in playground**'. Individual cells of this notebook can be executed by pressing the Shift+Enter keyboard shortcut.

<img src="./docs/theme_dir/bpnet/colab-header.png" alt="BPNet" style="width: 300px;"/>

To learn more about colab, visit <https://colab.research.google.com> and follow the 'Welcome To Colaboratory' notebook.

## Main commands

Compute data statistics to inform hyper-parameter selection, such as the trade-off between the profile loss and the total count loss (`lambda` hyper-parameter):

```bash
bpnet dataspec-stats dataspec.yml
```
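
To make the role of `lambda` more concrete, the following minimal sketch assumes that the total loss is the profile loss plus `lambda` times the total count loss. Both this weighted-sum form and the function name `combined_loss` are assumptions made for illustration; they are not taken from the bpnet code base, and the loss values are made up.

```python
def combined_loss(profile_loss: float, count_loss: float, lam: float) -> float:
    """Weighted sum of the two loss terms (assumed form, for illustration only)."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    # A larger lam puts more weight on the total counts,
    # a smaller lam puts more weight on the profile shape.
    return profile_loss + lam * count_loss


# Hypothetical loss values evaluated under two settings of lam.
print(combined_loss(profile_loss=250.0, count_loss=0.5, lam=10.0))   # 255.0
print(combined_loss(profile_loss=250.0, count_loss=0.5, lam=100.0))  # 300.0
```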

Train a model on BigWig tracks specified in [dataspec.yml](examples/chip-nexus/dataspec.yml) using an existing architecture [bpnet9](bpnet/premade/bpnet9-pyspec.gin) on 200 bp sequences with 6 dilated convolutional layers:

```bash
bpnet train --premade=bpnet9 dataspec.yml --override='seq_width=200;n_dil_layers=6' .
```
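
When the training command is launched from a script, the `--override` string can be assembled from a dictionary. The helper below is a hypothetical convenience, not part of bpnet; it only reproduces the `key=value;key=value` form shown above and handles plain numeric values. The resulting argument list could be passed to `subprocess.run` once bpnet is installed.

```python
def format_override(bindings: dict) -> str:
    """Join key/value pairs into the 'key=value;key=value' form used by --override."""
    return ";".join(f"{key}={value}" for key, value in bindings.items())


override = format_override({"seq_width": 200, "n_dil_layers": 6})
print(override)  # seq_width=200;n_dil_layers=6

# Argument list mirroring the command above (no shell quoting needed in list form).
cmd = ["bpnet", "train", "--premade=bpnet9", "dataspec.yml", f"--override={override}", "."]
print(cmd)
```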

Compute contribution scores for the regions specified in the `dataspec.yml` file and store them in `contrib.scores.h5`:

```bash
bpnet contrib . --method=deeplift contrib.scores.h5
```

Export BigWig tracks containing model predictions and contribution scores:

```bash
bpnet export-bw . --regions=intervals.bed --scale-contribution bigwigs/
```

Discover motifs with TF-MoDISco using the contribution scores stored in `contrib.scores.h5` and the premade configuration [modisco-50k](bpnet/premade/modisco-50k.gin), restricting the number of seqlets per metacluster to 20k:

```bash
bpnet modisco-run contrib.scores.h5 --premade=modisco-50k --override='TfModiscoWorkflow.max_seqlets_per_metacluster=20000' modisco/
```

Determine motif instances with CWM scanning and store them in `motif-instances.tsv.gz`:

```bash
bpnet cwm-scan modisco/ --contrib-file=contrib.scores.h5 modisco/motif-instances.tsv.gz
```
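
The resulting `.tsv.gz` file can be inspected with the Python standard library alone. The sketch below assumes the file is a gzip-compressed, tab-separated table with a header row (as the extension suggests). Because the real column names are not listed here, it writes a tiny stand-in file with hypothetical columns so that it runs without bpnet. With pandas installed, `pd.read_csv(path, sep="\t")` would load the same kind of file into a `DataFrame`.

```python
import csv
import gzip
import os
import tempfile


def read_tsv_gz(path):
    """Return the rows of a gzip-compressed, tab-separated file as a list of dicts."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Write a small stand-in file; the column names are hypothetical,
# not the real `bpnet cwm-scan` schema.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "motif-instances.tsv.gz")
with gzip.open(demo_path, "wt", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["chrom", "start", "end"])
    writer.writerow(["chr1", "100", "116"])

rows = read_tsv_gz(demo_path)
print(rows[0]["chrom"], rows[0]["start"], rows[0]["end"])
```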

Generate additional reports suitable for ChIP-nexus or ChIP-seq data:

```bash
bpnet chip-nexus-analysis modisco/
```

Note: these commands are also accessible as python functions:
- `bpnet.cli.train.bpnet_train`
- `bpnet.cli.train.dataspec_stats`
- `bpnet.cli.contrib.bpnet_contrib`
- `bpnet.cli.export_bw.bpnet_export_bw`
- `bpnet.cli.modisco.bpnet_modisco_run`
- `bpnet.cli.modisco.cwm_scan`
- `bpnet.cli.modisco.chip_nexus_analysis`
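
The commands above can also be chained from a script. Since the signatures of the Python functions are not shown here, this sketch sticks to the CLI and only reuses the commands and flags listed in this section. The function name `build_pipeline` and its parameters are hypothetical. It performs a dry run that prints each command; executing them requires an installed bpnet.

```python
import os
import shlex


def build_pipeline(dataspec="dataspec.yml", model_dir=".",
                   contrib="contrib.scores.h5", modisco_dir="modisco/"):
    """Return the commands from this section, in order, as argument lists."""
    instances = os.path.join(modisco_dir, "motif-instances.tsv.gz")
    return [
        ["bpnet", "dataspec-stats", dataspec],
        ["bpnet", "train", "--premade=bpnet9", dataspec,
         "--override=seq_width=200;n_dil_layers=6", model_dir],
        ["bpnet", "contrib", model_dir, "--method=deeplift", contrib],
        ["bpnet", "modisco-run", contrib, "--premade=modisco-50k",
         "--override=TfModiscoWorkflow.max_seqlets_per_metacluster=20000", modisco_dir],
        ["bpnet", "cwm-scan", modisco_dir, f"--contrib-file={contrib}", instances],
        ["bpnet", "chip-nexus-analysis", modisco_dir],
    ]


# Dry run: print shell-quoted commands instead of executing them.
for argv in build_pipeline():
    print(" ".join(shlex.quote(arg) for arg in argv))
    # To execute for real: import subprocess; subprocess.run(argv, check=True)
```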

## Main python classes

- `bpnet.seqmodel.SeqModel` - Keras model container specified by implementing output 'heads' and a common 'body'. It contains methods to compute the contribution scores of the input sequence w.r.t. different output heads.
- `bpnet.BPNet.BPNetSeqModel` - Wrapper around `SeqModel` consolidating profile and total count predictions into a single output per task. It provides methods to export predictions and contribution scores to BigWig files as well as methods to simulate the spacing between two motifs.
- `bpnet.cli.contrib.ContribFile` - File handle to the HDF5 file containing the contribution scores.
- `bpnet.modisco.files.ModiscoFile` - File handle to the HDF5 file produced by TF-MoDISco.
  - `bpnet.modisco.core.Pattern` - Object containing the PFM, CWM and optionally the signal footprint.
  - `bpnet.modisco.core.Seqlet` - Object containing the seqlet coordinates.
  - `bpnet.modisco.core.StackedSeqletContrib` - Object containing the sequence, contribution scores and raw data at seqlet locations.
- `bpnet.dataspecs.DataSpec` - File handle to the `dataspec.yml` file.
- `dfi` - Frequently used alias for a pandas `DataFrame` containing motif instance coordinates produced by `bpnet cwm-scan`. See the [colab notebook](https://colab.research.google.com/drive/1VNsNBfugPJfJ02LBgvPwj-gPK0L_djsD) for the column description.

## Installation

Supported python version is 3.6. After installing anaconda ([download page](https://www.anaconda.com/download/)) or miniconda ([download page](https://conda.io/miniconda.html)), create a new bpnet environment by executing the following code:

```bash
# Clone this repository
git clone git@github.com:kundajelab/bpnet.git
cd bpnet

# create 'bpnet' conda environment
conda env create -f conda-env.yml

# Disable HDF5 file locking to prevent issues with Keras (https://github.com/h5py/h5py/issues/1082)
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc

# Activate the conda environment
source activate bpnet
```
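
In a notebook or script where editing `~/.bashrc` is not convenient, the same variable can be set from Python. This is a minimal sketch, assuming it runs before h5py (and therefore Keras) is first imported in the process, which is the safe order for the HDF5 library to pick the setting up.

```python
import os

# Set the variable before h5py or Keras is imported in this process.
# Child processes started afterwards inherit it as well.
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"

print(os.environ["HDF5_USE_FILE_LOCKING"])  # FALSE
```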

Alternatively, you could also start a fresh conda environment by running the following:

```bash
conda create -n bpnet python=3.6
source activate bpnet
conda install -c bioconda pybedtools bedtools pybigwig pysam genomelake
pip install git+https://github.com/kundajelab/DeepExplain.git
pip install tensorflow~=1.0 # or tensorflow-gpu if you are using a GPU
pip install bpnet
echo 'export HDF5_USE_FILE_LOCKING=FALSE' >> ~/.bashrc
```

When using bpnet from the command line, don't forget to activate the `bpnet` conda environment first:

```bash
# activate the bpnet conda environment
source activate bpnet

# run bpnet
bpnet <command> ...
```

### (Optional) Install `vmtouch` to use `bpnet train --vmtouch`

To use the `--vmtouch` option of the `bpnet train` command and thereby speed up data loading, install [vmtouch](https://hoytech.com/vmtouch/). vmtouch loads the bigWig files into the system memory cache, which allows multiple processes to access the bigWigs from memory.

Here's how to build and install vmtouch:

```bash
# ~/bin = directory for locally compiled binaries
mkdir -p ~/bin
cd ~/bin
# Clone and build
git clone https://github.com/hoytech/vmtouch.git vmtouch_src
cd vmtouch_src
make
# Copy the binary to ~/bin
cp vmtouch ../
# Add ~/bin to $PATH
echo 'export PATH=$PATH:~/bin' >> ~/.bashrc
```
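
After opening a new shell, a quick way to confirm that the binary is visible is to look it up on `PATH`. The helper name `vmtouch_available` is hypothetical; the sketch uses only the standard library and does not run vmtouch itself.

```python
import shutil


def vmtouch_available() -> bool:
    """Return True if a `vmtouch` executable can be found on the current PATH."""
    return shutil.which("vmtouch") is not None


if vmtouch_available():
    print("vmtouch found at", shutil.which("vmtouch"))
else:
    print("vmtouch not found; open a new shell or check that ~/bin is on PATH")
```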