## Project structure

An inDrops project is composed of a series of sequencing runs, each including one (or several) inDrops libraries within it. A sequencing run can further be split into several parts (effectively arbitrary chunks) to parallelize the analysis. For example, a run sequenced across four lanes can be split into four parts (L001 through L004), each of which can be processed by a different worker.

The project YAML file contains the details of all sequencing runs and libraries within a project.

The same project can contain runs from different versions of the inDrops platform.

All runs in a project are aligned against the same reference genome, with the same alignment parameters.

## Supported library versions

   - v1 : original design where R2 is the biological read and R1 is the metadata read.
   - v2 : inversion of v1 where R1 is the biological read and R2 is the metadata read.
   - v3 : summer 2016 redesign requiring manual demultiplexing. R1 is the biological read.
          R2 carries the first half of the gel barcode, R3 carries the library index and R4
          the second half of the gel barcode, the UMI and a fraction of the polyA tail.

## Installation

The package requires:
  - Python 2.7 (with the packages numpy, scipy, matplotlib, pysam>0.9.0, pyyaml, pyfasta). [See Appendix 2]
  - RSEM (1.2.19+)
  - Bowtie (1.1.1+)
  - samtools (1.3.1+) [See Appendix 3] *This specific version is needed to account for a BAM-format oddity in RSEM output.*
  - Java

The paths to the directories containing these executables should be set in the project YAML.
If these executables can be found through the PATH environment variable, the corresponding project YAML paths can be left empty or omitted altogether; a quick check is sketched below.

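A minimal sketch for checking which of the required executables are already visible on PATH (`rsem-tbam2gbam` is the RSEM executable this pipeline calls later; the other names are the standard binaries of each tool):

    for tool in python bowtie rsem-tbam2gbam samtools java; do
        command -v "$tool" || echo "$tool: NOT FOUND on PATH"
    done
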
### March 7th Notice -- PySAM version

Previous installation instructions installed PySAM version 0.6.0. To install the correct PySAM version, use the following commands:

    conda remove pysam
    conda install pip
    pip install pysam==0.9.1

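To confirm the right version is now the one being imported (PySAM exposes its version string as `pysam.__version__`):

    python -c "import pysam; print(pysam.__version__)"  # should print 0.9.1
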
## Project YAML file

An example YAML file is provided in `test/test_project.yaml`. It should contain the following information:

    project_name : "project_name"
    project_dir : "/path/to/project/dir"  # This dir should be user-owned and writable; all output will go into this dir.
    paths :
      bowtie_index : "/path/to/index" # This index will be built automatically
      # The paths below can be omitted if the relevant directories are already on $PATH
      bowtie_dir : "/path/to/bowtie/dir/"
      python_dir : "/path/to/env/bins/"
      java_dir: "/path/to/java/dir/"
      rsem_dir: "/path/to/rsem/dir/"
      samtools_dir: "/path/to/samtools-1.3.1/bin/" # This needs to be version 1.3.1; 1.3 is not good enough!

    sequencing_runs :
      # A list of sequencing runs which form the project.
      # Each run should have:
      - name : "MyRun" # The name of the run will be used as a prefix in filenames, so keep it sane.
        version : "vN" # Can be 'v1', 'v2' or 'v3'

      # For a run with a single 'part' and a single library:
        dir : "/path/to/run_files/"
        fastq_path : "{read}.fastq.gz" # {read} will be replaced by R1, R2, R3, R4 as appropriate.
        library_name : "my_library"

        # This will expect to find the files:
        #    /path/to/run_files/R1.fastq.gz (and R2...)

      # For a run with several parts, but a single library:
        dir : "/path/to/run_files/"
        fastq_path : "{split_affix}_{read}.fastq.gz" # {split_affix} will be replaced by the values listed below.
        split_affixes : ["L001", "L002"]
        library_name : "my_library"

        # This will expect to find the files:
        #    /path/to/run_files/L001_R1.fastq.gz (and R2...)
        #    /path/to/run_files/L002_R1.fastq.gz (and R2...)

      # For a run with several parts and several libraries that have already been demultiplexed:
        dir : "/path/to/run_files/"
        fastq_path : "{library_prefix}_{split_affix}_{read}.fastq.gz" # {library_prefix} comes from the libraries listed below.
        split_affixes : ["L001", "L002"]
        libraries :
          - {library_name: "test_lib1", library_prefix: "lib1"}
          - {library_name: "test_lib2", library_prefix: "lib2"}

        # This will expect to find the files:
        #    /path/to/run_files/lib1_L001_R1.fastq.gz (and R2...)
        #    /path/to/run_files/lib1_L002_R1.fastq.gz (and R2...)
        #    /path/to/run_files/lib2_L001_R1.fastq.gz (and R2...)
        #    /path/to/run_files/lib2_L002_R1.fastq.gz (and R2...)

      # For a v3 run with several parts, with several libraries that are not already demultiplexed:
        dir : "/path/to/run_files/"
        fastq_path : "{split_affix}_{read}.fastq.gz" # No {library_prefix} here: the pipeline demultiplexes these libraries itself.
        split_affixes : ["L001", "L002", "L003", "L004"]
        libraries :  # The library index is the expected index read sequence (on a NextSeq, this is the reverse complement of the index sequence)
          - {library_name: "test_lib3", library_index: "ATAGAG"}
          - {library_name: "test_lib4", library_index: "AGAGGA"}

        # This will expect to find the files:
        #    /path/to/run_files/L001_R1.fastq.gz (and R2, R3, R4...)
        #    /path/to/run_files/L002_R1.fastq.gz (and R2, R3, R4...)
        #    /path/to/run_files/L003_R1.fastq.gz (and R2, R3, R4...)
        #    /path/to/run_files/L004_R1.fastq.gz (and R2, R3, R4...)

#### Note about v3 runs

The raw BCL files are needed for manual demultiplexing. Move the raw BCL files to a run directory, then use the following command to extract the R1, R2, R3 and R4 files:

    cd /run/dir/
    bcl2fastq --use-bases-mask y*,y*,y*,y* --mask-short-adapter-reads 0 --minimum-trimmed-read-length 0
    # The 'dir' used in the project YAML file should then be:
    #   /run/dir/Data/Intensities/BaseCalls/

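The exact FastQ file names that `bcl2fastq` writes depend on the run and sample sheet; a quick way to inspect what was produced (using the output directory from the comment above):

    ls /run/dir/Data/Intensities/BaseCalls/*.fastq.gz
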
## Analysis steps

### 0. Building a transcriptome bowtie index

The index is a bowtie index built over the transcriptome. It is built using RSEM so that `rsem-tbam2gbam` can be used to convert alignments between transcriptome and genome coordinates.
The index also includes an annotation of which locations are soft-masked (denoting low-complexity regions), to allow filtering of alignments found primarily in soft-masked regions.

The index used by a project is given by `paths:bowtie_index` in the project YAML file. Several projects can share the same index.
Example:

    paths :
      bowtie_index : "/path/to/index_dir/indrops_ensembl_GRCh38_rel85/Homo_sapiens.GRCh38.85.annotated"
      ...more paths

If no index exists, it needs to be built.

#### Example creation of a human index, using ENSEMBL release 85

    mkdir -pv DOWNLOAD_DIR
    cd DOWNLOAD_DIR

    # Download the soft-masked, primary assembly Genome Fasta file
    wget ftp://ftp.ensembl.org/pub/release-85/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz

    # Download the corresponding GTF file.
    wget ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz

    # This command will go through all the steps for creating the index
    python indrops.py project.yaml build_index \
        --genome-fasta-gz DOWNLOAD_DIR/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
        --ensembl-gtf-gz DOWNLOAD_DIR/Homo_sapiens.GRCh38.85.gtf.gz

#### Example creation of a mouse index, using ENSEMBL release 85

    mkdir -pv DOWNLOAD_DIR
    cd DOWNLOAD_DIR

    # Download the soft-masked, primary assembly Genome Fasta file
    wget ftp://ftp.ensembl.org/pub/release-85/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz

    # Download the corresponding GTF file.
    wget ftp://ftp.ensembl.org/pub/release-85/gtf/mus_musculus/Mus_musculus.GRCm38.85.gtf.gz

    # This command will go through all the steps for creating the index
    python indrops.py project.yaml build_index \
        --genome-fasta-gz DOWNLOAD_DIR/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz \
        --ensembl-gtf-gz DOWNLOAD_DIR/Mus_musculus.GRCm38.85.gtf.gz

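To sanity-check a freshly built index, Bowtie's bundled `bowtie-inspect` tool can list the reference (transcript) names it contains. This assumes the index prefix is the one `paths:bowtie_index` points to, as in the example above:

    bowtie-inspect -n /path/to/index_dir/indrops_ensembl_GRCh38_rel85/Homo_sapiens.GRCh38.85.annotated | head
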
### 1. Filter

This iterates over sequencing run parts, optionally filtered by a list of runs and a list of libraries of interest.

    python indrops.py project.yaml filter [--total-workers 1] [--worker-index 0]
          [-r --runs RUNS ] [-l --libraries LIBRARIES ]

    # --runs comma-separated list of runs :              If specified, step will be restricted to run parts coming
    #                                                    from runs in the list
    # --libraries comma-separated list of libraries :    If specified, step will be restricted to run parts that
    #                                                    contain reads from a library in the list
    #
    # The resulting workload (a list of run parts) will be split among N --total-workers,
    # where the worker with --worker-index i will do parts (i, N+i, 2N+i, ...)

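As a hypothetical usage example (run name assumed), filtering only the parts of one run, split across two workers running concurrently:

    python indrops.py project.yaml filter --runs MyRun --total-workers 2 --worker-index 0
    python indrops.py project.yaml filter --runs MyRun --total-workers 2 --worker-index 1
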
This step reads the raw FastQ files as input and filters them:
  - For every raw read, it determines if the read has the expected structure (depending on library version).
  - For reads with the correct structure, it runs Trimmomatic.
  - For reads surviving Trimmomatic, it finds and trims the polyA tail (down to a maximum length of 4), and checks that the reads are still above MIN_LEN.
  - For surviving reads, it determines what fraction of the read is composed of runs of the same base (only considering runs of 5 or more).
    It rejects reads whose fraction is greater than `low_complexity_filter_arguments:max_low_complexity_fraction`.

As output, for every input run part, this produces a filtered FastQ file for every library contained in that run. These files are referred to as 'parts of libraries'.

A log is created detailing what happened to every input read. An index is created that lists the number of reads found for every barcode.

### 2. Identify abundant barcodes

This iterates over libraries, optionally filtered by a list.

    python indrops.py project.yaml identify_abundant_barcodes [--total-workers 1] [--worker-index 0]
          [-l --libraries LIBRARIES]

    # --libraries comma-separated list of libraries :    If specified, step will be restricted to libraries in this list.
    #
    # The resulting workload (a list of libraries) will be split among N --total-workers,
    # where the worker with --worker-index i will do libraries (i, N+i, 2N+i, ...)
    #
    #    *Note* This step is fast; it does not need to be dispatched to several workers.

For each library, this collates the filtering results from all the sequencing run parts that have reads related to this library. It then outputs:
  - A histogram of the distribution of barcode abundances
  - A summary table of filtering for that library
  - An index to be used by `sort`.

### 3. Sort reads according to their barcode of origin

This iterates over parts of libraries, optionally filtered by a list.

    python indrops.py project.yaml sort [--total-workers 1] [--worker-index 0]
          [-l --libraries LIBRARIES]

    # --libraries comma-separated list of libraries :    If specified, step will be restricted to library-run-parts
    #                                                    that contain reads from a library in the list
    #
    # The resulting workload (a list of library-run-parts) will be split among N --total-workers,
    # where the worker with --worker-index i will do parts (i, N+i, 2N+i, ...)
    #
    #    *Note* This step is currently memory intensive, as it loads an entire library-run-part into memory.

This sorts the reads according to the name of their barcode of origin. Barcodes with fewer than 250 total reads (across all library-run-parts) are ignored and placed at the end of the file.

As output, this creates a gzipped FastQ file and an index of the byte offsets for every barcode with more than 250 reads.

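A hypothetical dispatch of this step, restricted to one library (name assumed) and split across two workers:

    python indrops.py project.yaml sort --libraries my_library --total-workers 2 --worker-index 0
    python indrops.py project.yaml sort --libraries my_library --total-workers 2 --worker-index 1
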
### 4. Quantify expression

This iterates over a list of barcodes, from an optionally filtered list of libraries.

    python indrops.py project.yaml quantify [--total-workers 1] [--worker-index 0]
            [-l --libraries LIBRARIES] [-r --runs RUNS ]
            [--min-reads 750] [--min-counts 0]
            [--analysis-prefix '']
            [--no-bam]

    # --min-reads INT :                                 Ignore barcodes with fewer than the specified number of reads.
    # --min-counts INT :                                Ignore output for barcodes with fewer than the specified number
    #                                                   of UMIFM counts. This significantly speeds up
    #                                                   downstream processing.
    # --analysis-prefix STR :                           Prefix output data files with the specified prefix.
    #                                                   (filename --> prefix.filename)
    # --no-bam :                                        If specified, do not output or process BAM files.
    #
    # --libraries comma-separated list of libraries :   If specified, step will be restricted to libraries
    #                                                   in this list.
    # --runs comma-separated list of runs :             If specified, only align reads coming from these runs.
    #                                                   [This is an uncommon use case.]
    #
    # The resulting list of barcodes will be split among --total-workers, with each worker identified by --worker-index.
    #    *Note* This step requires ~2Gb of memory.

The reads from every barcode are aligned using Bowtie and quantified to UMIFM counts. The counts and quantification metrics are written to a single file per worker.

The alignments themselves are modified to include relevant metadata and written to a BAM file. This BAM file is converted from transcriptome to genome coordinates using `rsem-tbam2gbam`, then sorted and indexed using `samtools`. At the end of this process, the BAMs for all the barcodes that were processed by a worker are merged and indexed (using `samtools`).

This step is resumable. If the same --analysis-prefix/--total-workers/--worker-index combination was previously run, another run will only quantify barcodes that were not previously quantified (or whose data was lost). To force requantification, delete files in /project_dir/library_dir/quant_dir/[prefix.]worker\*_[total_workers]\*

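For instance, a hypothetical forced requantification for a single library, instantiating the template above with an assumed prefix of `test` and 20 total workers (adjust all paths to your project):

    # hypothetical paths: project dir, library dir and prefix are placeholders
    rm /path/to/project/dir/my_library/quant_dir/test.worker*_20*
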
### 5. Aggregate quantification results

This iterates over a list of libraries.

    python indrops.py project.yaml aggregate --total-workers 1
            [-l --libraries LIBRARIES]
            [--analysis-prefix '']

    # --total-workers INT :                             Total number of workers used in the quantification step.
    # --analysis-prefix STR :                           Look for quantification data files with the specified prefix.
    #                                                   (prefix.filename)
    #
    # --libraries comma-separated list of libraries :   If specified, step will be restricted to libraries
    #                                                   in this list.
    #
    # The resulting list of libraries will be split among --total-workers, with each worker identified by --worker-index.

The UMIFM counts and quantification metrics from all workers are concatenated into single files. These files are gzipped.
If present, the BAMs from every worker are merged and indexed (using `samtools`).

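For example, if quantification was previously dispatched with a hypothetical `--total-workers 20`, aggregation must be given the same worker count so it can locate every worker's output files:

    python indrops.py project.yaml aggregate --total-workers 20
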
## Appendix 1: Parallelizing the analysis steps

### Analyzing only parts of the project

Most parts of the analysis can be filtered by specifying a list of sequencing runs, a list of libraries, or both. When a filter is provided, the analysis will only be carried out on data matching the filter.

Every part of the analysis can be filtered based on both libraries and sequencing runs.

    # Will filter all parts from runs Run5 and Run6:
    python indrops.py test_project.yaml filter --runs Run5,Run6

    # Will sort all parts from all runs of libraries test_lib3 and test_lib4:
    python indrops.py test_project.yaml sort --libraries test_lib3,test_lib4

### Dividing the analysis between jobs

Most parts of the analysis can easily be divided for concurrent processing in different jobs, by specifying the total number of jobs (--total-workers) and the index of the current worker (--worker-index).

    # Submitting the command below 20 times, with --worker-index ranging from 0 to 19,
    # would divide the filtering of all run parts within the project among 20 jobs.
    python indrops.py test_project.yaml filter --total-workers 20 --worker-index [0-19]

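A minimal sketch of such a dispatch as a local shell loop (on a cluster, each command would instead be submitted as its own job):

    for i in $(seq 0 19); do
        python indrops.py test_project.yaml filter --total-workers 20 --worker-index $i &
    done
    wait   # block until all 20 workers have finished
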
## Appendix 2: Using a custom Python environment on the Orchestra Cluster

### How to use an existing environment

#### Option 1 (simpler, but uglier):
Using the example installed with the instructions below, in "/groups/klein/adrian/pyndrops":

    source /groups/klein/adrian/miniconda/bin/activate /groups/klein/adrian/pyndrops

(To make this line shorter, prepend "/groups/klein/adrian/miniconda/bin/" to PATH.)

The terminal prompt should now be prefixed with (/groups/klein/adrian/pyndrops), e.g. `(/groups/klein/adrian/pyndrops)av79@loge:~$` (this is the ugly part).

Check that the correct python executable is being used:

    which python

returns:

    /groups/klein/adrian/pyndrops/bin/python

Test that all packages can be imported:

    python -c "import yaml; import pysam; import pyfasta; import numpy; import matplotlib; print('Nice work. All packages loaded correctly.')"
    # Should print:
    # Nice work. All packages loaded correctly.

#### Option 2 (slightly harder, but shorter prefix):
Using the example above, installed in "/groups/klein/adrian/pyndrops".
This assumes you prepended "/groups/klein/adrian/miniconda/bin/" to PATH.

Add "/groups/klein/adrian" as a place to search for python environments, with:

    conda config --add envs_dirs '/groups/klein/adrian/'

Then, you only need:

    source activate pyndrops

Run the same tests as above.

Now, `python` should refer to this specific python installation!

### How to install conda and create a new environment

Download Miniconda (the Anaconda package manager, without all the packages):

    mkdir -pv /user_owned/path
    cd /user_owned/path
    wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh

Install Miniconda:

    bash Miniconda-latest-Linux-x86_64.sh
    # Agree to the license with "yes", and choose to install in a directory that is user-owned.
    # I installed it in: /groups/klein/adrian/miniconda

Create a new Python environment (in this example, in /groups/klein/adrian/pyndrops) and install Python 2.7, Numpy, Scipy, Pandas, Matplotlib, PyYaml and PySAM:

    conda create -p /groups/klein/adrian/pyndrops python numpy scipy pandas pyyaml matplotlib pip
    source activate /groups/klein/adrian/pyndrops
    pip install pyfasta pysam==0.9.1

## Appendix 3: Installing Samtools 1.3.1

    mkdir -pv SAMTOOLS_DIR
    cd SAMTOOLS_DIR
    wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
    tar xvfj samtools-1.3.1.tar.bz2
    cd samtools-1.3.1
    make
    make prefix=. install

Now add `SAMTOOLS_DIR/samtools-1.3.1/bin/` as the `samtools_dir` in your project YAML file.
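
To verify that the freshly built binary is the right release (samtools reports its release with `--version`):

    SAMTOOLS_DIR/samtools-1.3.1/bin/samtools --version | head -1   # should report samtools 1.3.1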