## Project YAML file

An inDrops project is composed of a series of sequencing runs, each including one (or several) inDrops libraries within it. A sequencing run can further be split into several parts (effectively arbitrary chunks) to parallelize the analysis. For example, the per-lane files of a NextSeq run (L001, L002, L003, L004) can each be treated as a separate part.

The project YAML file contains the details of all sequencing runs and libraries within a project.

The same project can contain runs from different versions of the inDrops platform.

A project will be aligned against the same reference genome with the same alignment parameters.

## Supported library versions
 - v1 : original design, where R2 is the biological read and R1 is the metadata read.
 - v2 : inversion of v1, where R1 is the biological read and R2 is the metadata read.
 - v3 : summer 2016 redesign requiring manual demultiplexing. R1 is the biological read.
   R2 carries the first half of the gel barcode, R3 carries the library index and R4
   carries the second half of the gel barcode, the UMI and a fraction of the polyA tail.

## Installation
The package requires
 - Python 2.7 (with the packages numpy, scipy, matplotlib, pysam>0.9.0, pyyaml, pyfasta). [See Appendix 2]
 - RSEM (1.2.19+)
 - Bowtie (1.1.1+)
 - samtools (1.3.1+) [See Appendix 3] (This specific version is needed to account for a BAM-format oddity in RSEM output.)
 - Java

The paths to the directories containing these executables should be set in the project YAML.
If the executables can be found through the PATH environment variable, the corresponding project YAML paths can be left empty or omitted.

### March 7th notice -- PySAM version

The previous installation instructions installed PySAM version 0.6.0. To install the correct PySAM version, use the following commands:

    conda remove pysam
    conda install pip
    pip install pysam==0.9.1

## Project YAML file

An example YAML file is provided in `test/test_project.yaml`. It should contain the following information:

    project_name : "project_name"
    project_dir : "/path/to/project/dir" # This dir should be user-owned and writable; all output will go into this dir.
    paths :
      bowtie_index : "/path/to/index" # This index will be built automatically.
      # The paths below can be omitted if the relevant directories are already on $PATH.
      bowtie_dir : "/path/to/bowtie/dir/"
      python_dir : "/path/to/env/bins/"
      java_dir : "/path/to/java/dir/"
      rsem_dir : "/path/to/rsem/dir/"
      samtools_dir : "/path/to/samtools-1.3.1/bin/" # This needs to be version 1.3.1; 1.3 is not good enough!

    sequencing_runs :
      # A list of sequencing runs which form the project.
      # Each run should have:
      - name : "MyRun" # The name of the run will be used as a prefix in filenames, so keep it sane.
        version : "vN" # Can be 'v1', 'v2' or 'v3'.

        # For a run with a single 'part' and a single library:
        dir : "/path/to/run_files/"
        fastq_path : "{read}.fastq.gz" # {read} will be replaced by R1, R2, R3, R4 as appropriate.
        library_name : "my_library"

        # This will expect to find the files:
        #   /path/to/run_files/R1.fastq.gz (and R2...)

        # For a run with several parts, but a single library:
        dir : "/path/to/run_files/"
        fastq_path : "{split_affix}_{read}.fastq.gz" # {split_affix} will be replaced by the values of split_affixes.
        split_affixes : ["L001", "L002"]
        library_name : "my_library"

        # This will expect to find the files:
        #   /path/to/run_files/L001_R1.fastq.gz (and R2...)
        #   /path/to/run_files/L002_R1.fastq.gz (and R2...)

        # For a run with several parts and several libraries that have already been demultiplexed:
        dir : "/path/to/run_files/"
        fastq_path : "{library_prefix}_{split_affix}_{read}.fastq.gz"
        split_affixes : ["L001", "L002"]
        libraries :
          - {library_name: "test_lib1", library_prefix: "lib1"}
          - {library_name: "test_lib2", library_prefix: "lib2"}

        # This will expect to find the files:
        #   /path/to/run_files/lib1_L001_R1.fastq.gz (and R2...)
        #   /path/to/run_files/lib1_L002_R1.fastq.gz (and R2...)
        #   /path/to/run_files/lib2_L001_R1.fastq.gz (and R2...)
        #   /path/to/run_files/lib2_L002_R1.fastq.gz (and R2...)

        # For a v3 run with several parts and several libraries that are not already demultiplexed:
        dir : "/path/to/run_files/"
        fastq_path : "{split_affix}_{read}.fastq.gz" # No {library_prefix} here: the files are not yet demultiplexed.
        split_affixes : ["L001", "L002", "L003", "L004"]
        libraries : # library_index is the expected sequence of the index read (on a NextSeq, this is the reverse complement of the sample-sheet index sequence).
          - {library_name: "test_lib3", library_index: "ATAGAG"}
          - {library_name: "test_lib4", library_index: "AGAGGA"}

        # This will expect to find the files:
        #   /path/to/run_files/L001_R1.fastq.gz (and R2, R3, R4...)
        #   /path/to/run_files/L002_R1.fastq.gz (and R2, R3, R4...)
        #   /path/to/run_files/L003_R1.fastq.gz (and R2, R3, R4...)
        #   /path/to/run_files/L004_R1.fastq.gz (and R2, R3, R4...)
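
To make the `fastq_path` templating concrete, here is a minimal Python sketch of how the placeholders expand into the expected file names. The function `expected_fastqs` is illustrative only (it is not part of `indrops.py`) and assumes plain `str.format` substitution:

    import os

    def expected_fastqs(run_dir, fastq_path, reads, split_affixes=None, libraries=None):
        """Illustrative expansion of a fastq_path template into concrete file paths."""
        for lib in (libraries or [{}]):
            for affix in (split_affixes or [None]):
                for read in reads:
                    yield os.path.join(run_dir, fastq_path.format(
                        read=read,
                        split_affix=affix,
                        library_prefix=lib.get("library_prefix", "")))

    # The demultiplexed two-part, two-library example above:
    for path in expected_fastqs("/path/to/run_files/",
                                "{library_prefix}_{split_affix}_{read}.fastq.gz",
                                reads=["R1", "R2"],
                                split_affixes=["L001", "L002"],
                                libraries=[{"library_prefix": "lib1"},
                                           {"library_prefix": "lib2"}]):
        print(path)
    # /path/to/run_files/lib1_L001_R1.fastq.gz
    # /path/to/run_files/lib1_L001_R2.fastq.gz
    # ... and so on for lib1_L002, lib2_L001, lib2_L002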

#### Note about v3 runs
The raw BCL files are needed for manual demultiplexing. Move the raw BCL files to a run directory, then use the following command to extract the R1, R2, R3 and R4 files:

    cd /run/dir/
    bcl2fastq --use-bases-mask y*,y*,y*,y* --mask-short-adapter-reads 0 --minimum-trimmed-read-length 0
    # The 'dir' used in the project YAML file should then be:
    #   /run/dir/Data/Intensities/BaseCalls/
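
For intuition about what the subsequent demultiplexing of a v3 run has to do, here is a toy Python 2.7 sketch that pairs the biological read (R1) with the index read (R3) and assigns each read to a library by exact match against `library_index`. The file names and the exact-match rule are assumptions for illustration, not the pipeline's actual implementation:

    import gzip
    from itertools import izip  # Python 2.7, as required by the pipeline

    # Expected index-read sequences, from the project YAML example above.
    libraries = {"ATAGAG": "test_lib3", "AGAGGA": "test_lib4"}

    def fastq_records(path):
        """Yield (header, sequence) pairs from a gzipped FastQ file."""
        with gzip.open(path) as f:
            while True:
                header = f.readline().rstrip()
                if not header:
                    return
                seq = f.readline().rstrip()
                f.readline()  # the '+' separator line
                f.readline()  # the quality line
                yield header, seq

    # Hypothetical file names for one part of the run:
    bio = fastq_records("L001_R1.fastq.gz")
    idx = fastq_records("L001_R3.fastq.gz")
    for (h1, bio_seq), (h3, index_seq) in izip(bio, idx):
        library = libraries.get(index_seq[:6])  # exact match on the 6bp index
        if library is None:
            continue  # read cannot be assigned to any library
        # ...route this read (and its R2/R4 mates) to `library`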
    wget ftp://ftp.ensembl.org/pub/release-85/gtf/homo_sapiens/Homo_sapiens.GRCh38.85.gtf.gz

    # This command will go through all the steps for creating the index
    python indrops.py project.yaml build_index \
        --genome-fasta-gz DOWNLOAD_DIR/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
        --ensembl-gtf-gz DOWNLOAD_DIR/Homo_sapiens.GRCh38.85.gtf.gz

#### Example creation of a mouse index, using ENSEMBL release 85

    mkdir -pv DOWNLOAD_DIR
    cd DOWNLOAD_DIR

    # Download the soft-masked, primary assembly Genome Fasta file
    wget ftp://ftp.ensembl.org/pub/release-85/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz

    # Download the corresponding GTF file
    wget ftp://ftp.ensembl.org/pub/release-85/gtf/mus_musculus/Mus_musculus.GRCm38.85.gtf.gz

    # This command will go through all the steps for creating the index
    python indrops.py project.yaml build_index \
        --genome-fasta-gz DOWNLOAD_DIR/Mus_musculus.GRCm38.dna_sm.primary_assembly.fa.gz \
        --ensembl-gtf-gz DOWNLOAD_DIR/Mus_musculus.GRCm38.85.gtf.gz

### 1. Filter
This step iterates over sequencing run parts, optionally restricted to a list of runs and/or a list of libraries of interest.

    python indrops.py project.yaml filter [--total-workers 1] [--worker-index 0]
        [-r --runs RUNS] [-l --libraries LIBRARIES]

    # --runs comma-separated list of runs : If specified, the step will be restricted to run parts
    #       coming from runs in the list.
    # --libraries comma-separated list of libraries : If specified, the step will be restricted to run parts
    #       that contain reads from a library in the list.
    #
    # The resulting workload (a list of run parts) will be split among N --total-workers,
    # where the worker with --worker-index i will do parts (i, N+i, 2N+i, ...).

This step reads the raw FastQ files as input and filters them:
  - For every raw read, it determines whether the read has the expected structure (depending on library version).
  - For reads with the correct structure, it runs Trimmomatic.
  - For reads surviving Trimmomatic, it finds and trims the polyA tail (leaving at most 4 A's), and checks that the reads are still above MIN_LEN.
  - For surviving reads, it determines what fraction of the read is composed of runs of the same base (only considering runs of 5 or more),
    and rejects reads whose fraction is greater than `low_complexity_filter_arguments:max_low_complexity_fraction` (see the sketch after this section).

As output, for every input run part, this produces one filtered FastQ file per library contained in that run. These files are referred to as 'parts of libraries'.

A log is created detailing what happened to every input read. An index is created that lists the number of reads found for every barcode.
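
The low-complexity check in the last step above can be made concrete in a few lines of Python. This is an illustrative reimplementation, not the pipeline's actual code; the threshold value stands in for `low_complexity_filter_arguments:max_low_complexity_fraction`:

    from itertools import groupby

    def low_complexity_fraction(seq, min_run=5):
        """Fraction of the read contained in single-base runs of length >= min_run."""
        run_lengths = [sum(1 for _ in group) for _, group in groupby(seq)]
        return sum(r for r in run_lengths if r >= min_run) / float(len(seq))

    # A read is rejected when its fraction exceeds the configured maximum:
    max_low_complexity_fraction = 0.5  # stand-in for the YAML parameter
    read = "ACGTACGAAAAAAAAAAAGGGGGTACGT"
    reject = low_complexity_fraction(read) > max_low_complexity_fraction
    print(reject)  # True: 16 of this read's 28 bases sit in runs of length >= 5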

### 2. Identify abundant barcodes
This step iterates over libraries, optionally restricted to a list.

    python indrops.py project.yaml identify_abundant_barcodes [--total-workers 1] [--worker-index 0]
        [-l --libraries LIBRARIES]

    # --libraries comma-separated list of libraries : If specified, the step will be restricted to libraries in this list.
    #
    # The resulting workload (a list of libraries) will be split among N --total-workers,
    # where the worker with --worker-index i will do libraries (i, N+i, 2N+i, ...).
    #
    # *Note* This step is fast; it does not need to be dispatched to several workers.

For each library, this collates the filtering results of all the sequencing run parts that contain reads from this library. It then outputs:
  - a histogram of the distribution of barcode abundances,
  - a summary table of filtering for that library,
  - an index to be used by `sort`.

### 3. Sort reads according to their barcode of origin
This step iterates over parts of libraries, optionally restricted to a list.

    python indrops.py project.yaml sort [--total-workers 1] [--worker-index 0]
        [-l --libraries LIBRARIES]

    # --libraries comma-separated list of libraries : If specified, the step will be restricted to library-run-parts
    #       that contain reads from a library in the list.
    #
    # The resulting workload (a list of library-run-parts) will be split among N --total-workers,
    # where the worker with --worker-index i will do parts (i, N+i, 2N+i, ...).
    #
    # *Note* This step is currently memory intensive, as it loads an entire library-run-part into memory.

This sorts the reads according to the name of their barcode of origin. Barcodes with fewer than 250 total reads (across all library-run-parts) are ignored, and their reads are placed at the end of the file.

As output, this creates a gzipped FastQ file and an index of the byte offsets of every barcode with more than 250 reads.
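
The sort described above can be sketched in miniature. This in-memory illustration is not the pipeline's implementation (which streams to a gzipped FastQ file and records a byte offset per barcode); the data structures are assumptions:

    from collections import defaultdict

    ABUNDANT_THRESHOLD = 250  # minimum total reads for a barcode to be indexed

    def sort_by_barcode(reads, total_counts):
        """reads: iterable of (barcode, record) pairs for one library-run-part.
        total_counts: reads per barcode across *all* library-run-parts.
        Returns records grouped by barcode, abundant barcodes first and
        rare barcodes bunched at the end (mirroring the output file layout)."""
        by_barcode = defaultdict(list)
        for barcode, record in reads:
            by_barcode[barcode].append(record)
        abundant = sorted(b for b in by_barcode if total_counts[b] >= ABUNDANT_THRESHOLD)
        rare = sorted(b for b in by_barcode if total_counts[b] < ABUNDANT_THRESHOLD)
        ordered = []
        for barcode in abundant + rare:
            ordered.extend(by_barcode[barcode])
        return ordered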

### 4. Quantify expression
This step iterates over the barcodes of an optionally restricted list of libraries.

    python indrops.py project.yaml quantify [--total-workers 1] [--worker-index 0]
        [-l --libraries LIBRARIES] [-r --runs RUNS]
        [--min-reads 750] [--min-counts 0]
        [--analysis-prefix '']
        [--no-bam]

    # --min-reads INT : Ignore barcodes with fewer than the specified number of reads.
    # --min-counts INT : Ignore output for barcodes with fewer than the specified number
    #       of UMIFM counts. This significantly speeds up downstream processing.
    # --analysis-prefix STR : Prefix output data files with the specified prefix.
    #       (filename --> prefix.filename)
    # --no-bam : If specified, do not output and process BAM files.
    #
    # --libraries comma-separated list of libraries : If specified, the step will be restricted to libraries
    #       in this list.
    # --runs comma-separated list of runs : If specified, only align reads coming from these runs.
    #       [This is an uncommon use case.]
    #
    # The resulting list of barcodes will be split among --total-workers, with each worker identified by --worker-index.
    # *Note* This step requires ~2Gb of memory.

The reads from every barcode are aligned with Bowtie and quantified to UMIFM counts. The counts and quantification metrics are written to a single file per worker.

The alignments themselves are modified to include relevant metadata and written to a BAM file. This BAM file is converted from transcriptome to genome coordinates using `rsem-tbam2gbam`, then sorted and indexed using `samtools`. At the end of this process, the BAMs for all the barcodes that were processed by a worker are merged and indexed (again using `samtools`).

This step is resumable. If the same --analysis-prefix/--total-workers/--worker-index combination was previously run, a new run will only quantify barcodes that were not previously quantified (or whose data was lost). To force requantification, delete the files in /project_dir/library_dir/quant_dir/[prefix.]worker\*_[total_workers]\*

### 5. Aggregate quantification results
This step iterates over a list of libraries.

    python indrops.py project.yaml aggregate --total-workers 1
        [-l --libraries LIBRARIES]
        [--analysis-prefix '']

    # --total-workers INT : Total number of workers used in the quantification step.
    # --analysis-prefix STR : Look for quantification data files with the specified prefix.
    #       (prefix.filename)
    #
    # --libraries comma-separated list of libraries : If specified, the step will be restricted to libraries
    #       in this list.
    #
    # The resulting list of libraries will be split among --total-workers, with each worker identified by --worker-index.

The UMIFM counts and quantification metrics from all workers are concatenated into single files, which are gzipped.
If present, the BAMs from every worker are merged and indexed (using `samtools`).


## Appendix 1: Parallelizing the analysis steps

### Analyzing only parts of the project

Most steps of the analysis can be filtered by specifying a list of sequencing runs,
a list of libraries, or both. When a filter is provided, the analysis is
only carried out on data matching the filter.

    # Will filter all parts from runs Run5 and Run6:
    python indrops.py test_project.yaml filter --runs Run5,Run6

    # Will sort all parts from all runs of libraries test_lib3 and test_lib4:
    python indrops.py test_project.yaml sort --libraries test_lib3,test_lib4

### Dividing the analysis between jobs

Most steps of the analysis can easily be divided for concurrent processing in different jobs,
by specifying the total number of jobs (--total-workers) and the index of the current worker (--worker-index).

    # Submitting the 20 commands below would split the filtering of all run parts within the project among 20 workers:
    python indrops.py test_project.yaml filter --total-workers 20 --worker-index [0-19]
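
The --total-workers/--worker-index convention is the same in every step: the workload is a deterministic list, and worker i of N takes items (i, N+i, 2N+i, ...). In Python slice notation this is a one-liner (a sketch of the convention, not the pipeline's code):

    def worker_slice(workload, total_workers, worker_index):
        """Items assigned to one worker: (i, N+i, 2N+i, ...) of the full workload."""
        return workload[worker_index::total_workers]

    # e.g. splitting 7 run parts among 3 workers:
    parts = ["p0", "p1", "p2", "p3", "p4", "p5", "p6"]
    for i in range(3):
        print(worker_slice(parts, 3, i))
    # ['p0', 'p3', 'p6']
    # ['p1', 'p4']
    # ['p2', 'p5']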

## Appendix 2: Using a custom Python environment on the Orchestra Cluster

### How to use an existing environment

#### Option 1 (simpler, but uglier):
Using the example installed with the instructions below, in "/groups/klein/adrian/pyndrops":

    source /groups/klein/adrian/miniconda/bin/activate /groups/klein/adrian/pyndrops

(To make this line shorter, prepend "/groups/klein/adrian/miniconda/bin/" to PATH.)

The terminal prompt should now be prefixed with (/groups/klein/adrian/pyndrops), e.g.:

    (/groups/klein/adrian/pyndrops)av79@loge:~$

(This is the ugly part.)

Check that the correct python executable is being used:

    which python

returns:

    /groups/klein/adrian/pyndrops/bin/python

Test that all packages can be imported:

    python -c """import yaml; import pysam; import pyfasta; import numpy; import matplotlib; print('Nice work. All packages loaded correctly.')"""
    # Should print:
    # "Nice work. All packages loaded correctly."

#### Option 2 (slightly harder, but shorter prefix):
Using the example above, installed in "/groups/klein/adrian/pyndrops".
This assumes you prepended "/groups/klein/adrian/miniconda/bin/" to PATH.

Add "/groups/klein/adrian" as a place to search for python environments, with:

    conda config --add envs_dirs '/groups/klein/adrian/'

Then, you only need:

    source activate pyndrops

Run the same tests as above.

Now, "python" should refer to this specific python installation!

### How to install conda and create a new environment

Download Miniconda (the anaconda package manager, without all the packages):

    mkdir -pv /user_owned/path
    cd /user_owned/path
    wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh

Install Miniconda:

    bash Miniconda-latest-Linux-x86_64.sh
    # Agree to the license with "yes", and choose to install in a directory that is user-owned.
    # I installed it in: /groups/klein/adrian/miniconda

Create a new Python environment (in this example, in /groups/klein/adrian/pyndrops) and
install Python 2.7, Numpy, Scipy, Pandas, Matplotlib, PyYaml and PySAM:

    conda create -p /groups/klein/adrian/pyndrops python numpy scipy pandas pyyaml matplotlib pip
    source activate /groups/klein/adrian/pyndrops
    pip install pyfasta pysam==0.9.1

## Appendix 3: Installing Samtools 1.3.1

    mkdir -pv SAMTOOLS_DIR
    cd SAMTOOLS_DIR
    wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2
    tar xvfj samtools-1.3.1.tar.bz2
    cd samtools-1.3.1
    make
    make prefix=. install

Now add `SAMTOOLS_DIR/samtools-1.3.1/bin/` as the `samtools_dir` in your project YAML file.
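
Since the samtools version matters (1.3.1, not 1.3), it can be worth verifying the binary behind `samtools_dir` before launching a run. A small Python sketch; the path below follows this appendix's example layout and is an assumption, not a fixed convention:

    import subprocess

    # The binary inside the samtools_dir configured in the project YAML (example path).
    samtools = "SAMTOOLS_DIR/samtools-1.3.1/bin/samtools"

    # `samtools --version` reports e.g. "samtools 1.3.1" on its first line.
    first_line = subprocess.check_output([samtools, "--version"]).splitlines()[0]
    print(first_line)
    assert "1.3.1" in first_line, "indrops needs samtools 1.3.1 (1.3 is not enough)"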