# Mapping small RNA-seq

## Prepare genome annotation

For mapping small RNA-seq reads, exSEEK adopts a sequential mapping strategy that assigns reads to gene annotations one type at a time, in the order defined by the user.
By default, exSEEK assigns reads in the following order:

spike-in, rRNA, lncRNA, miRNA, mRNA, piRNA, snoRNA, snRNA, srpRNA, tRNA, tucpRNA, Y_RNA, genome, circRNA
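
To make the strategy concrete, the sketch below prints the chain of bowtie2 commands that sequential mapping implies, assuming that each step aligns only the reads left unmapped by the previous step. It is a dry run that only echoes commands and is not exSEEK's actual implementation: the function name `print_sequential_plan`, the input file name, and the `output/` and `idx/` paths are hypothetical.

```bash
#!/usr/bin/env bash
# Dry-run sketch of sequential mapping (hypothetical helper, not part of exSEEK).
# Prints one bowtie2 command per index prefix; nothing is executed.
# Usage: print_sequential_plan <reads.fastq> <out_dir> <index_prefix>...
print_sequential_plan() {
    local reads=$1 out_dir=$2
    shift 2
    local prefix rna_type
    for prefix in "$@"; do
        # Use the last path component of the index prefix as the RNA type name.
        rna_type=${prefix##*/}
        # --un writes the reads that failed to align; they feed the next step.
        # --no-unal keeps unaligned reads out of the SAM output.
        echo "bowtie2 -x ${prefix} -U ${reads}" \
             "--no-unal --un ${out_dir}/${rna_type}.unmapped.fastq" \
             "-S ${out_dir}/${rna_type}.sam"
        reads="${out_dir}/${rna_type}.unmapped.fastq"
    done
}

print_sequential_plan sample.fastq output idx/rRNA idx/lncRNA idx/miRNA
```

In the printed plan, the second command takes `output/rRNA.unmapped.fastq` as its input, so a read that aligns to an earlier type is never offered to a later one.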

We derived the genome annotation files from various sources:

| Type | Number of genes | Source |
| :--- | :--- | :--- |
| miRNA | 1917 | miRBase hairpin \(Version 22\) |
| piRNA | 23431 | piRNABank |
| lncRNA | 15778 | GENCODE V27 and MiTranscriptome |
| rRNA | 37 | NCBI RefSeq 109 |
| mRNA | 19836 | GENCODE V27 |
| snoRNA | 943 | GENCODE V27 |
| snRNA | 1900 | GENCODE V27 |
| srpRNA | 680 | GENCODE V27 |
| tRNA | 649 | GENCODE V27 |
| tucpRNA | 3734 | GENCODE V27 |
| Y\_RNA | 756 | GENCODE V27 |
| circRNA | 140527 | circBase |
| repeats | - | UCSC Genome Browser \(rmsk\) |
| promoter | - | ChromHMM tracks from 9 cell lines from UCSC Genome Browser |
| enhancer | - | ChromHMM tracks from 9 cell lines from UCSC Genome Browser |

Spike-in is a special type of genome annotation that the user must provide if spike-in sequences are used.

The paths of the FASTA files and bowtie2 index files:

| Type | FASTA file | bowtie2 index prefix |
| :--- | :--- | :--- |
| spike-in | `${genome_dir}/fasta/spikein_small.fa` | `${genome_dir}/index/bowtie2/spikein_small` |
| rRNA | `${genome_dir}/fasta/rRNA.fa` | `${genome_dir}/index/bowtie2/rRNA` |
| miRNA | `${genome_dir}/fasta/miRNA.fa` | `${genome_dir}/rsem_index/bowtie2/miRNA` |
| piRNA | `${genome_dir}/fasta/piRNA.fa` | `${genome_dir}/rsem_index/bowtie2/piRNA` |
| lncRNA | `${genome_dir}/fasta/lncRNA.fa` | `${genome_dir}/rsem_index/bowtie2/lncRNA` |
| mRNA | `${genome_dir}/fasta/mRNA.fa` | `${genome_dir}/rsem_index/bowtie2/mRNA` |
| snoRNA | `${genome_dir}/fasta/snoRNA.fa` | `${genome_dir}/rsem_index/bowtie2/snoRNA` |
| snRNA | `${genome_dir}/fasta/snRNA.fa` | `${genome_dir}/rsem_index/bowtie2/snRNA` |
| srpRNA | `${genome_dir}/fasta/srpRNA.fa` | `${genome_dir}/rsem_index/bowtie2/srpRNA` |
| tRNA | `${genome_dir}/fasta/tRNA.fa` | `${genome_dir}/rsem_index/bowtie2/tRNA` |
| tucpRNA | `${genome_dir}/fasta/tucpRNA.fa` | `${genome_dir}/rsem_index/bowtie2/tucpRNA` |
| Y_RNA | `${genome_dir}/fasta/Y_RNA.fa` | `${genome_dir}/rsem_index/bowtie2/Y_RNA` |
| circRNA | `${genome_dir}/fasta/circRNA.fa` | `${genome_dir}/rsem_index/bowtie2/circRNA` |

**Note**: `${genome_dir}` is the root directory of the genome annotation files.
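
A bowtie2 index is stored as several files that share one prefix (for example `<prefix>.1.bt2`, or `<prefix>.1.bt2l` for a large index), so the index paths in the table are prefixes rather than single files. The sketch below checks whether an index appears to exist for a given prefix. The function names `has_bowtie2_index` and `report_missing_indexes` are hypothetical, and the check only looks for the first forward index file rather than validating the whole index.

```bash
#!/usr/bin/env bash
# Return success if a bowtie2 index seems to exist for the given prefix.
# Hypothetical helper: it only looks for <prefix>.1.bt2 (small index)
# or <prefix>.1.bt2l (large index).
has_bowtie2_index() {
    [ -f "$1.1.bt2" ] || [ -f "$1.1.bt2l" ]
}

# Print every index prefix (given as arguments) that looks incomplete.
report_missing_indexes() {
    local prefix
    for prefix in "$@"; do
        has_bowtie2_index "$prefix" || echo "missing: ${prefix}"
    done
}
```

For example, `report_missing_indexes "${genome_dir}/index/bowtie2/rRNA" "${genome_dir}/rsem_index/bowtie2/miRNA"` prints one `missing:` line for each prefix that has no index files yet.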

### Build bowtie2 index for spike-in sequences

If your samples contain spike-in sequences, first prepare a FASTA file of the spike-in sequences and copy it to `${genome_dir}/fasta/spikein_small.fa`.
Then create a FASTA index file (`${genome_dir}/fasta/spikein_small.fa.fai`) with the following command:

```bash
samtools faidx ${genome_dir}/fasta/spikein_small.fa
```
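
The first two columns of a `.fai` file hold each sequence name and its length. As an optional sanity check on the spike-in FASTA file, the sketch below derives the same two columns directly from the FASTA with awk and lists duplicated sequence names. The function names `fasta_lengths` and `fasta_duplicate_names` are hypothetical; the sequence name is taken as the header text up to the first whitespace, and Unix line endings are assumed.

```bash
#!/usr/bin/env bash
# Print "name<TAB>length" for every record in a FASTA file (hypothetical helper).
fasta_lengths() {
    awk '
        /^>/ {
            # Emit the previous record before starting a new one.
            if (name != "") print name "\t" len
            # Sequence name: header text after the leading > up to the first whitespace.
            name = substr($1, 2); len = 0
            next
        }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Print sequence names that occur more than once (hypothetical helper).
fasta_duplicate_names() {
    fasta_lengths "$1" | cut -f1 | sort | uniq -d
}
```

Running `fasta_duplicate_names "${genome_dir}/fasta/spikein_small.fa"` should print nothing when all spike-in names are unique.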
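
Run the following commands to create the chromosome sizes file and the transcript table, and to build the bowtie2 index files for the spike-in sequences:

```bash
# Make sure the output directories exist
mkdir -p ${genome_dir}/chrom_sizes ${genome_dir}/transcript_table ${genome_dir}/index/bowtie2
# Chromosome sizes: sequence name and length from the FASTA index
cut -f1,2 ${genome_dir}/fasta/spikein_small.fa.fai > ${genome_dir}/chrom_sizes/spikein_small
# Transcript table: header line plus one record per spike-in sequence
{
    echo -e 'chrom\tstart\tend\tname\tscore\tstrand\tgene_id\ttranscript_id\tgene_name\ttranscript_name\tgene_type\ttranscript_type\tsource'
    awk 'BEGIN{OFS="\t";FS="\t"}{print $1,0,$2,$1,0,"+",$1,$1,$1,$1,"spikein","spikein","spikein"}' ${genome_dir}/fasta/spikein_small.fa.fai
} > ${genome_dir}/transcript_table/spikein_small.txt
# Build the bowtie2 index
bowtie2-build ${genome_dir}/fasta/spikein_small.fa ${genome_dir}/index/bowtie2/spikein_small
```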
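
To see what the `cut` and `awk` steps produce, the sketch below runs them on a toy FASTA index in a temporary directory, so neither samtools nor bowtie2 is needed. The two spike-in sequences (`spike_A`, `spike_B`) and their lengths and offsets are invented for illustration, and the output file names are hypothetical.

```bash
#!/usr/bin/env bash
# Toy demonstration with invented spike-in names and lengths.
workdir=$(mktemp -d)

# A .fai file has five tab-separated columns:
# name, length, offset, bases per line, bytes per line.
printf 'spike_A\t22\t9\t22\t23\nspike_B\t30\t41\t30\t31\n' > "$workdir/spikein_small.fa.fai"

# Chromosome sizes: the first two columns of the index.
cut -f1,2 "$workdir/spikein_small.fa.fai" > "$workdir/chrom_sizes.txt"

# Transcript table rows: each spike-in is its own gene and transcript,
# spanning the whole sequence on the plus strand.
awk 'BEGIN{OFS="\t";FS="\t"}{print $1,0,$2,$1,0,"+",$1,$1,$1,$1,"spikein","spikein","spikein"}' \
    "$workdir/spikein_small.fa.fai" > "$workdir/transcript_table.txt"

cat "$workdir/chrom_sizes.txt"
cat "$workdir/transcript_table.txt"
```

Each transcript table row has 13 tab-separated columns, matching the 13 names in the header line; for `spike_A` the row starts with `spike_A`, `0`, `22`.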