# ML-Final-Project

## Introduction

This project is inspired by the publication "Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers"
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/).

The data used for this analysis are Illumina HiSeq 2500 RNA-seq reads, available from the NCBI SRA under project ID SRP117020. Samples were obtained from 130 patients diagnosed with non-small cell lung cancer (NSCLC).
The dataset covers poorly to well differentiated adenocarcinomas and squamous cell cancers.

The reference genome used for alignment can be found here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/.

### Project Overview

```mermaid
graph TD;
    RNA_seq_FASTQs-->Reference_Genome_Alignment_HISAT2;
    RNA_seq_FASTQs-->FastQC;
    Reference_Genome_Alignment_HISAT2-->Transcript_Quantification/Merging_StringTie;
    Transcript_Quantification/Merging_StringTie-->Differential_Expression_DESeq2;
    Differential_Expression_DESeq2-->Pathway_Analysis_KOBAS;
    Differential_Expression_DESeq2-->SCLC/NSCLC_Gene_Classification;
    Pathway_Analysis_KOBAS-->SCLC/NSCLC_Gene_Classification;
    Differential_Expression_DESeq2-->Dataset_Annotation;
    Pathway_Analysis_KOBAS-->Dataset_Annotation;
    Dataset_Annotation-->Data_Preprocessing;
    Data_Preprocessing-->Machine_Learning_Models;
    Machine_Learning_Models-->Logistic_Regression;
    Machine_Learning_Models-->Random_Forest;
    Machine_Learning_Models-->Gradient_Boost;
    Machine_Learning_Models-->AdaBoost;
    Machine_Learning_Models-->KNN;
```

#### Data Preparation (Part I)

Preparing the data for the ML algorithms involves a pipeline of bioinformatics tools:
```mermaid
graph TD;
    SRA_accessions.txt-->SRA-Toolkits/prefetch;
    SRA-Toolkits/prefetch-->.sra_files;
    .sra_files-->fastq-dump;
    fastq-dump-->untrimmed_fastqs;
    untrimmed_fastqs-->FastQC;
    FastQC-->MultiQC;
    MultiQC-->untrimmed_quality_report;
    untrimmed_fastqs-->fastp;
    fastp-->trimmed_fastq_files;
    trimmed_fastq_files-->FastQC2;
    FastQC2-->MultiQC2;
    MultiQC2-->trimmed_quality_report;
    trimmed_fastq_files-->HISAT2;
```

* To retrieve all samples with the SRA Toolkit's `prefetch`:
```
while read -r accession; do
  prefetch "$accession"
done < SRA_accessions.txt
```
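
Downloads of many runs can be interrupted, so it may help to make the loop resumable. The sketch below is one possible approach, not part of the original workflow. `pending_accessions` and `fetch_pending` are hypothetical helper names, and the code assumes `prefetch` stores each run as `ACC/ACC.sra` in the current directory, which is the layout the conversion loop in the next step relies on.

```shell
# Print the accessions in list file $1 that have no ACC/ACC.sra on disk yet.
# Blank lines, '#' comments, and stray whitespace (including CR) are ignored.
pending_accessions() {
    sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1" |
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -f "$acc/$acc.sra" ] || printf '%s\n' "$acc"
    done
}

# Download only the missing runs; report failures but keep going.
fetch_pending() {
    pending_accessions "$1" | while IFS= read -r acc; do
        prefetch "$acc" || echo "prefetch failed for $acc" >&2
    done
}
```

Calling `fetch_pending SRA_accessions.txt` would then fetch only what is missing, so the command can be re-run after a failure.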

* To convert each paired-end sample's `.sra` file to gzipped FASTQ files:
```
for dir in SRR*/; do
    acc="${dir%/}"    # strip the trailing slash left by the glob
    echo "Processing $acc"
    fastq-dump --split-files --gzip "$acc/$acc.sra"
done
```
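
Before trimming, it can be worth confirming that every run produced both mate files. The following is a minimal sketch under the assumption that the FASTQ files were written to the current directory with the `ACC_1.fastq.gz` / `ACC_2.fastq.gz` names that `--split-files --gzip` produces; `missing_mates` is a hypothetical helper name.

```shell
# List accessions (taken from the SRR*/ directory names) for which either
# mate file is absent or empty.
missing_mates() {
    for dir in SRR*/; do
        [ -d "$dir" ] || continue          # the glob matched nothing
        acc="${dir%/}"
        if [ ! -s "${acc}_1.fastq.gz" ] || [ ! -s "${acc}_2.fastq.gz" ]; then
            printf '%s\n' "$acc"
        fi
    done
}
```

An empty result from `missing_mates` suggests the conversion completed for every downloaded run; any accession it prints is a candidate for re-running `fastq-dump`.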

* To trim all fastq samples:
```
# fastq-dump --gzip writes .fastq.gz files, which fastp reads directly
mkdir -p ../FASTQ_SAMPLES_TRIMMED
for file in *_1.fastq.gz; do
    base=$(basename "$file" "_1.fastq.gz")
    fastp -i "${base}_1.fastq.gz" -I "${base}_2.fastq.gz" \
          -o "../FASTQ_SAMPLES_TRIMMED/${base}_1_trimmed.fq" -O "../FASTQ_SAMPLES_TRIMMED/${base}_2_trimmed.fq"
done
```
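
The pipeline diagram includes FastQC and MultiQC reports for both the untrimmed and trimmed reads, but no commands are given for them. A possible sketch follows. The helper names `run` and `qc_report` and the output directory names are assumptions; only the report names come from the diagram. Setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Run a command, or only print it when DRY_RUN is set (useful for checking paths).
run() {
    if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# qc_report OUT_DIR REPORT_NAME FASTQ...
# Runs FastQC on the given files, then summarises the results with MultiQC.
qc_report() {
    out_dir="$1"; name="$2"; shift 2
    mkdir -p "$out_dir"                    # FastQC expects the directory to exist
    run fastqc --outdir "$out_dir" "$@" &&
    run multiqc "$out_dir" --outdir "$out_dir" --filename "$name"
}
```

For example, `qc_report qc_untrimmed untrimmed_quality_report *.fastq.gz` would cover the raw reads, and `qc_report qc_trimmed trimmed_quality_report ../FASTQ_SAMPLES_TRIMMED/*_trimmed.fq` the trimmed ones.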

* To align reads to the reference genome using the `alignment.sh` script:
```
chmod +x alignment.sh
./alignment.sh
```
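
The contents of `alignment.sh` are not shown here. As a rough illustration only, a paired-end HISAT2 loop over the trimmed files could look like the sketch below. It is not the repository's script: the index basename `grch38_index`, the `alignments` output directory, and the function name `alignment_commands` are all hypothetical, and `--dta` is included on the assumption that the alignments feed StringTie, as in the overview diagram.

```shell
INDEX="${INDEX:-grch38_index}"                        # hypothetical index basename (from hisat2-build)
TRIMMED_DIR="${TRIMMED_DIR:-../FASTQ_SAMPLES_TRIMMED}"
OUT_DIR="${OUT_DIR:-alignments}"                      # hypothetical output directory

# Print one hisat2 command per sample that has both trimmed mate files.
alignment_commands() {
    for r1 in "$TRIMMED_DIR"/*_1_trimmed.fq; do
        [ -f "$r1" ] || continue                      # the glob matched nothing
        sample=$(basename "$r1" _1_trimmed.fq)
        r2="$TRIMMED_DIR/${sample}_2_trimmed.fq"
        [ -f "$r2" ] || { echo "skipping $sample: missing $r2" >&2; continue; }
        printf 'hisat2 --dta -x %s -1 %s -2 %s -S %s\n' \
            "$INDEX" "$r1" "$r2" "$OUT_DIR/$sample.sam"
    done
}
```

Running `alignment_commands` lists the commands for review. After `mkdir -p "$OUT_DIR"`, they could be executed with `alignment_commands | sh`, provided none of the paths contain spaces.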