ML-Final-Project

Introduction

The inspiration for this project comes from the publication: Novel Biomarker Prediction for Lung Cancer Using Random Forest Classifiers
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10126698/)

The data used for this analysis are Illumina HiSeq 2500 RNA-seq reads available from the NCBI SRA under project ID SRP117020. Samples were obtained from 130 patients diagnosed with non-small cell lung cancer (NSCLC).
The samples range from poorly to well-differentiated adenocarcinomas and squamous cell cancers.

The reference genome used for alignment can be found here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/.

Project Overview

graph TD;
    RNA_seq_FASTQs-->Reference_Genome_Alignment_HISAT2;
    RNA_seq_FASTQs-->FastQC;
    Reference_Genome_Alignment_HISAT2-->Transcript_Quantification/Merging_StringTie;
    Transcript_Quantification/Merging_StringTie-->Differential_Expression_DESeq2;
    Differential_Expression_DESeq2-->Pathway_Analysis_KOBAS;
    Differential_Expression_DESeq2-->SCLC/NSCLC_Gene_Classification;
    Pathway_Analysis_KOBAS-->SCLC/NSCLC_Gene_Classification;
    Differential_Expression_DESeq2-->Dataset_Annotation;
    Pathway_Analysis_KOBAS-->Dataset_Annotation;
    Dataset_Annotation-->Data_Preprocessing;
    Data_Preprocessing-->Machine_Learning_Models;
    Machine_Learning_Models-->Logistic_Regression;
    Machine_Learning_Models-->Random_Forest;
    Machine_Learning_Models-->Gradient_Boost;
    Machine_Learning_Models-->AdaBoost;
    Machine_Learning_Models-->KNN;

Data Preparation (Part I)

Preparation of the data for the ML algorithms includes a pipeline containing various bioinformatics tools:

graph TD;
    SRA_accessions.txt-->SRA-Toolkits/prefetch;
    SRA-Toolkits/prefetch-->.sra_files;
    .sra_files-->fastq-dump;
    fastq-dump-->untrimmed_fastqs;
    untrimmed_fastqs-->FastQC;
    FastQC-->MultiQC;
    MultiQC-->untrimmed_quality_report;
    untrimmed_fastqs-->fastp;
    fastp-->trimmed_fastq_files;
    trimmed_fastq_files-->FastQC2;
    FastQC2-->MultiQC2;
    MultiQC2-->trimmed_quality_report;
    trimmed_fastq_files-->HISAT2;

To retrieve all samples with the SRA Toolkit's prefetch:

# Download each run listed in SRA_accessions.txt (one accession per line)
while IFS= read -r accession; do
  prefetch "$accession"
done < SRA_accessions.txt
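
After an interrupted download it can be useful to see which accessions are still missing. The following is a minimal sketch, assuming each run is stored as `<accession>/<accession>.sra` under the current directory (the layout the conversion loop in the next step reads from). The function name `pending_accessions` is hypothetical and not part of the SRA Toolkit.

```shell
# Print the accessions from a list file that have no <acc>/<acc>.sra yet.
# Blank lines and lines starting with '#' are skipped.
# The body runs in a subshell, so its variables do not leak into the caller.
pending_accessions() (
    list="$1"
    while IFS= read -r acc || [ -n "$acc" ]; do
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        [ -f "$acc/$acc.sra" ] || printf '%s\n' "$acc"
    done < "$list"
)

# Example (assumes prefetch is on PATH):
#   pending_accessions SRA_accessions.txt | while IFS= read -r acc; do
#       prefetch "$acc"
#   done
```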

To convert all paired sample .sra files to fastq files:

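# Each prefetch download lives in its own SRR*/ directory
for dir in SRR*/; do
    acc="${dir%/}"    # strip the trailing slash to get the accession
    echo "Processing $acc"
    fastq-dump --split-files --gzip "$acc/$acc.sra"
done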
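
With `--split-files` and `--gzip`, each paired-end run should yield a `_1.fastq.gz` and a `_2.fastq.gz` file. Before trimming, it may be worth confirming that no sample is missing its second mate. This is a small sketch under that naming assumption. The function name `unpaired_samples` is hypothetical.

```shell
# Print the sample IDs whose *_1.fastq.gz has no matching *_2.fastq.gz.
# Takes the directory to scan as its first argument (default: current directory).
unpaired_samples() (
    dir="${1:-.}"
    for r1 in "$dir"/*_1.fastq.gz; do
        [ -e "$r1" ] || continue                 # the glob matched nothing
        base=$(basename "$r1" _1.fastq.gz)
        [ -e "$dir/${base}_2.fastq.gz" ] || printf '%s\n' "$base"
    done
)

# Example: list incomplete samples in the current directory
#   unpaired_samples .
```

An empty result means every first-mate file has a partner and the trimming loop can process all samples as pairs.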

To trim all fastq samples:

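# Inputs are gzip-compressed because fastq-dump was run with --gzip
mkdir -p ../FASTQ_SAMPLES_TRIMMED
for file in *_1.fastq.gz; do
    base=$(basename "$file" "_1.fastq.gz")
    fastp -i "${base}_1.fastq.gz" -I "${base}_2.fastq.gz" -o "../FASTQ_SAMPLES_TRIMMED/${base}_1_trimmed.fq" -O "../FASTQ_SAMPLES_TRIMMED/${base}_2_trimmed.fq"
done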
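
The pipeline diagram runs FastQC and MultiQC on both the untrimmed and the trimmed reads, but the commands are not shown. The following is a hedged sketch of how that step could look, not necessarily the commands used in this project. The helper names `run` and `qc_report` are hypothetical, and setting `DRY_RUN=1` prints the commands instead of executing them.

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run FastQC on every FASTQ file in a directory, then summarise with MultiQC.
qc_report() (
    fastq_dir="$1"    # directory holding the reads
    out_dir="$2"      # directory for the FastQC and MultiQC output
    run mkdir -p "$out_dir"
    for fq in "$fastq_dir"/*.fastq* "$fastq_dir"/*.fq*; do
        [ -e "$fq" ] || continue                 # skip globs that matched nothing
        run fastqc --outdir "$out_dir" "$fq"
    done
    run multiqc "$out_dir" --outdir "$out_dir"
)

# Example (assumes fastqc and multiqc are on PATH):
#   qc_report . untrimmed_quality_report
#   qc_report ../FASTQ_SAMPLES_TRIMMED trimmed_quality_report
```

Comparing the two MultiQC reports side by side shows whether trimming with fastp improved per-base quality and adapter content.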

Align the trimmed reads to the reference genome using the alignment.sh script:

chmod +x alignment.sh
./alignment.sh
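
The contents of alignment.sh are not reproduced here. As a rough illustration only, a paired-end HISAT2 alignment loop over the trimmed files could look like the sketch below. The index base name `grch38_index`, the thread count, the output directory, and the helper names `run` and `align_trimmed_pairs` are all assumptions, not values taken from the script. It also assumes a HISAT2 index has already been built from the reference genome with `hisat2-build`.

```shell
# Execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Align every trimmed read pair in a directory with HISAT2.
# HISAT2_INDEX and THREADS are hypothetical settings with placeholder defaults.
align_trimmed_pairs() (
    in_dir="$1"                               # e.g. ../FASTQ_SAMPLES_TRIMMED
    out_dir="$2"                              # hypothetical directory for SAM output
    index="${HISAT2_INDEX:-grch38_index}"
    threads="${THREADS:-4}"
    run mkdir -p "$out_dir"
    for r1 in "$in_dir"/*_1_trimmed.fq; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        base=$(basename "$r1" _1_trimmed.fq)
        # --dta reports alignments tailored for transcript assemblers such as StringTie
        run hisat2 -p "$threads" --dta -x "$index" \
            -1 "$r1" -2 "$in_dir/${base}_2_trimmed.fq" \
            -S "$out_dir/$base.sam"
    done
)

# Example: preview the commands without running HISAT2
#   DRY_RUN=1 align_trimmed_pairs ../FASTQ_SAMPLES_TRIMMED ../ALIGNMENTS
```

The `--dta` option is included because the overview sends the alignments on to StringTie. Whether alignment.sh uses it is not stated in this document.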