# AI-Genomics: Genetics with Artificial Intelligence Project

## Description

This project applies deep learning to genetic analysis using PyTorch. It was built by a team of three, mentored by a PhD geneticist from Universidad de los Andes and a Master's student in Applied Mathematics at Universidad Nacional de Colombia. The code analyzes DNA sequences and classifies them based on the motifs they contain.

## Requirements

- Python 3
- Jupyter Notebook (running it in Google Colab is recommended)

## Instructions to Run the Code

1. Clone the repository to your local machine and move into its directory:

   ```bash
   git clone https://github.com/anjimenezp/AI-Genetics.git
   cd AI-Genetics
   ```

## Project Overview

This repository contains the code for a genomics project that uses artificial intelligence to classify DNA sequences. It includes the following components:

### 1. Data Extraction:
- Gene sequence data is extracted from a CSV file using Pandas.
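
A minimal sketch of this step, assuming a CSV with two hypothetical columns named `sequence` and `label` (the actual file name and column names used in the notebook may differ). An in-memory string stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content; the real file and its column names may differ.
CSV_TEXT = """sequence,label
ATGCGTAC,promoter
GGCTTAAC,enhancer
ATGCCCGA,promoter
"""


def load_sequences(source):
    """Read sequences and labels from a CSV path or file-like object."""
    df = pd.read_csv(source)
    # Drop rows with a missing sequence or label before further processing.
    df = df.dropna(subset=["sequence", "label"])
    return df["sequence"].tolist(), df["label"].tolist()


sequences, labels = load_sequences(io.StringIO(CSV_TEXT))
```

To read from disk instead, pass a file path to `load_sequences`.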

### 2. Simulated Sequence Generation (commented out):
- The code can generate simulated DNA sequences, but this functionality is commented out and is not used in the main workflow.
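
As an illustration only, simulated sequences could be produced roughly as follows. The function name `simulate_sequence`, the planted motif, and the sequence length are assumptions, not the repository's commented-out code.

```python
import random

NUCLEOTIDES = "ACGT"


def simulate_sequence(length, motif=None, rng=None):
    """Return a random DNA string, optionally with a motif planted inside."""
    rng = rng or random.Random()
    bases = [rng.choice(NUCLEOTIDES) for _ in range(length)]
    if motif:
        if len(motif) > length:
            raise ValueError("motif is longer than the sequence")
        # Overwrite a randomly chosen window with the motif.
        start = rng.randrange(length - len(motif) + 1)
        bases[start:start + len(motif)] = motif
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the example is reproducible
with_motif = simulate_sequence(30, motif="TATAAA", rng=rng)
background = simulate_sequence(30, rng=rng)
```

Planting a known motif in one class and leaving the other as random background gives a simple sanity check for a motif-based classifier.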

### 3. Label Encoding:
- Sequence labels are converted to integer class indices using scikit-learn's `LabelEncoder`.
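
The snippet below sketches how `LabelEncoder` maps string labels to integers and back. The class names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names; the real dataset's labels may differ.
labels = ["promoter", "enhancer", "promoter", "silencer"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are indexed in sorted order
decoded = encoder.inverse_transform(encoded)  # map indices back to names
```

Keeping the fitted `encoder` around makes it possible to turn model predictions back into readable class names.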

### 4. One-Hot Encoding:
- DNA sequences are cleaned and converted to one-hot encoded tensors using PyTorch.
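
One possible way to do this is sketched below. It assumes that "cleaning" means uppercasing and dropping any character other than A, C, G, or T, and that tensors are laid out channels-first for a 1D convolution. The notebook's actual rules may differ.

```python
import torch
import torch.nn.functional as F

BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def clean_sequence(seq):
    """Uppercase the sequence and drop characters that are not A, C, G, or T."""
    return "".join(base for base in seq.upper() if base in BASE_TO_INDEX)


def one_hot_encode(seq):
    """Return a float tensor of shape (4, cleaned_length), channels first."""
    indices = torch.tensor(
        [BASE_TO_INDEX[base] for base in clean_sequence(seq)], dtype=torch.long
    )
    # F.one_hot yields (length, 4); transpose so the 4 bases act as channels.
    return F.one_hot(indices, num_classes=4).T.float()


encoded = one_hot_encode("acgtn-A")  # cleaned to "ACGTA"
```

Sequences of different lengths would still need padding or truncation before they can be stacked into a batch.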

### 5. Training Splits:
- The data is split into training, validation, and test sets for model training and evaluation.
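
A sketch of a three-way split using `torch.utils.data.random_split`. The 70/15/15 proportions, the seed, and the placeholder tensors are assumptions for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder data: 100 one-hot sequences of length 50 and 3 classes.
features = torch.zeros(100, 4, 50)
targets = torch.randint(0, 3, (100,))
dataset = TensorDataset(features, targets)

# Assumed 70/15/15 split; integer arithmetic avoids rounding surprises.
n_train = len(dataset) * 70 // 100
n_val = len(dataset) * 15 // 100
n_test = len(dataset) - n_train - n_val

generator = torch.Generator().manual_seed(42)  # fixed seed for reproducibility
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=generator
)
```

The validation set guides choices made during training, while the test set is held back for the final evaluation.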

### 6. DataLoader Preparation:
- PyTorch DataLoaders are prepared for efficient batch processing during training.
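
For reference, a `DataLoader` can be set up along these lines. The batch size of 16 and the placeholder tensors are assumed values.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 64 one-hot sequences of length 50.
features = torch.zeros(64, 4, 50)
targets = torch.zeros(64, dtype=torch.long)
train_set = TensorDataset(features, targets)

BATCH_SIZE = 16  # assumed value

# Shuffle training batches each epoch; keep evaluation order fixed.
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
eval_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=False)

batch_x, batch_y = next(iter(train_loader))
```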

### 7. CNN Model Definition:
- A Convolutional Neural Network (CNN) is defined for classifying DNA sequences.
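
The repository's exact architecture is not described here, so the following is only a small illustrative 1D CNN. The class name `SequenceCNN`, the filter count, and the kernel size are assumptions.

```python
import torch
from torch import nn


class SequenceCNN(nn.Module):
    """Illustrative 1D CNN over one-hot DNA input of shape (batch, 4, length)."""

    def __init__(self, num_classes, num_filters=32, kernel_size=8):
        super().__init__()
        # Each filter scans the sequence and can act as a motif detector.
        self.conv = nn.Conv1d(4, num_filters, kernel_size)
        # Global max pooling keeps the strongest response per filter.
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(x))
        x = self.pool(x).squeeze(-1)  # (batch, num_filters)
        return self.fc(x)  # raw class scores (logits)


model = SequenceCNN(num_classes=3)
logits = model(torch.zeros(2, 4, 50))  # input length must be >= kernel_size
```

Global max pooling lets this sketch accept sequences of any length at or above the kernel size.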

### 8. Training Loop Functions:
- Functions are defined for the training and validation loops.
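
A common shape for such functions is sketched below. The function names, the stand-in linear model, the random data, and the choice of Adam with cross-entropy loss are all assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, loss_fn, optimizer):
    """Run one optimization pass and return the mean loss per sample."""
    model.train()
    total = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item() * x.size(0)
    return total / len(loader.dataset)


def validate(model, loader, loss_fn):
    """Return the mean loss per sample without updating the weights."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * x.size(0)
    return total / len(loader.dataset)


# Tiny stand-in model and random data, just to exercise the functions.
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 20, 2))
data = TensorDataset(torch.randn(32, 4, 20), torch.randint(0, 2, (32,)))
loader = DataLoader(data, batch_size=8)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

history = []
for _ in range(3):
    train_loss = train_one_epoch(model, loader, loss_fn, optimizer)
    val_loss = validate(model, loader, loss_fn)
    history.append((train_loss, val_loss))
```

In practice the validation pass would use a separate loader built from the validation split.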

### 9. Model Evaluation:
- The trained model is evaluated on the test set, and performance metrics are displayed.
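
Which metrics the notebook reports is not stated here. As one example, accuracy and a confusion matrix can be computed from model outputs as follows. The logits and labels are made up for illustration.

```python
import torch
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical logits for 4 test sequences and 2 classes, not real results.
logits = torch.tensor([[2.0, 0.1], [0.3, 1.5], [1.2, 0.4], [0.2, 0.9]])
y_true = torch.tensor([0, 1, 1, 1])

# The predicted class is the index of the largest logit in each row.
y_pred = logits.argmax(dim=1)

accuracy = accuracy_score(y_true.numpy(), y_pred.numpy())
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true.numpy(), y_pred.numpy())
```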

### 10. Plotting:
- Matplotlib is used to plot training and validation loss curves.
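
A minimal plotting sketch. The loss values are placeholders rather than results from this project, and the `Agg` backend is selected only so the snippet runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Placeholder values for illustration, not real training results.
train_losses = [0.92, 0.71, 0.58, 0.49]
val_losses = [0.95, 0.78, 0.70, 0.68]

epochs = range(1, len(train_losses) + 1)
fig, ax = plt.subplots()
ax.plot(epochs, train_losses, label="Training loss")
ax.plot(epochs, val_losses, label="Validation loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
ax.legend()

# Render to an in-memory PNG; in a notebook, plt.show() would display it.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

A validation curve that flattens or rises while the training curve keeps falling is a typical sign of overfitting.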

### 11. Example Prediction:
- An example DNA sequence is provided, and the trained model predicts its class.
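
The prediction step might look roughly like this. The `predict` helper, the class names, and the untrained stand-in model are hypothetical, so the predicted label here is meaningless and only the mechanics are shown.

```python
import torch
import torch.nn.functional as F
from torch import nn

BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}
CLASS_NAMES = ["enhancer", "promoter"]  # hypothetical class names


def predict(model, seq, class_names):
    """Return the predicted class name and its softmax probability."""
    indices = torch.tensor([BASE_TO_INDEX[base] for base in seq.upper()])
    # Shape (1, 4, length): a batch holding a single one-hot sequence.
    x = F.one_hot(indices, num_classes=4).T.float().unsqueeze(0)
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1).squeeze(0)
    best = int(probs.argmax())
    return class_names[best], float(probs[best])


# Untrained stand-in model; a trained model would be used in practice.
model = nn.Sequential(
    nn.Conv1d(4, 8, kernel_size=3),
    nn.AdaptiveMaxPool1d(1),
    nn.Flatten(),
    nn.Linear(8, len(CLASS_NAMES)),
)
label, confidence = predict(model, "ATGCGTACGTTA", CLASS_NAMES)
```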

Feel free to explore the code and adapt it to your own genomics classification tasks. If you have any questions or suggestions, please open an issue.

Note: The code requires the PyTorch, scikit-learn, pandas, and matplotlib libraries. Install these dependencies before running it.