# Transfer Learning vs. Fine-Tuning ChemBERTa for Regression

## Summary

This repository compares fine-tuning and transfer learning for a regression task with [ChemBERTa](https://arxiv.org/abs/2010.09885), a BERT-like model pre-trained on chemical SMILES data. SMILES (Simplified Molecular Input Line Entry System) is a notation for representing chemical structures as text. We explore when transfer learning may be more appropriate than fine-tuning ChemBERTa, given that our dataset is far smaller than the model's pre-training data (a few hundred versus 77 million examples).

The regression task is to predict pIC50 values for inhibiting the catalytic activity of Dihydrofolate Reductase ([DHFR](https://en.wikipedia.org/wiki/Dihydrofolate_reductase)) in *Homo sapiens*. DHFR is a crucial enzyme in the folate metabolic pathway, and inhibiting its catalytic activity can disrupt the production of tetrahydrofolate, which is necessary for DNA synthesis. This disruption can slow down or prevent cancer cell replication, making DHFR an important target for cancer treatment.

pIC50 is a measure of a substance's potency: the negative base-10 logarithm of its half-maximal inhibitory concentration (IC50) expressed in molar units.
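
For example, an IC50 of 100 nM (1e-7 M) corresponds to a pIC50 of 7. A minimal conversion helper, assuming IC50 values are reported in nanomolar:

```python
import math

def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in molar)."""
    return -math.log10(ic50_nm * 1e-9)

print(pic50_from_ic50_nm(100.0))  # 100 nM -> 7.0
```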
## Dataset
Downloaded from https://github.com/KISysBio/qsar-models/tree/master, the dataset consists of SMILES representations, molecular descriptors, and corresponding pIC50 values. 
## Requirements
Before running the notebooks, ensure you have the following dependencies installed:

- Python 3.6+
- PyTorch
- Transformers library (Hugging Face)
- XGBoost
- NumPy
- pandas
- scikit-learn
- scipy
- matplotlib
- tqdm
- RDKit

You can install these packages using Conda and pip:

Using Conda (recommended for RDKit):

```bash
conda create -n myenv python=3.7  # Create a new Conda environment (optional)
conda activate myenv              # Activate the Conda environment (if created)
conda install -c conda-forge rdkit
pip install torch transformers xgboost numpy pandas scikit-learn scipy matplotlib tqdm
```

Installing RDKit through Conda is common practice in cheminformatics and ensures it is set up correctly; create and activate a Conda environment as needed.
## Notebooks
### 1. Data Preprocessing and Light EDA
- Notebook: `preprocessing.ipynb`
- This notebook covers data preprocessing and light exploratory data analysis (EDA); a sketch of typical SMILES cleaning follows below.
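
The kind of SMILES cleaning typically done at this stage, sketched below under assumed data and column names (not necessarily the notebook's exact code): parse each SMILES with RDKit, drop entries that fail to parse, and canonicalize the rest.

```python
import pandas as pd
from rdkit import Chem

def canonicalize(smiles: str):
    """Return the canonical SMILES string, or None if RDKit cannot parse it."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

# Hypothetical data and column names, for illustration only.
df = pd.DataFrame({"smiles": ["CCO", "not_a_smiles", "c1ccccc1"],
                   "pIC50": [5.2, 6.1, 4.8]})
df["smiles"] = df["smiles"].map(canonicalize)
df = df.dropna(subset=["smiles"])   # drop rows whose SMILES failed to parse
```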
### 2. Fine-Tuning ChemBERTa
- Notebook: `fine-tune.ipynb`
- In this notebook, ChemBERTa is fine-tuned with the Transformers library on the SMILES representations of molecules to predict pIC50 values; a minimal sketch of the setup follows below.
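
A minimal sketch of such a setup, assuming the `seyonec/ChemBERTa-zinc-base-v1` checkpoint and the Hugging Face `Trainer`; the checkpoint, variable names, and hyperparameters are illustrative rather than the notebook's exact choices.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed ChemBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=1 with float labels gives a single-output regression head (MSE loss).
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1, problem_type="regression")

class SmilesDataset(torch.utils.data.Dataset):
    def __init__(self, smiles, targets):
        self.enc = tokenizer(smiles, truncation=True, padding=True)
        self.targets = targets
    def __len__(self):
        return len(self.targets)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.targets[i], dtype=torch.float)
        return item

# train_smiles / train_pic50 are hypothetical lists from the preprocessing step.
train_ds = SmilesDataset(train_smiles, train_pic50)
args = TrainingArguments(output_dir="chemberta-dhfr", num_train_epochs=10,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```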
### 3. Transfer Learning with ChemBERTa Embeddings and XGBoost
60
61
- Notebook: `transfer-learning.ipynb`
62
- This notebook demonstrates transfer learning: ChemBERTa embeddings of the SMILES representations are used as features for an XGBoost regressor. It explores whether pre-trained ChemBERTa embeddings enhance predictive performance compared to fine-tuning; a minimal sketch of the pipeline follows below.
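
A minimal sketch of the pipeline under the same assumptions as above: a frozen ChemBERTa model turns each SMILES into a fixed-size embedding (mean-pooled last hidden state here; the notebook may pool differently), and an XGBoost regressor is fit on those embeddings.

```python
import torch
import xgboost as xgb
from transformers import AutoModel, AutoTokenizer

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed ChemBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

@torch.no_grad()
def embed(smiles_list):
    """Mean-pool the last hidden state over non-padding tokens."""
    enc = tokenizer(smiles_list, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**enc).last_hidden_state     # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Train/test splits are hypothetical variables from the preprocessing step.
X_train, X_test = embed(train_smiles), embed(test_smiles)
reg = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
reg.fit(X_train, train_pic50)
preds = reg.predict(X_test)
```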
63
64
### 4. Transfer Learning with ChemBERTa Embeddings and Molecular Descriptors
- Notebook: `transfer-learning-plus-descriptors.ipynb`
67
- This notebook extends the transfer-learning approach by incorporating molecular descriptors alongside the ChemBERTa embeddings. It evaluates the impact of these additional molecular features on prediction accuracy; a minimal sketch follows below.
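
A minimal sketch of adding descriptors, computed here with RDKit; the specific descriptor set is illustrative, and `X_train` / `train_smiles` carry over from the previous sketch.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptor_vector(smiles: str) -> np.ndarray:
    """A handful of standard RDKit descriptors; the notebook may use a different set."""
    mol = Chem.MolFromSmiles(smiles)
    return np.array([
        Descriptors.MolWt(mol),    # molecular weight
        Descriptors.MolLogP(mol),  # lipophilicity
        Descriptors.TPSA(mol),     # topological polar surface area
        Descriptors.NumHDonors(mol),
        Descriptors.NumHAcceptors(mol),
    ])

# Concatenate descriptors with the ChemBERTa embeddings before fitting XGBoost.
desc_train = np.vstack([descriptor_vector(s) for s in train_smiles])
X_train_full = np.hstack([X_train, desc_train])
```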
## Results
The notebooks compare the performance of the different approaches, evaluated with Mean Squared Error (MSE) and Spearman correlation.
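
Both metrics are available off the shelf (variable names below are placeholders):

```python
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, preds)  # lower is better
rho, _ = spearmanr(y_test, preds)        # rank correlation, higher is better
print(f"MSE: {mse:.3f}  Spearman rho: {rho:.3f}")
```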
Because the dataset is small, fine-tuning did not update the pre-trained weights enough, and the fine-tuned model produced nearly identical predictions for all observations in the test set.

Transfer learning improved predictions significantly, and adding molecular descriptors as predictors extended this improvement even further.

The transfer-learning results still suffer from overfitting, and more work on hyperparameter tuning is needed. However, the improvement in performance over fine-tuning is undeniable.