💊ML-Project-Drug-Review-Dataset💊
This machine learning project uses patient reviews, together with other attributes, to analyze and evaluate how effective different drugs are at treating specific conditions. Trained on a large dataset of patient experiences, the model produces informative ratings for the available drugs based on their real-world usage.
The project demonstrates the power of advanced machine learning techniques to extract meaningful insights from unstructured data, ultimately enabling more informed decision-making in the healthcare industry.
The project uses the following libraries:

- pandas: data manipulation and analysis.
- NumPy: numerical computing in Python.
- BeautifulSoup: web-scraping library used to pull data out of HTML and XML files.
- sklearn (scikit-learn): a popular Python machine learning library providing tools for data preprocessing, classification, regression, clustering, and more; widely used in industry and academia for building machine learning models.
- seaborn: a visualization library based on matplotlib for making attractive and informative statistical graphics.
- matplotlib: a plotting library for creating static, animated, and interactive visualizations in Python.

The dataset used for this project is the famous Drug Review Dataset (Drugs.com) from the UCI Machine Learning Repository. The dataset can be found and downloaded from here.
The data is split into train (75%) and test (25%) partitions, stored in two .tsv (tab-separated values) files, respectively.
The models predict the Rating column.
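Because the partitions are tab-separated, they load with pandas `read_csv` and `sep="\t"`. A minimal sketch, using a small in-memory sample in place of the real files (the UCI archive names them `drugsComTrain_raw.tsv` / `drugsComTest_raw.tsv`):

```python
import io
import pandas as pd

# In-memory stand-in for drugsComTrain_raw.tsv; swap in the real path,
# e.g. pd.read_csv("drugsComTrain_raw.tsv", sep="\t").
sample_tsv = (
    "drugName\tcondition\treview\trating\tdate\tusefulCount\n"
    "Valsartan\tLeft Ventricular Dysfunction\tIt has no side effect\t9\tMay 20, 2012\t27\n"
)
train_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(train_df.shape)           # (1, 6)
print(list(train_df.columns))
```

Loading the test partition works the same way, with the test file path and `test_df` in place of `train_df`.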
To contribute, fork the repository and clone your fork:

```shell
git clone https://github.com/<your-github-username>/ML-Project-Drug-Review-Dataset.git
cd ML-Project-Drug-Review-Dataset
```

Make your changes (the project code lives in main.py) on a new branch, then commit and push them:

```shell
git checkout -b <your_branch_name>
# Track the changes
git status
# Add changes to Index
git add .
git commit -m "your_commit_message"
git push origin <your_branch_name>
```

Finally, on GitHub, click Compare & pull request and then Create pull request.
```mermaid
flowchart TD
    A[Step 0 : Datasets provided by the UCI] --> B[Step 1 : Importing the necessary Libraries/Modules in the workspace]
    B --> C[Step 2 : Loading and reading both the train and test datasets into the workspace using pandas]
    C --> D[Step 3 : Data Preprocessing Starts]
    D --> E[Step 3.1 : Extracting day, month, and year into separate columns]
    E --> F[Step 3.2 : Handling missing values using SimpleImputer]
    F --> G[Step 3.3 : Converting the text using TfidfVectorizer from NLP]
    G --> H[Step 3.4 : Encoding the categorical columns using LabelEncoder]
    H --> I[Step 3.5 : Converting the data types of the columns to reduce the memory usage]
    I --> J[Step 4 : Applying 4 different ML models to find the best accuracy]
    J --> K[Step 5 : Plotting the different types of plots of every model]
```
1️⃣ Importing the necessary libraries and modules such as pandas, numpy, warnings, BeautifulSoup, MarkupResemblesLocatorWarning, SimpleImputer, ConvergenceWarning, TfidfVectorizer, LabelEncoder, LinearRegression, LogisticRegression, Perceptron, DecisionTreeClassifier, mean_squared_error, r2_score, accuracy_score, confusion_matrix, plot_confusion_matrix, seaborn, and matplotlib.
2️⃣ Reading the train and test datasets using the pandas read_csv function and storing them in train_df and test_df respectively.
3️⃣ Randomly sampling 80% of the training data using the pandas sample function.
4️⃣ Converting the date column to datetime format using the pandas to_datetime function.
5️⃣ Extracting day, month, and year into separate columns using the pandas dt accessor.
6️⃣ Suppressing warnings using the warnings.filterwarnings and warnings.simplefilter functions to keep the output clean.
7️⃣ Defining a function decode_html to decode HTML-encoded characters using BeautifulSoup.
8️⃣ Applying the decode_html function to the review column of both the train and test datasets.
9️⃣ Dropping the original date column and the first column using pandas drop function.
1️⃣0️⃣ Handling the missing values using SimpleImputer from scikit-learn.
1️⃣1️⃣ Assigning the old column names to the new dataframes using pandas columns attribute.
1️⃣2️⃣ Converting the text in the review column to numerical data using TfidfVectorizer from scikit-learn.
1️⃣3️⃣ Replacing the review column with the numerical data using pandas drop function and concat function.
1️⃣4️⃣ Encoding the categorical columns using LabelEncoder from scikit-learn.
1️⃣5️⃣ Converting the data types of columns to reduce the memory usage using pandas astype function.
1️⃣6️⃣ Splitting the train and test datasets into feature and target variables using the pandas drop function.
1️⃣7️⃣ First, applying the LinearRegression model to the project datasets.
1️⃣8️⃣ Second, applying the LogisticRegression model.
1️⃣9️⃣ Third, applying the Perceptron model.
2️⃣0️⃣ Fourth, applying the DecisionTreeClassifier model.
Figure 1: Results of all the models
Figure 2: Linear Regression - Training Data Scatter Plot
Figure 3: Linear Regression - Testing Data Scatter Plot
Figure 4: Linear Regression - Training and Testing Sets Scatter Plot
Figure 5: Linear Regression - Testing Data Residual Plot
Figure 6: Logistic Regression Accuracy
Figure 7: Logistic Regression Confusion Matrix
Figure 8: Scatter Plot - Actual vs Predicted Values for Perceptron Model
Figure 9: Step Plot - Accuracy for Perceptron Model
Figure 10: Perceptron - Confusion Matrix
Figure 11: Decision Tree Classifier Accuracy
Figure 12: Decision Tree Classifier - Testing Data Scatter Plot
Figure 13: Decision Tree Classifier - Confusion Matrix
Note: The model's highest accuracy is approximately 50%. Further refinement through training and fine-tuning is required to achieve optimal results.
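The four models can be compared with a loop like the following. This is a sketch on synthetic features, not the project's actual training code (the real project trains on the TF-IDF and encoded columns built during preprocessing), so the printed scores are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, Perceptron
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features and ratings.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) * 9 + 1  # ratings of 1 or 10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Linear regression predicts a continuous rating; report MSE rather than accuracy.
lin = LinearRegression().fit(X_train, y_train)
print("LinearRegression MSE:", mean_squared_error(y_test, lin.predict(X_test)))

# The three classifiers predict the rating class directly.
for model in (LogisticRegression(max_iter=1000), Perceptron(), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__} accuracy: {acc:.2f}")
```

Treating LinearRegression separately reflects that it is a regressor, while the other three are classifiers whose accuracy scores can be compared directly.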
Read our Contributing Guidelines to learn about our development process, how to propose bug fixes and improvements, and how to build ML-Project-Drug-Review-Dataset.
This project and everyone participating in it are governed by the Code of Conduct. By participating, you are expected to uphold this code.
GSSoC 2k23 | Rakesh Roshan