{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Project : Medical Treatment"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Problem statement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lot has been said during the past several years about how precision medicine and, more concretely, how genetic testing is going to disrupt the way diseases like cancer are treated.\n",
"\n",
"But this is only partially happening due to the huge amount of manual work still required. Once sequenced, a cancer tumor can have thousands of genetic mutations. But the challenge is distinguishing the mutations that contribute to tumor growth (drivers) from the neutral mutations (passengers). \n",
"\n",
"Currently this interpretation of genetic mutations is being done manually. This is a very time-consuming task where a clinical pathologist has to manually review and classify every single genetic mutation based on evidence from text-based clinical literature.\n",
"\n",
"We need to develop a Machine Learning algorithm that, using this knowledge base as a baseline, automatically classifies genetic variations.\n",
"\n",
"\n",
"\n",
"You can find all the details about the competition at the following link:\n",
"https://www.kaggle.com/c/msk-redefining-cancer-treatment\n",
"\n",
"To get the dataset, create a Kaggle account, go to the competition page linked above, and download these two files:\n",
"\n",
"***training_variants.zip*** and ***training_text.zip***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis of the problem statement"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's first understand the dataset provided and use it to frame the problem above in machine learning terms. Since the dataset is large, let's load it with Python."
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [],
"source": [
"# Loading all required packages\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import re\n",
"import time\n",
"import warnings\n",
"import numpy as np\n",
"from nltk.corpus import stopwords\n",
"from sklearn.decomposition import TruncatedSVD\n",
"from sklearn.preprocessing import normalize\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"from sklearn.manifold import TSNE\n",
"import seaborn as sns\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.metrics import confusion_matrix\n",
"from sklearn.metrics import accuracy_score, log_loss\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.linear_model import SGDClassifier\n",
"from imblearn.over_sampling import SMOTE\n",
"from scipy.sparse import hstack\n",
"from sklearn.multiclass import OneVsRestClassifier\n",
"from sklearn.svm import SVC\n",
"from sklearn.model_selection import StratifiedKFold \n",
"from collections import Counter, defaultdict\n",
"from sklearn.calibration import CalibratedClassifierCV\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.naive_bayes import GaussianNB\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.model_selection import GridSearchCV\n",
"import math\n",
"from sklearn.metrics import normalized_mutual_info_score\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"warnings.filterwarnings(\"ignore\")\n",
"\n",
"from mlxtend.classifier import StackingClassifier\n",
"\n",
"from sklearn import model_selection\n",
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two data files are provided for this problem. I have kept them inside a folder named `training`, so let's load them."
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [],
"source": [
"# Loading training_variants\n",
"data_variants = pd.read_csv('training/training_variants')\n",
"# Loading training_text; its fields are separated by '||', which needs a regex separator and the python engine\n",
"data_text = pd.read_csv(\"training/training_text\", sep=r\"\\|\\|\", engine=\"python\", names=[\"ID\", \"TEXT\"], skiprows=1)"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>FAM58A</td>\n",
" <td>Truncating Mutations</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>CBL</td>\n",
" <td>W802*</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>CBL</td>\n",
" <td>Q249E</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Gene Variation Class\n",
"0 0 FAM58A Truncating Mutations 1\n",
"1 1 CBL W802* 2\n",
"2 2 CBL Q249E 2"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_variants.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 3321 entries, 0 to 3320\n",
"Data columns (total 4 columns):\n",
"ID 3321 non-null int64\n",
"Gene 3321 non-null object\n",
"Variation 3321 non-null object\n",
"Class 3321 non-null int64\n",
"dtypes: int64(2), object(2)\n",
"memory usage: 103.9+ KB\n"
]
}
],
"source": [
"data_variants.info()"
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3321.000000</td>\n",
" <td>3321.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1660.000000</td>\n",
" <td>4.365854</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>958.834449</td>\n",
" <td>2.309781</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>830.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1660.000000</td>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2490.000000</td>\n",
" <td>7.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>3320.000000</td>\n",
" <td>9.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Class\n",
"count 3321.000000 3321.000000\n",
"mean 1660.000000 4.365854\n",
"std 958.834449 2.309781\n",
"min 0.000000 1.000000\n",
"25% 830.000000 2.000000\n",
"50% 1660.000000 4.000000\n",
"75% 2490.000000 7.000000\n",
"max 3320.000000 9.000000"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_variants.describe()"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3321, 4)"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the dimensions of the data\n",
"data_variants.shape"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['ID', 'Gene', 'Variation', 'Class'], dtype='object')"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the columns in the above dataset\n",
"data_variants.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's explore data_text"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>TEXT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>Cyclin-dependent kinases (CDKs) regulate a var...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>Abstract Background Non-small cell lung canc...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>Abstract Background Non-small cell lung canc...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID TEXT\n",
"0 0 Cyclin-dependent kinases (CDKs) regulate a var...\n",
"1 1 Abstract Background Non-small cell lung canc...\n",
"2 2 Abstract Background Non-small cell lung canc..."
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_text.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So the above dataset has two columns, ID and TEXT. The ID column is common to both datasets. Let's keep exploring."
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 3321 entries, 0 to 3320\n",
"Data columns (total 2 columns):\n",
"ID 3321 non-null int64\n",
"TEXT 3316 non-null object\n",
"dtypes: int64(1), object(1)\n",
"memory usage: 52.0+ KB\n"
]
}
],
"source": [
"data_text.info()"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>3321.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>1660.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>958.834449</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>830.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>1660.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>2490.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>3320.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID\n",
"count 3321.000000\n",
"mean 1660.000000\n",
"std 958.834449\n",
"min 0.000000\n",
"25% 830.000000\n",
"50% 1660.000000\n",
"75% 2490.000000\n",
"max 3320.000000"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_text.describe()"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['ID', 'TEXT'], dtype='object')"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_text.columns"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(3321, 2)"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Checking the dimensions\n",
"data_text.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, in short, my datasets look like this:\n",
" * data_variants (ID, Gene, Variation, Class)\n",
" * data_text (ID, TEXT)"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_variants.Class.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The class labels are discrete, so this is a ***classification*** problem, and since there are nine possible discrete outputs, we can call it a ***multi-class*** classification problem."
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"stop_words = set(stopwords.words('english'))"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [],
"source": [
"def data_text_preprocess(total_text, ind, col):\n",
"    # Clean one text entry and write the result back into data_text\n",
"    if type(total_text) is not int:\n",
"        string = \"\"\n",
"        # replace every special character with a space\n",
"        total_text = re.sub(r'[^a-zA-Z0-9\\n]', ' ', str(total_text))\n",
"        # collapse runs of whitespace into a single space\n",
"        total_text = re.sub(r'\\s+', ' ', str(total_text))\n",
"        # convert to lower case\n",
"        total_text = total_text.lower()\n",
"\n",
"        for word in total_text.split():\n",
"            # retain the word only if it is not a stop word\n",
"            if word not in stop_words:\n",
"                string += word + \" \"\n",
"\n",
"        # .loc avoids pandas chained-assignment problems\n",
"        data_text.loc[ind, col] = string"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for index, row in data_text.iterrows():\n",
"    if type(row['TEXT']) is str:\n",
"        data_text_preprocess(row['TEXT'], index, 'TEXT')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's merge the two datasets. Remember that ID is the common column, so let's merge on it."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Class</th>\n",
" <th>TEXT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>FAM58A</td>\n",
" <td>Truncating Mutations</td>\n",
" <td>1</td>\n",
" <td>cyclin dependent kinases cdks regulate variety...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>CBL</td>\n",
" <td>W802*</td>\n",
" <td>2</td>\n",
" <td>abstract background non small cell lung cancer...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>CBL</td>\n",
" <td>Q249E</td>\n",
" <td>2</td>\n",
" <td>abstract background non small cell lung cancer...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>CBL</td>\n",
" <td>N454D</td>\n",
" <td>3</td>\n",
" <td>recent evidence demonstrated acquired uniparen...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>CBL</td>\n",
" <td>L399V</td>\n",
" <td>4</td>\n",
" <td>oncogenic mutations monomeric casitas b lineag...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Gene Variation Class \\\n",
"0 0 FAM58A Truncating Mutations 1 \n",
"1 1 CBL W802* 2 \n",
"2 2 CBL Q249E 2 \n",
"3 3 CBL N454D 3 \n",
"4 4 CBL L399V 4 \n",
"\n",
" TEXT \n",
"0 cyclin dependent kinases cdks regulate variety... \n",
"1 abstract background non small cell lung cancer... \n",
"2 abstract background non small cell lung cancer... \n",
"3 recent evidence demonstrated acquired uniparen... \n",
"4 oncogenic mutations monomeric casitas b lineag... "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Merging the gene-variation data and the text data on ID\n",
"result = pd.merge(data_variants, data_text, on='ID', how='left')\n",
"result.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's very important to look for missing values; otherwise they create problems in the final analysis."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Class</th>\n",
" <th>TEXT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1109</th>\n",
" <td>1109</td>\n",
" <td>FANCA</td>\n",
" <td>S1088F</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1277</th>\n",
" <td>1277</td>\n",
" <td>ARID5B</td>\n",
" <td>Truncating Mutations</td>\n",
" <td>1</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1407</th>\n",
" <td>1407</td>\n",
" <td>FGFR3</td>\n",
" <td>K508M</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1639</th>\n",
" <td>1639</td>\n",
" <td>FLT1</td>\n",
" <td>Amplification</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2755</th>\n",
" <td>2755</td>\n",
" <td>BRAF</td>\n",
" <td>G596C</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Gene Variation Class TEXT\n",
"1109 1109 FANCA S1088F 1 NaN\n",
"1277 1277 ARID5B Truncating Mutations 1 NaN\n",
"1407 1407 FGFR3 K508M 6 NaN\n",
"1639 1639 FLT1 Amplification 6 NaN\n",
"2755 2755 BRAF G596C 7 NaN"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result[result.isnull().any(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see five rows with missing TEXT. Now the question is what to do with these missing values. Here we fill each missing TEXT with the row's Gene and Variation values joined by a space."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"result.loc[result['TEXT'].isnull(),'TEXT'] = result['Gene'] +' '+result['Variation']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check once again whether any missing values remain."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Gene</th>\n",
" <th>Variation</th>\n",
" <th>Class</th>\n",
" <th>TEXT</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [ID, Gene, Variation, Class, TEXT]\n",
"Index: []"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result[result.isnull().any(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Training, Test and Validation data\n",
"\n",
"Before we split the data into training, test and validation sets, we want to replace all spaces in the Gene and Variation columns with `_`."
]
},
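{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"y_true = result['Class'].values\n",
"# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default\n",
"result.Gene = result.Gene.str.replace(r'\\s+', '_', regex=True)\n",
"result.Variation = result.Variation.str.replace(r'\\s+', '_', regex=True)"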
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, so we can now split the data into train, test and validation sets."
]
},
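{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Hold out 20% of the data as the test set, then split the remaining 80% into train and cross-validation sets (80/20); stratify keeps the class proportions the same in every split\n",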
"X_train, test_df, y_train, y_test = train_test_split(result, y_true, stratify=y_true, test_size=0.2)\n",
"train_df, cv_df, y_train, y_cv = train_test_split(X_train, y_train, stratify=y_train, test_size=0.2)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of data points in train data: 2124\n",
"Number of data points in test data: 665\n",
"Number of data points in cross validation data: 532\n"
]
}
],
"source": [
"print('Number of data points in train data:', train_df.shape[0])\n",
"print('Number of data points in test data:', test_df.shape[0])\n",
"print('Number of data points in cross validation data:', cv_df.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the distribution of data in train, test and validation set."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"train_class_distribution = train_df['Class'].value_counts().sort_index()\n",
"test_class_distribution = test_df['Class'].value_counts().sort_index()\n",
"cv_class_distribution = cv_df['Class'].value_counts().sort_index()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 363\n",
"2 289\n",
"3 57\n",
"4 439\n",
"5 155\n",
"6 176\n",
"7 609\n",
"8 12\n",
"9 24\n",
"Name: Class, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_class_distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, what does the above variable tell us? In the train dataset, class 1 has 363 data points, class 2 has 289, and so on. It is easier to see this as a bar chart.\n",
"\n",
"***Visualizing the train class distribution***"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"my_colors = 'rgbkymc'\n",
"train_class_distribution.plot(kind='bar')\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Number of Data points per Class')\n",
"plt.title('Distribution of yi in train data')\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the distribution as percentages."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of data points in class 7 : 609 ( 28.672 %)\n",
"Number of data points in class 4 : 439 ( 20.669 %)\n",
"Number of data points in class 1 : 363 ( 17.09 %)\n",
"Number of data points in class 2 : 289 ( 13.606 %)\n",
"Number of data points in class 6 : 176 ( 8.286 %)\n",
"Number of data points in class 5 : 155 ( 7.298 %)\n",
"Number of data points in class 3 : 57 ( 2.684 %)\n",
"Number of data points in class 9 : 24 ( 1.13 %)\n",
"Number of data points in class 8 : 12 ( 0.565 %)\n"
]
}
],
"source": [
"# argsort on the negated counts orders the classes from most to least frequent\n",
"sorted_yi = np.argsort(-train_class_distribution.values)\n",
"for i in sorted_yi:\n",
"    print('Number of data points in class', i+1, ':', train_class_distribution.values[i], '(', np.round((train_class_distribution.values[i]/train_df.shape[0]*100), 3), '%)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualize the same for test set"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"my_colors = 'rgbkymc'\n",
"test_class_distribution.plot(kind='bar')\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Number of Data points per Class')\n",
"plt.title('Distribution of yi in test data')\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of data points in class 7 : 191 ( 28.722 %)\n",
"Number of data points in class 4 : 137 ( 20.602 %)\n",
"Number of data points in class 1 : 114 ( 17.143 %)\n",
"Number of data points in class 2 : 91 ( 13.684 %)\n",
"Number of data points in class 6 : 55 ( 8.271 %)\n",
"Number of data points in class 5 : 48 ( 7.218 %)\n",
"Number of data points in class 3 : 18 ( 2.707 %)\n",
"Number of data points in class 9 : 7 ( 1.053 %)\n",
"Number of data points in class 8 : 4 ( 0.602 %)\n"
]
}
],
"source": [
"# argsort on the negated counts orders the classes from most to least frequent\n",
"sorted_yi = np.argsort(-test_class_distribution.values)\n",
"for i in sorted_yi:\n",
"    print('Number of data points in class', i+1, ':', test_class_distribution.values[i], '(', np.round((test_class_distribution.values[i]/test_df.shape[0]*100), 3), '%)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's visualize for cross validation set"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYgAAAETCAYAAAAs4pGmAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XmcHHWd//HXmyRAYHLABkYIkYAEFIn4IyOgrOuMiIuAwuKxnAYF81hhRQVF8ERXfsZd431t5AZhuFQQUEFgRBQUwjXchBhCEghEITAhHAmf/aO+I81Y010z6e7qJO/n4zGPdFV9u+rdVZ3+dF3fVkRgZmY20HplBzAzs9bkAmFmZrlcIMzMLJcLhJmZ5XKBMDOzXC4QZmaWywWiRUn6saQv1Gler5bUJ2lEGu6RdFQ95p3m9ytJ0+s1vyEs96uSlkp6bIjPe8X6yJl+t6TOuoRcQ0kKSdulx1Xfi5Vth7GcQyVdNdycQ1zWsHOuq+T7IJpP0nygHVgJrALuAc4GZkfES8OY11ER8dshPKcHODciTh3KstJzTwa2i4jDhvrcepI0CXgA2DoiHi8zy9pIUgBTImJuvdpKmgz8BRgVESvrkXMo1pScrcR7EOV5d0SMAbYGZgKfAU6r90Ikjaz3PFvE1sBf19TisBZvF1ubRIT/mvwHzAfeMWDcrsBLwE5p+Ezgq+nxBOBy4Cngb8DvyYr7Oek5K4A+4ARgMhDAkcAC4PqKcSPT/HqArwF/BpYBlwKbpmmdwMK8vMDewAvAi2l5d1TM76j0eD3g88DDwONke0bj0rT+HNNTtqXA56qsp3Hp+U+k+X0+zf8d6TW/lHKcmfPcu8iKcP/wqLS8Nw5cH9W2D3AycGHK8QxwN9BRJfPrgavTdloCfLZiPhcD5wJPA0cBGwDfBhanv28DG1Tb5mnaZ4BFKc/9wJ45OXYHHgNGVIz7N+DOivfbjWn+jwLfB9avaBtke4pQ8V5Mw59Oz1kMfHhA232B29JrfAQ4ueJ5C1LbvvT3ZuAI4IaKNm8BbiZ7X94MvKViWg/wX8Af0mu/CphQZVvUM+drgGuBv5K9j34KjC/7s6Thn1VlB1gX/8gpEGn8AuCj6fHf/1OSfZj/mOxDbhTwVl4+PPiKefHyh9/ZwMbAaPILxCJgp9TmErJDTlClQKTHJ/e3rZjew8sF4sPAXGBboA34GXDOgGw/Sbl2Bp4HXjfIejqbrHiNSc99ADhysJwDnnsCcEHF8P5A74AcRQvEc8A+wIi0LW4a5HljyD6Qjgc2TMO7VcznReAAsiI3GvgKcBOwObAZ8Efgv6ptc2AHsg+0LStey2sGyfMQsFfF8EXAienxNLIiMjLN417gExVtcwsE2ZeEJbz83jlvQNtOYGp6jW9IbQ8YbL1TUSCATYEngcNTroPT8D9VvM8eArZP668HmDnIa693zu2AvciK+mZkX7y+XfZnSaP/fIiptSwm+08y0IvAFmTH21+MiN9HetdWcXJELI+IFYNMPyci7oqI5cAXgA8MdtJ2iA4FvhkR8yKiDzgJOGjAIZUvR8SKiLgDuIOsULxCyvLvwEkR8UxEzAdmkX14FHEusI+ksWn4cLI9ruG4ISKujIhVaR7/kDfZD3gsImZFxHMp958qpt8YEb+IiJfSdjkU+EpEPB4RTwBf5uXXN9g2X0X2IbWjpFERMT8iHhokz/lkH7JIGkNW5M4HiIg5EXFTRKxM6/Z/gbcVWBcfAM6oeO+cXDkxInoioje9xjvT8orMF7Jv9Q9GxDkp1/nAfcC7K9qcEREPpPV3IdkeYcNzRsTciLg6Ip5P2+qbQ3hdaywXiNYykexwwkD/Q/at/CpJ8ySdWGBejwxh+sNk31InFEpZ3ZZpfpXzHkl2Ur5f5VVHz5LtaQw0AVg/Z14Ti4SIiMVkhyLeK2k88C6ywwLDMTDvhoOcQ5hE9g13MAO3Sd662jI9zt
3mkZ1g/QTZB97jkrolbUm+84ADJW0AHAjcGhEPA0jaXtLlkh6T9DTw/ym2/bfkH987fydpN0nXSXpC0jLgPwrOt3/eDw8YN3CbF3nv1D2npM3Tul6U1te51dqvLVwgWoSkN5H9R7hh4LT0TfT4iNiW7NvUcZL27J88yCxr7WFMqnj8arJvrEuB5cBGFblGkO1SF53vYrITyJXzXkm2Cz8US1OmgfNaNIR5nAUcBryf7Nv7UJ47HI+QHasezMB1l7euFkP1bR4R50XEP6fnBvD13IVF3EP2wfgu4BCygtHvR2TfzqdExFjgs2SHsGp5lH9871Q6D7gMmBQR48gOk/XPd6jvnf75D2e71Tvn19L4N6T1dRjF1tcazQWiZJLGStoP6CY7tt+b02Y/SdtJEtlJtVXpD7IP3m2HsejDJO0oaSOyY+EXp0MoD5B9Q95X0iiyE8MbVDxvCTBZ0mDvnfOBT0raRlIb2TfTC2KIlwumLBcCp0gaI2lr4Diyb25F/QLYBfg42fmMRrsceJWkT0jaIOXerUr784HPS9pM0gTgi6TXN9g2l7SDpLenvYLnyE7Wrxpk/pB9EB4L/AvZOYh+Y9J8+yS9Fvhowdd4IXBExXvnSwOmjwH+FhHPSdqVrDD1e4LswoLB3q9XAttLOkTSSEn/DuxItl6Hqt45x5CdsH5K0kSyE+BrPReI8vxS0jNk3zo/R3ZM80ODtJ0C/JbsDXoj8MOI6EnTvkb2IfOUpE8NYfnnkJ18fIzshOqxABGxDDgaOJXsm9tyYGHF8/o/ZP4q6dac+Z6e5n092bXkzwEfG0KuSh9Ly59Htmd1Xpp/Iek49SXANmQnyxsqIp4hO5H5brL1+iDQVeUpXwVuAe4EeoFb0zgYfJtvQHZZ9NK0jM3Jvv0P5nyyE7LXRsTSivGfIvtQfIbsooELCr7GX5FdbXUt2SGwawc0ORr4Snpvf5Hsg7r/uc8CpwB/SO/X3QfM+69k53GOJ7ta6ARgvwG5C2lAzi+TfdlYBlxBE95PrcA3ytlaTdIXge2j5Bv7zNZEvlnH1lqSNiW7H6TolU9mVsGHmGytJOkjZIfvfhUR15edx2xN1LACIel0SY9LumvA+I9Juj91iPbfFeNPkjQ3TfvXRuWydUNE/CQiNo6I/yg7i9maqpGHmM4ku33/71ePSOoiu6P1DRHxvKTN0/gdgYPIuinYEvitpO3TlSxmZlaChu1BpN36gTd9fZTs1vjnU5v+jtb2B7rTXYp/IbvqYNdGZTMzs9qafZJ6e+Ctkk4hu/zxUxFxM9kNYjdVtFtIgTtmJ0yYEJMnT65LsOXLl7PxxhvXZV714kzFtGImaM1czlTM2p5pzpw5SyNis1rtml0gRgKbkHUS9ibgQknbkn9HYu71t5JmADMA2tvb+cY3vlGXYH19fbS1DXbXfjmcqZhWzAStmcuZilnbM3V1dQ3s0iRfPXr8G+yPrFfEuyqGfw10Vgw/RNaNw0lknbL1j/8N8OZa8582bVrUy3XXXVe3edWLMxXTipkiWjOXMxWztmcCbokW7M31F8DbIessjKwztqVkfaIclLon2IbsLtI/NzmbmZlVaNghJkn9t/hPkLSQrC+U04HT06WvLwDTUzW7W9KFZD+9uRI4JnwFk5lZqRpWICLi4EEm5XZ5EBGnkPV/YmZmLcB3UpuZWS4XCDMzy+UCYWZmuVwgzMwsl7v7NrOGmXziFTXbHD91JUfUaDd/5r71imRD4D0IMzPL5QJhZma5XCDMzCyXC4SZmeVygTAzs1wuEGZmlssFwszMcrlAmJlZLhcIMzPL5QJhZma5XCDMzCyXC4SZmeVygTAzs1wNKxCSTpf0ePr96YHTPiUpJE1Iw5L0XUlzJd0paZdG5TIzs2IauQdxJrD3wJGSJgF7AQsqRr8LmJL+ZgA/amAuMzMroGEFIiKuB/6WM+lbwAlAVIzbHzg7MjcB4yVt0ahsZmZWW1PPQUh6D7AoIu4YMGki8E
jF8MI0zszMSqKIqN1quDOXJgOXR8ROkjYCrgPeGRHLJM0HOiJiqaQrgK9FxA3pedcAJ0TEnJx5ziA7DEV7e/u07u7uumTt6+ujra2tLvOqF2cqphUzQWvmanam3kXLarZpHw1LVlRvM3XiuDolKmZt33ZdXV1zIqKjVrtm/uToa4BtgDskAWwF3CppV7I9hkkVbbcCFufNJCJmA7MBOjo6orOzsy7henp6qNe86sWZimnFTNCauZqdqdZPiUL2k6Ozeqt/FM0/tLNOiYrxtss07RBTRPRGxOYRMTkiJpMVhV0i4jHgMuCD6Wqm3YFlEfFos7KZmdk/auRlrucDNwI7SFoo6cgqza8E5gFzgZ8ARzcql5mZFdOwQ0wRcXCN6ZMrHgdwTKOymJnZ0PlOajMzy+UCYWZmuVwgzMwslwuEmZnlcoEwM7NcLhBmZpbLBcLMzHK5QJiZWS4XCDMzy+UCYWZmuVwgzMwslwuEmZnlcoEwM7NcLhBmZpbLBcLMzHK5QJiZWS4XCDMzy+UCYWZmuRr5m9SnS3pc0l0V4/5H0n2S7pT0c0njK6adJGmupPsl/WujcpmZWTGN3IM4E9h7wLirgZ0i4g3AA8BJAJJ2BA4CXp+e80NJIxqYzczMamhYgYiI64G/DRh3VUSsTIM3AVulx/sD3RHxfET8BZgL7NqobGZmVluZ5yA+DPwqPZ4IPFIxbWEaZ2ZmJVFENG7m0mTg8ojYacD4zwEdwIEREZJ+ANwYEeem6acBV0bEJTnznAHMAGhvb5/W3d1dl6x9fX20tbXVZV714kzFtGImaM1czc7Uu2hZzTbto2HJiuptpk4cV6dExazt266rq2tORHTUajdyKDOVtAkwKSLuHG4wSdOB/YA94+XqtBCYVNFsK2Bx3vMjYjYwG6CjoyM6OzuHG+UVenp6qNe86sWZimnFTNCauZqd6YgTr6jZ5vipK5nVW/2jaP6hnXVKVIy3XabmISZJPZLGStoUuAM4Q9I3h7MwSXsDnwHeExHPVky6DDhI0gaStgGmAH8ezjLMzKw+ipyDGBcRTwMHAmdExDTgHbWeJOl84EZgB0kLJR0JfB8YA1wt6XZJPwaIiLuBC4F7gF8Dx0TEqmG9IjMzq4sih5hGStoC+ADwuaIzjoiDc0afVqX9KcApRedvZmaNVWQP4ivAb4C5EXGzpG2BBxsby8zMylZzDyIiLgIuqhieB7y3kaHMzKx8RU5S/3c6ST1K0jWSlko6rBnhzMysPEUOMb0znaTej+xy1O2BTzc0lZmZla5IgRiV/t0HOD8i/latsZmZrR2KXMX0S0n3ASuAoyVtBjzX2FhmZla2mnsQEXEi8GagIyJeBJaTda5nZmZrsaJdbUwE9pK0YcW4sxuQx8zMWkTNAiHpS0AnsCNwJfAu4AZcIMzM1mpF9iDeB+wM3BYRH5LUDpza2FhmmckFO3sr0inc/Jn71iOS2TqjyFVMKyLiJWClpLHA48C2jY1lZmZlK7IHcUv67eifAHOAPtzTqpnZWq9IVxtHp4c/lvRrYOzq/B6EmZmtGQYtEJJ2qTYtIm5tTCQzM2sF1fYgZlWZFsDb65zFzMxayKAFIiK6mhnEzMxay6BXMUk6TNLhOeM/IumQxsYyM7OyVbvM9XjgFznjL0jTzMxsLVatQIyIiGcGjkxdf4/Kaf8Kkk6X9LikuyrGbSrpakkPpn83SeMl6buS5kq6s9oJcjMza45qBWKUpI0HjpQ0Bli/wLzPBPYeMO5E4JqImAJck4Yh675jSvqbAfyowPzNzKyBqhWI04CLJU3uH5Eed6dpVUXE9cDA347YHzgrPT4LOKBi/NmRuQkYL2mL2vHNzKxRql3F9A1JfcDvJLWRXdq6HJgZEcP9ht8eEY+m+T8qafM0fiLwSEW7hWnco8NcjpmZrSZFRO1GWYFQ3jmJGs+bDFweETul4aciYnzF9CcjYhNJVwBfi4gb0vhrgBMiYk7OPGeQHYaivb19Wn
d391AiDaqvr4+2tra6zKtenAl6Fy2r2aZ9NCxZUXteUyeOq0Oi4rz96rf9vO3qm6mrq2tORHTUalfo9yAiom/1IwGwRNIWae9hC7KO/yDbY5hU0W4rYPEgWWYDswE6Ojqis7OzLsF6enqo17zqxZko1Evr8VNXMqu39lt5/qGddUhUnLdf/baft105mYr05lpPlwHT0+PpwKUV4z+YrmbaHVjWfyjKzMzKUbVASFpP0luGM2NJ5wM3AjtIWijpSGAm2S/TPQjslYYh+yGiecBcsl5jj86ZpZmZNVHV/bqIeEnSLLLfpB6SiDh4kEl75rQN4JihLsPMzBqnyCGmqyS9V5IansbMzFpGkZPUxwEbA6skrQBE9qV/bEOTmZlZqYr8YNCYZgQxM7PWUvMQU7qy6DBJX0jDkyTt2vhoZmZWpiLnIH5IdpK6v4vvPuAHDUtkZmYtocg5iN0iYhdJtwFExJOSinTWZ2Zma7AiexAvShpB1hcTkjYDXmpoKjMzK12RPYjvAj8H2iWdArwP+HxDU9XZ5IK3+9fqFmD+zH3rFcnMrOUVuYrpp5Lm8PINbgdExL2NjWVmZmUr1FkfsBHQf5hpdOPimJlZqyhymesXyX7cZ1NgAnCGpDXqEJOZmQ1dkT2Ig4H/FxHPAUiaCdwKfLWRwczMrFxFrmKaD2xYMbwB8FBD0piZWcsosgfxPHC3pKvJzkHsBdwg6bsAEXFsA/OZmVlJihSIn6e/fj2NiWJmZq2kyGWuZzUjiJmZtZZm/+SomZmtIVwgzMws15AKRPqN6tX+oSBJn5R0t6S7JJ0vaUNJ20j6k6QHJV3gDgHNzMpV5Ea58ySNlbQxcA9wv6RPD3eBkiYCxwIdEbET2R3aBwFfB74VEVOAJ4Ejh7sMMzNbfUX2IHaMiKeBA4ArgVcDh6/mckcCoyWNJOvG41Hg7cDFafpZaXlmZlaSIgVilKRRZB/Yl0bEi6uzwIhYBHwDWEBWGJYBc4CnImJlarYQmLg6yzEzs9WjiKjeQDoW+AxwB7Av2R7EuRHx1mEtUNoEuAT4d+Ap4KI0/KWI2C61mQRcGRFTc54/A5gB0N7ePq27u7vmMnsXLavZpn00LFlRvc3UieNqzqee+vr6aGtra+oya2l2pnptO/D2gzV3+3nb1TdTV1fXnIjoqNWuyI1yv4yI7/YPSFoAfHg1sr0D+EtEPJHm9zPgLcB4SSPTXsRWwOK8J0fEbGA2QEdHR3R2dtZcYK3feYDs9yBm9VZfHfMPrb2seurp6aHI62umZmeq17YDbz9Yc7eft105mYocYrqkciCyXY7aX9sHtwDYXdJGkkT2OxP3ANeR/RgRwHTg0tVYhpmZraZBy7ak1wKvB8ZJOrBi0lhe2XnfkETEnyRdTNYj7ErgNrI9giuAbklfTeNOG+4yzMxs9VXbr9sB2A8YD7y7YvwzwEdWZ6ER8SXgSwNGzwN2XZ35mplZ/QxaICLiUuBSSW+OiBubmMnMzFpAkZPUcyV9Fphc2T4iVudEtZmZtbgiBeJS4PfAb4FVjY1jZmatokiB2CgiPtPwJGZm1lKKXOZ6uaR9Gp7EzMxaSpEC8XGyIrFC0tOSnpH0dKODmZlZuYr8otyYZgRZ10wueIdprTtR58/ct16RzMxeoeqNchFxn6Rd8qZHxK2Ni2VmZmWrtgdxHFmneLNypgVZ99xmZraWqnaj3Iz0b1fz4piZWauoeQ4i/RbER4F/SaN6gP9d3d+FMDOz1lbkPogfAaOAH6bhw9O4oxoVyszMylekQLwpInauGL5W0h2NCmRmZq2hyH0QqyS9pn9A0ra4yw0zs7VekT2ITwPXSZoHCNga+FBDU5mZWemK3Ch3jaQpZL8PIeC+iHi+4cnMzKxURa5i2hA4Gvhnsvsffi/pxxHxXKPDmZlZeYocYjqb7FfkvpeGDwbOAd7fqFBmZla+IgVihwFXMV23ulcxSRoPnArsRLZX8mHgfuACsh8mmg98ICKeXJ
3lmJnZ8BW5iuk2Sbv3D0jaDfjDai73O8CvI+K1wM7AvcCJwDURMQW4Jg2bmVlJihSI3YA/SpovaT5wI/A2Sb2S7hzqAiWNJbsr+zSAiHghIp4C9gfOSs3OAg4Y6rzNzKx+ihxi2rvOy9wWeAI4Q9LOwByy35xoj4hHASLiUUmb13m5ZmY2BIqI5i5Q6gBuAvaIiD9J+g7wNPCxiBhf0e7JiNgk5/kzyHqZpb29fVp3d3fNZfYuWlazTftoWLKiepupE8fVnE9RrZipiL6+Ptra2pq2vHqtJ1j711URa+r287arb6aurq45EdFRq10ZBeJVwE0RMTkNv5XsfMN2QGfae9gC6ImIHarNq6OjI2655Zaayyz64zyzeqvvUNXzx3laMVMRPT09dHZ2Nm159VpPsPavqyLW1O3nbVffTJIKFYgi5yDqKiIeAx6R1P/hvydwD3AZMD2Nmw5c2uxsZmb2siLnIBrhY8BPJa0PzCPrumM94EJJRwIL8H0WZmalKnIn9e5kN8m9DlgfGAEsj4ixw11oRNwO5O3e7DnceZqZWX0VOcT0fbK7px8ERpP9DsT3qj7DzMzWeIUOMUXEXEkjImIV2eWpf2xwLjMzK1mRAvFsOldwu6T/Bh4FNm5sLDMzK1uRQ0yHp3b/CSwHJgEHNjKUmZmVr0iBOCAinouIpyPiyxFxHLBfo4OZmVm5ihSI6TnjjqhzDjMzazGDnoOQdDBwCLCNpMsqJo0B/troYGZmVq5qJ6n/SHZCegIwq2L8M8CQe3E1M7M1y6AFIiIeBh4G3ty8OGZm1ipqnoOQtLukmyX1SXpB0ipJTzcjnJmZlcd3UpuZWS7fSW1mZrl8J7WZmeUa7p3U721kKDMzK1/NPYiIeFjSZunxlxsfyczMWsGgexDKnCxpKXAf8ICkJyR9sXnxzMysLNUOMX0C2AN4U0T8U0RsAuwG7CHpk01JZ2Zmpal2iOmDwF4RsbR/RETMk3QYcBXwrUaHM7PiJp94Rc02x09dyRE12s2fuW+9ItkartoexKjK4tAvIp4ARq3ugiWNkHSbpMvT8DaS/iTpQUkXpCunzMysJNUKxAvDnFbUx4F7K4a/DnwrIqYATwJH1mEZZmY2TNUKxM6Sns75ewaYujoLlbQVsC9wahoW8Hbg4tTkLOCA1VmGmZmtnmqd9Y1o4HK/DZxA1nU4wD8BT0XEyjS8EJjYwOWbmVkNiojmLlDaD9gnIo6W1Al8CvgQcGNEbJfaTAKujIh/2FORNAOYAdDe3j6tu7u75jJ7Fy2r2aZ9NCxZUb3N1Injas6nqFbMVERfXx9tbW1NW1691hN4XYHf50U1e9sVUc9MXV1dcyKio1a7Qn0x1dkewHsk7QNsCIwl26MYL2lk2ovYClic9+SImA3MBujo6IjOzs6aC6x11QZkV3fM6q2+OuYfWntZRbVipiJ6enooss7rpV7rCbyuwO/zopq97YooI1PTC0REnAScBNC/BxERh0q6CHgf0E32M6eXNjubWRFFLicFX1Jqa74ifTE1y2eA4yTNJTsncVrJeczM1mllHGL6u4joAXrS43nArmXmMTOzl7XSHoSZmbUQFwgzM8vlAmFmZrlcIMzMLJcLhJmZ5XKBMDOzXC4QZmaWywXCzMxyuUCYmVkuFwgzM8vlAmFmZrlcIMzMLJcLhJmZ5XKBMDOzXC4QZmaWywXCzMxyuUCYmVkuFwgzM8vlAmFmZrmaXiAkTZJ0naR7Jd0t6eNp/KaSrpb0YPp3k2ZnMzOzl5WxB7ESOD4iXgfsDhwjaUfgROCaiJgCXJOGzcysJE0vEBHxaETcmh4/A9wLTAT2B85Kzc4CDmh2NjMze5kioryFS5OB64GdgAURMb5i2pMR8Q+HmSTNAGYAtLe3T+vu7q65nN5Fy2q2aR8NS1ZUbzN14ria8ymqFTMV0dfXR1tbW9OWV6/1BPVbV0Uygd9T0JqZimj2+7yIembq6uqaExEdtdqVViAktQG/A0
6JiJ9JeqpIgajU0dERt9xyS81lTT7xipptjp+6klm9I6u2mT9z35rzKaoVMxXR09NDZ2dn05ZXr/UE9VtXRTKB31PQmpmKaPb7vIh6ZpJUqECUchWTpFHAJcBPI+JnafQSSVuk6VsAj5eRzczMMmVcxSTgNODeiPhmxaTLgOnp8XTg0mZnMzOzl9XeL6+/PYDDgV5Jt6dxnwVmAhdKOhJYALy/hGxmZpY0vUBExA2ABpm8ZzOzmJnZ4HwntZmZ5XKBMDOzXGWcg7AWVfSSxCNqtGv2JYlm1hjegzAzs1wuEGZmlssFwszMcrlAmJlZLhcIMzPL5QJhZma5XCDMzCyXC4SZmeVygTAzs1wuEGZmlstdbZiZlaxe3dxAfbu68R6EmZnlcoEwM7NcLhBmZpbL5yDMbJ3Sqsf7W1HL7UFI2lvS/ZLmSjqx7DxmZuuqlioQkkYAPwDeBewIHCxpx3JTmZmtm1qqQAC7AnMjYl5EvAB0A/uXnMnMbJ2kiCg7w99Jeh+wd0QclYYPB3aLiP+saDMDmJEGdwDur9PiJwBL6zSvenGmYloxE7RmLmcqZm3PtHVEbFarUaudpFbOuFdUsIiYDcyu+4KlWyKio97zXR3OVEwrZoLWzOVMxThTptUOMS0EJlUMbwUsLimLmdk6rdUKxM3AFEnbSFofOAi4rORMZmbrpJY6xBQRKyX9J/AbYARwekTc3aTF1/2wVR04UzGtmAlaM5czFeNMtNhJajMzax2tdojJzMxahAuEmZnlcoEwM7NcLhAtRNJrJe0pqW3A+L1LzLSrpDelxztKOk7SPmXlySPp7LIzVJL0z2k9vbPkHLtJGpsej5b0ZUm/lPR1SeNKynSspEm1WzaPpPUlfVDSO9LwIZK+L+kYSaNKzPUaSZ+S9B1JsyT9R7O3m09SDyDpQxFxRgnLPRY4BrgXeCPw8Yi4NE27NSJ2KSHTl8j6xRoJXA3sBvQA7wB+ExGnlJBp4GXPArqAawEi4j0lZPpzROyaHn+EbDv+HHgn8MuImNnsTCnL3cDO6erA2cCzwMXAnmn8gSVkWgYsBx4Czgcuiognmp1jQKafkr3HNwKeAtqAn5GtJ0XE9BIyHQu8G/gdsA9wO/Ak8G/A0REidasBAAAD20lEQVTR05QgEeG/ij9gQUnL7QXa0uPJwC1kRQLgthIzjSD7j/M0MDaNHw3cWVKmW4FzgU7gbenfR9Pjt5WU6baKxzcDm6XHGwO9ZWRKy7+3cr0NmHZ7WeuK7MjFO4HTgCeAXwPTgTElZboz/TsSWAKMSMMq8X3eW5FjI6AnPX51Mz8PWuo+iGaRdOdgk4D2ZmapMCIi+gAiYr6kTuBiSVuT3wVJM6yMiFXAs5IeioinU74Vkl4qKVMH8HHgc8CnI+J2SSsi4ncl5QFYT9ImZB98ivSNOCKWS1pZYq67KvaI75DUERG3SNoeeLGkTBERLwFXAVelQzjvAg4GvgHU7B+oAdZLN+ZuTPZhPA74G7ABUNohJrKCtSrlGAMQEQuaedhrnSwQZEXgX8l22SoJ+GPz4wDwmKQ3RsTtABHRJ2k/4HRgakmZXpC0UUQ8C0zrH5mOg5ZSINKHy7ckXZT+XUL57+NxwByy909IelVEPJbOJZVV3AGOAr4j6fNknbzdKOkR4JE0rQyvWB8R8SJZbwmXSRpdTiROA+4j21v+HHCRpHnA7mQ9SpfhVOBmSTcB/wJ8HUDSZmTFqynWyXMQkk4DzoiIG3KmnRcRh5SQaSuyb+yP5UzbIyL+UEKmDSLi+ZzxE4AtIqK32ZlysuwL7BERny07y0CSNgLaI+IvJecYA2xLVkgXRsSSErNsHxEPlLX8wUjaEiAiFksaT3aebUFE/LnETK8HXgfcFRH3lZJhXSwQZmZWmy9zNTOzXC4QZmaWywXCrCBJr5LULekhSfdIulLS9pLuKjubWSOUffWH2RpBkshufjsrIg5K495IeZdFmzWc9yDMiukCXoyIH/ePSJ
ckP9I/LGmypN9LujX9vSWN30LS9ZJul3SXpLdKGiHpzDTcK+mTzX9JZtV5D8KsmJ3I7nWo5nFgr4h4TtIUsq4kOoBDSF2TSOq/M/2NwMSI2AkgXVpp1lJcIMzqZxTw/XToaRWwfRp/M3B6ugP2F+nu73nAtpK+B1xBdmexWUvxISazYu6m4m7yQXySrC+fncn2HNYHiIjrye6GXQScI+mDEfFkatdD1rnfqY2JbTZ8LhBmxVwLbJB6awUgdYO+dUWbccCjqTuQw8m6biD1p/V4RPyErFuHXdLd6OtFxCXAF4Cm99ZrVosPMZkVEBEh6d+Ab0s6EXgOmA98oqLZD4FLJL0fuI6sW2vIepz9tKQXgT7gg8BE4AxJ/V/STmr4izAbIne1YWZmuXyIyczMcrlAmJlZLhcIMzPL5QJhZma5XCDMzCyXC4SZmeVygTAzs1wuEGZmluv/AHkdg3o13MnIAAAAAElFTkSuQmCC\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"my_colors = 'rgbkymc'\n",
"cv_class_distribution.plot(kind='bar')\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Data points per Class')\n",
"plt.title('Distribution of yi in cross validation data')\n",
"plt.grid()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of data points in class 7 : 153 ( 28.759 %)\n",
"Number of data points in class 4 : 110 ( 20.677 %)\n",
"Number of data points in class 1 : 91 ( 17.105 %)\n",
"Number of data points in class 2 : 72 ( 13.534 %)\n",
"Number of data points in class 6 : 44 ( 8.271 %)\n",
"Number of data points in class 5 : 39 ( 7.331 %)\n",
"Number of data points in class 3 : 14 ( 2.632 %)\n",
"Number of data points in class 9 : 6 ( 1.128 %)\n",
"Number of data points in class 8 : 3 ( 0.564 %)\n"
]
}
],
"source": [
"sorted_yi = np.argsort(-train_class_distribution.values)\n",
"for i in sorted_yi:\n",
" print('Number of data points in class', i+1, ':',cv_class_distribution.values[i], '(', np.round((cv_class_distribution.values[i]/cv_df.shape[0]*100), 3), '%)')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now question is because we need log-loss as final evaluation metrics how do we say that model we are going to build will be good model. For doing this we will build a random model and will evaluate log loss. Our model should return lower log loss value than this.\n",
"\n",
"## Building a Random model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ok, so we need to generate 9 random numbers because we have 9 class such that their sum must be equal to 1 because sum of Probablity of all 9 classes must be equivalent to 1."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [],
"source": [
"test_data_len = test_df.shape[0]\n",
"cv_data_len = cv_df.shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Log loss on Cross Validation Data using Random Model 2.4661836840027522\n"
]
}
],
"source": [
"# we create a output array that has exactly same size as the CV data\n",
"cv_predicted_y = np.zeros((cv_data_len,9))\n",
"for i in range(cv_data_len):\n",
" rand_probs = np.random.rand(1,9)\n",
" cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])\n",
"print(\"Log loss on Cross Validation Data using Random Model\",log_loss(y_cv,cv_predicted_y, eps=1e-15))"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Log loss on Test Data using Random Model 2.4863899316645934\n"
]
}
],
"source": [
"#we create a output array that has exactly same as the test data\n",
"test_predicted_y = np.zeros((test_data_len,9))\n",
"for i in range(test_data_len):\n",
" rand_probs = np.random.rand(1,9)\n",
" test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])\n",
"print(\"Log loss on Test Data using Random Model\",log_loss(y_test,test_predicted_y, eps=1e-15))\n"
]
},
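{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a further reference point, consider a model that assigns the same probability, 1/9, to every class. Its log loss is ln(9), approximately 2.197, whatever the true labels are, which is lower than the random model's scores above. The cell below is a minimal sketch of that check. The names `y_example` and `uniform_probs` are illustrative and the labels are made up, not taken from the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.metrics import log_loss\n",
"\n",
"# hypothetical labels covering all 9 classes (illustration only)\n",
"y_example = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 7, 4])\n",
"# every class gets probability 1/9 for every data point\n",
"uniform_probs = np.full((len(y_example), 9), 1/9)\n",
"print('Log loss of the uniform model:', log_loss(y_example, uniform_probs))\n",
"print('ln(9):', np.log(9))"
]
},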
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"# Lets get the index of max probablity\n",
"predicted_y =np.argmax(test_predicted_y, axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([6, 1, 7, 2, 8, 6, 7, 8, 1, 0, 3, 1, 0, 0, 4, 3, 2, 4, 6, 1, 3, 2,\n",
" 5, 2, 7, 2, 7, 0, 2, 5, 1, 2, 0, 1, 1, 6, 5, 1, 5, 0, 6, 6, 7, 3,\n",
" 1, 3, 1, 5, 1, 0, 3, 7, 4, 5, 4, 2, 2, 2, 6, 6, 5, 5, 3, 0, 1, 8,\n",
" 6, 8, 8, 1, 7, 8, 6, 0, 2, 3, 2, 3, 2, 7, 1, 7, 6, 4, 1, 3, 6, 1,\n",
" 8, 8, 4, 2, 2, 5, 3, 2, 8, 0, 1, 4, 2, 2, 2, 5, 8, 8, 5, 5, 4, 2,\n",
" 6, 1, 4, 4, 8, 6, 3, 6, 2, 0, 4, 6, 4, 3, 4, 2, 3, 5, 5, 2, 5, 8,\n",
" 0, 8, 7, 6, 6, 2, 4, 4, 2, 4, 6, 6, 3, 0, 5, 4, 8, 1, 2, 7, 6, 2,\n",
" 8, 0, 8, 7, 5, 5, 6, 8, 6, 4, 0, 8, 2, 6, 7, 0, 1, 2, 3, 1, 0, 2,\n",
" 8, 1, 3, 4, 4, 8, 1, 0, 6, 6, 3, 5, 1, 0, 6, 2, 4, 4, 1, 2, 4, 3,\n",
" 8, 8, 3, 3, 0, 8, 4, 4, 2, 3, 0, 4, 1, 6, 5, 8, 6, 0, 5, 6, 1, 5,\n",
" 8, 0, 6, 8, 5, 1, 4, 4, 0, 8, 4, 7, 6, 1, 2, 1, 6, 7, 1, 2, 2, 0,\n",
" 8, 0, 3, 0, 3, 1, 3, 8, 0, 5, 4, 8, 2, 2, 4, 3, 2, 6, 8, 5, 7, 2,\n",
" 2, 4, 4, 2, 3, 2, 3, 1, 3, 0, 1, 1, 6, 7, 3, 5, 4, 6, 4, 3, 1, 2,\n",
" 2, 8, 4, 3, 8, 5, 4, 6, 7, 5, 0, 3, 8, 7, 3, 7, 2, 6, 3, 2, 0, 7,\n",
" 6, 3, 6, 8, 7, 3, 3, 4, 7, 1, 4, 0, 5, 3, 6, 1, 7, 4, 8, 5, 5, 8,\n",
" 0, 7, 7, 2, 1, 3, 4, 3, 4, 6, 8, 1, 0, 0, 1, 2, 7, 6, 7, 8, 3, 4,\n",
" 4, 3, 6, 8, 0, 2, 5, 2, 1, 4, 4, 5, 4, 7, 7, 2, 5, 2, 5, 4, 5, 0,\n",
" 3, 5, 7, 5, 0, 6, 6, 8, 6, 5, 8, 4, 8, 5, 4, 1, 1, 2, 4, 8, 3, 1,\n",
" 6, 1, 5, 5, 1, 6, 7, 6, 3, 1, 3, 5, 5, 0, 0, 8, 3, 8, 8, 3, 8, 2,\n",
" 3, 2, 6, 3, 2, 3, 2, 3, 3, 3, 2, 8, 6, 4, 3, 4, 1, 3, 6, 2, 2, 6,\n",
" 4, 2, 7, 1, 2, 2, 3, 7, 3, 6, 6, 2, 0, 2, 6, 3, 4, 8, 8, 4, 7, 5,\n",
" 5, 3, 7, 7, 0, 4, 0, 5, 3, 4, 3, 7, 8, 5, 7, 1, 0, 1, 2, 3, 1, 2,\n",
" 8, 3, 5, 7, 7, 4, 1, 0, 5, 7, 3, 3, 3, 7, 0, 1, 2, 1, 6, 4, 6, 2,\n",
" 4, 4, 7, 0, 8, 2, 8, 6, 8, 7, 5, 8, 2, 4, 3, 4, 7, 7, 3, 1, 0, 0,\n",
" 7, 8, 4, 1, 8, 3, 3, 4, 2, 6, 4, 4, 4, 4, 4, 1, 4, 4, 4, 3, 3, 1,\n",
" 1, 3, 7, 1, 2, 6, 6, 6, 6, 5, 1, 6, 1, 1, 6, 8, 5, 2, 8, 2, 5, 2,\n",
" 4, 5, 2, 1, 2, 4, 5, 6, 5, 5, 0, 7, 5, 3, 6, 3, 5, 0, 0, 1, 2, 1,\n",
" 1, 3, 1, 6, 0, 3, 5, 4, 7, 3, 6, 7, 8, 3, 6, 3, 2, 7, 4, 1, 2, 6,\n",
" 3, 0, 2, 7, 4, 4, 0, 6, 0, 0, 2, 1, 1, 8, 5, 8, 8, 3, 3, 6, 3, 2,\n",
" 0, 0, 1, 7, 3, 1, 2, 3, 7, 2, 2, 0, 7, 0, 8, 0, 1, 1, 5, 5, 5, 5,\n",
" 0, 1, 0, 3, 6], dtype=int64)"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Lets see the output. these will be 665 values present in test dataset\n",
"predicted_y"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So you can see the index value ranging from 0 to 8. So, lets make it as 1 to 9 we will increase this value by 1."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"predicted_y = predicted_y + 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confusion Matrix"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"C = confusion_matrix(y_test, predicted_y)\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"labels = [1,2,3,4,5,6,7,8,9]\n",
"plt.figure(figsize=(20,7))\n",
"sns.heatmap(C, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
"plt.xlabel('Predicted Class')\n",
"plt.ylabel('Original Class')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision matrix"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"B =(C/C.sum(axis=0))"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(20,7))\n",
"sns.heatmap(B, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
"plt.xlabel('Predicted Class')\n",
"plt.ylabel('Original Class')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Recall matrix"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"A =(((C.T)/(C.sum(axis=1))).T)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(20,7))\n",
"sns.heatmap(A, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
"plt.xlabel('Predicted Class')\n",
"plt.ylabel('Original Class')\n",
"plt.show()"
]
},
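{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the two normalisations concrete, here is a small sketch on a made-up 3-class confusion matrix (`C_toy` is hypothetical and not derived from the data). Dividing by the column sums gives the precision matrix, whose columns each sum to 1. Dividing by the row sums gives the recall matrix, whose rows each sum to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# made-up confusion matrix: rows = original class, columns = predicted class\n",
"C_toy = np.array([[5, 1, 0],\n",
"                  [2, 3, 1],\n",
"                  [0, 2, 6]])\n",
"# precision matrix: divide each column by its sum\n",
"precision_toy = C_toy / C_toy.sum(axis=0)\n",
"# recall matrix: divide each row by its sum\n",
"recall_toy = (C_toy.T / C_toy.sum(axis=1)).T\n",
"print(precision_toy.sum(axis=0))  # column sums of the precision matrix\n",
"print(recall_toy.sum(axis=1))     # row sums of the recall matrix"
]
},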
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating Gene Column\n",
"\n",
"Now we will look at each independent column to make sure its relavent for my target variable but the question is, how? Let's understand with our first column Gene which is categorial in nature.\n",
"\n",
"So, lets explore column ***Gene*** and lets look at its distribution. "
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of Unique Genes : 237\n",
"BRCA1 166\n",
"TP53 89\n",
"EGFR 79\n",
"BRCA2 78\n",
"PTEN 71\n",
"KIT 67\n",
"BRAF 57\n",
"ERBB2 46\n",
"CDKN2A 40\n",
"PDGFRA 39\n",
"Name: Gene, dtype: int64\n"
]
}
],
"source": [
"unique_genes = train_df['Gene'].value_counts()\n",
"print('Number of Unique Genes :', unique_genes.shape[0])\n",
"# the top 10 genes that occured most\n",
"print(unique_genes.head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets see the number of unique values present in gene"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"237"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unique_genes.shape[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets look at the comulative distribution of unique Genes values"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD8CAYAAACMwORRAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xt4VNW9//H3yo1ACEkgIUCABEIQCJcEAmhRCdoq8FPQU1S8HEtbRbQcj/bUR60etbb2WLWtrbVSsd5qK4JtlSoK3iIWuSsgCUIIBAiBEEIScr/N+v2RIQZIyBAmmczM5/U8eZLZs/ee71pMPqysWbPHWGsRERHfEuDpAkRExP0U7iIiPkjhLiLigxTuIiI+SOEuIuKDFO4iIj5I4S4i4oMU7iIiPkjhLiLig4I89cDR0dE2ISGhXcdWVFQQFhbm3oK8jL/3gb+3H9QH/tr+zZs3H7XWxrS1n8fCPSEhgU2bNrXr2IyMDNLT091bkJfx9z7w9/aD+sBf22+M2efKfpqWERHxQQp3EREfpHAXEfFBCncRER+kcBcR8UFthrsx5kVjzBFjzPZW7jfGmN8bY3YbY7YZY8a7v0wRETkbrozcXwamn+H+GUCS82s+8Ny5lyUiIueizXXu1trVxpiEM+wyG3jVNn5e3zpjTKQxpr+19pCbahQR8TrVdQ2UVNZxrKKW4srapu9F5bVcMqIv4wZFdujju+NNTHHAgWa385zbTgt3Y8x8Gkf3xMbGkpGR0a4HLC8vb/exvsLf+8Df2w/qg85sf4PDUl4H5bWWsjpLWa1t+vmb75y0rbqh9fMVH9pH8eDgDq3ZHeFuWtjW4qduW2ufB54HSEtLs+19d5m/vjOtOX/vA39vP6gP2tt+h8NyvLr5iLqO4opajjlH18cqaptuFztvH6+ub/V8PbsFERUWQu8eIcT3afweFRZC77AQonqE0Dss2Pm9cXtk92CCAjt+LYs7wj0PGNTs9kAg3w3nFRE5I2st5TX1FFfUnRTGzadBGsP6m/uLK2txtDj8hJCgAPo0hXIIg6J6nBzSYSeHd2SPYLoFBXZuo13kjnBfDiw0xiwBJgOlmm8Xkfaormv4ZvRc2XwU3Wx0Xd543+HiSio+eI+6hpaTOijANAvjYIbH9vxmBN1sJH3i/t5hIXQPDsSYliYjvE+b4W6MeR1IB6KNMXnAw0AwgLV2EbACmAnsBiqB73dUsSLiPWrrHZRUfjPdcabRdXFF4zRJVV3LE9XGQFSPEKJ6NIbw4N496BtURfKw+GYj6ZOnP8K7BflMULeHK6tlrm/jfgv8yG0ViUiX0+CwlFbVtTCiPhHWdU3bT9xXVtP6PHV4aFDTCLpveCjnxfZqcdrjRFhHdA8mMODkoG6ccx/R0U33Wh675K+IeF5tvYN9RRXkFFZw4FglRaeGtvN7SVUdtpV56h4hgSeNmBP69Gg1pKPCgonsHkJIkN4c39EU7iI+zlpLUUUtOUfK2XO0gj2F5eQUNn4/UFxFQ7NXF0MCA4hqNr0xsn+vb0K6RzC9e3Y7aY46qkcIocFd8wVFf6dwF/ERNfUN7C+qJKcpvCvIKSxnT2H5SUv5ugUFMCQ6jOQBEcwaN4ChMT0ZGhNGfJ8weoX69zy1L1G4i3gRay1Hy2vJKSwn40Ad/34nqzHAjzZOqzRf4tevVyhDY8KYnRLH0JiwxhCPDiMusjsBAQpwX6dwF+mCqusa2FdUyR5ncOc0m0opazYKDw3ex5DonoyOi2B2ShyJMWEMje7JkJgwenbTr7c/07++iAdV1zWw/WApOwvKmk2jVJBXfPIovH9E4yj86tQ4hkY3jsKP5HzFf1w+TaNwaZHCXaQTHauoZVPuMTbvK2bTvmK+yiultsEBQPfgQIZEhzF2YERjiMeEkRjTkyHRYYS1MArPyA9QsEurFO4iHcRay96jFWzaV8ym3GNs2lfMnsIKoHFVyp
iBEXx/SgIT4qNIjougf69QhbW4jcJdxE1q6hvYfvA4m/cdY2NuMV/sK6aoohaAyB7BpMVHcc2EQaQlRDEmLkJLCKVDKdxF2qmksrZpemVT7jG25pVSW984xZLQpwfp5/VlYkIUaQlRDI3uqVG5dCqFu4gLrLXsK6pk077ippH57iPlQOMFqkbHRXDz+fGkJfRmQnwUMeHdPFyx+DuFu0gLausdZOaXNo7McxtH50fLawDoFRrEhPgork6NY0J8FOMGRtI9RFMs0rUo3EWA49V1ziA/xqbcYrbmlVBd1zjFMqh3dy5OimZCQhRp8b1J6qspFun6FO7itw6WVPFB5mE+2FHA+j3HqHdYAgMMyQN6ccOkeNISokiLj6Jvr1BPlypy1hTu4jestWTmH+eDrAI+yCog69BxABJjwrjloqFcnBRNyuBIeoTo10K8n57F4tNq6x2s31vEB1kFfJhVQH5pNcbAhMFR3D9jBN8ZFcvQmJ6eLlPE7RTu4nOOV9eRsbOQD7IKyPj6CGU19YQGB3BRUgx3fWc4l47oS5+eWs0ivk3hLj7hYEkVHzqnW9btKaLeYekTFsKMMf34zqh+XDgsWitaxK8o3MUrWWvJyj/OqqzDfJBVQGZ+4/z50JgwfnjREC4bFUvKoKjTPppNxF8o3MWrFFfU8s8vD/LS59UcWPlZ0/z5fc7580TNn4sACnfxAg6H5d+7j/LGpgN8kFlAbYODIb0C+PlVo5kxuh/Rmj8XOY3CXbqsvOJKlm3K483NeRwsqSKyRzA3TB7MdRMHUbDzC9LPj/d0iSJdlsJdupTqugZWZRWwdOMB1uQcBeDCYdHcP7Nx2qVbUOOLogU7PVmlSNencJcuISv/OEs3HeCfXx6ktKqOuMju/PelScyZMJCBUT08XZ6I11G4i8eUVtWxfMtBlm7K46uDpYQEBnBZcizXTRzElMRoXb9F5Bwo3KVTORyWdXuLWLrxAO9tP0xNvYMR/cJ5+MpRXJUSR1RYiKdLFPEJCnfpFOU19by+fj9/WbeP/ccqCQ8N4pq0gVyXNpjRcb0wRqN0EXdSuEuHKiyr4eXP9/KXtfs4Xl3PpITe3P2dJKYn99c7RkU6kMJdOsTeoxUs/mwPb27Oo67BwfTkftw2NZGUQZGeLk3ELyjcxa225ZWw6NMc3tt+mOCAAL47YSC3XjREV14U6WQKdzln1lpWZx9lUUYOa/cUER4axIKpiXx/SgJ9w/VBFyKeoHCXdqtvcPDuV4dY9Okedhw6Tmyvbvx05giunzSY8NBgT5cn4tcU7nLWqmobWLrpAIs/20NecRWJMWE88d2xzE4d0PQOUhHxLJfC3RgzHfgdEAi8YK19/JT7BwOvAJHOfe6z1q5wc63iYcUVtbyyNpdXPs+luLKO8YMjeeiKUXx7ZKzecCTSxbQZ7saYQOBZ4DtAHrDRGLPcWpvVbLcHgaXW2ueMMaOAFUBCB9QrHnC8uo4/fZrDi//OpaqugUtH9OW2qYlMTIjS+nSRLsqVkfskYLe1dg+AMWYJMBtoHu4W6OX8OQLId2eR4hk19Q38Ze0+/vDJbkoq67hy3AAWThvGef3CPV2aiLTBlXCPAw40u50HTD5ln0eAVcaY/wLCgG+7pTrxCIfD8vbWgzy1chcHS6q4KCmae6ePYHRchKdLExEXGWvtmXcw5hrgcmvtLc7b/wlMstb+V7N9fuw816+NMRcAfwZGW2sdp5xrPjAfIDY2dsKSJUvaVXR5eTk9e/r3uumO6ANrLV8dbWDZrjoOlDmI7xXANcNDGB3d9V4k1XNAfeCv7Z82bdpma21aW/u5MnLPAwY1uz2Q06ddfghMB7DWrjXGhALRwJHmO1lrnweeB0hLS7Pp6ekuPPzpMjIyaO+xvsLdfbAtr4T/W/E1a/cUMah3d3439zyuHDugy75QqueA+sDf298WV8J9I5BkjBkCHATmAjecss9+4FLgZWPMSCAUKHRnodIxco9W8OSqnby77R
C9w0J45MpR3DA5npCgAE+XJiLnoM1wt9bWG2MWAitpXOb4orU20xjzKLDJWrsc+B9gsTHmbhpfXJ1n25rvEY8qLKvhmY+z+dv6/QQHBnDnJcO49eKhevORiI9waZ27c836ilO2PdTs5yxgintLk45Q1+DgpTV7+d2H2VTXO7h+0iDuvDRJlwkQ8TF6h6ofWb+niP99ezu7Csr59si+/HTmSF3QS8RHKdz9QGFZDf+3Ygf/+PIgcZHdWXxzGt8ZFevpskSkAyncfZi1lmWb8vj5u1lU1zWwcNowfjRtmD4kQ8QPKNx9VMHxau77+zY+2VnI+UN789jVY0jUFIyI31C4+xhrLcu35vPQ25nU1DfwyJWjuPmChC67Xl1EOobC3YdU1Tbwv29v583NeYwfHMlT14zTC6Yifkrh7iNyj1aw4LXN7Cwo485Lk/jvS5MI1GhdxG8p3H3A+9sPc8+yrQQGGl6aN5H08/p6uiQR8TCFuxera3Dw5MqdPL96D+MGRvDsjeMZGNXD02WJSBegcPdSJdUObly8ng25x/jP8+N58IqR+og7EWmicPdC6/YU8dDn1dTZWp6+LoWrUuM8XZKIdDEKdy9ireXVtft49J0sYrrDq/OnMDxWn4okIqdTuHuJ2noHDy/P5PUN+/n2yL58N65cwS4irdJFu71AeU09P3h5I69v2M+PpiXy/H+m0T1IyxxFpHUauXdxReU1fP/ljWTmH+epa8YxZ8JAT5ckIl5A4d6FHThWyfde3EB+aRWLb57AJSN0JUcRcY3CvYvKLijjpj+vp6q2gdd+OJm0hN6eLklEvIjCvQvaVVDGDYvXYYxh6YILGNGvl6dLEhEvo3DvYnYVlHH98+sIDDC8Pv98XaZXRNpFq2W6EAW7iLiLRu5dxM7DjVMxgQGGJfPP16V6ReScaOTeBZwI9qBABbuIuIfC3cP2Hq3gxhfWExRoeP1WBbuIuIemZTzoYEkVN72wHoe1LLlFwS4i7qORu4cUlddw0wvrOV5dx6s/mMSwvrpOjIi4j8LdA6rrGrj11U3kl1Tx0ryJjI6L8HRJIuJjNC3TyRwOy/8s28qXB0r44w3j9c5TEekQGrl3sidX7eTdbYe4f8YIZozp7+lyRMRHKdw70ZIN+3kuI4cbJg/m1ouGerocEfFhCvdO8ll2IQ+8tZ2pw2N4dFYyxuh67CLScRTunWDn4TLueO0Lkvr25A83pBIUqG4XkY6llOlgR45X84OXN9I9JJAX500kPDTY0yWJiB/QapkOVFlbzy2vbqK4spalt13AgMjuni5JRPyERu4dxOGw3LVkC9sPlvLM9alayy4incqlcDfGTDfG7DTG7DbG3NfKPtcaY7KMMZnGmL+5t0zv89sPd7Eqq4AH/98oLh2pj8cTkc7V5rSMMSYQeBb4DpAHbDTGLLfWZjXbJwm4H5hirS02xvTtqIK9wbvbDvHMx7u5Lm0Q35+S4OlyRMQPuTJynwTsttbusdbWAkuA2afscyvwrLW2GMBae8S9ZXqPzPxSfrJsKxPio3j0Ki15FBHPcCXc44ADzW7nObc1NxwYboxZY4xZZ4yZ7q4CvcnR8hrmv7qZyB7BPHfTeLoFBXq6JBHxU66slmlp6GlbOE8SkA4MBD4zxoy21pacdCJj5gPzAWJjY8nIyDjbegEoLy9v97EdpcFheWJjNUeOO/jp5FCyNq8jq+3D2q0r9kFn8vf2g/rA39vfFlfCPQ8Y1Oz2QCC/hX3WWWvrgL3GmJ00hv3G5jtZa58HngdIS0uz6enp7So6IyOD9h7bUX71/tfsLM7h19eM47sTBnb443XFPuhM/t5+UB/4e/vb4sq0zEYgyRgzxBgTAswFlp+yz1vANABjTDSN0zR73FloV/bRjgKey8jh+kmDOiXYRUTa0ma4W2vrgYXASmAHsNRam2mMedQYM8u520qgyBiTBXwC3GOtLeqooruSA8cq+fHSrYzq34uHr0z2dDkiIoCL71C11q
4AVpyy7aFmP1vgx84vv1FT38DCv32Bw2F57qbxhAbrBVQR6Rp0+YFz8Ni7O9iaV8qimyYQ3yfM0+WIiDTR5Qfa6V9b83l17T5uuXAI00f383Q5IiInUbi3Q35JFff/4ysmxEdx74wRni5HROQ0CvezZK3lwbe20+Cw/PbaFIJ1bXYR6YKUTGdp+dZ8Pv76CD+5/DwG9+nh6XJERFqkcD8LR8treGR5JqmDI5n3rQRPlyMi0iqF+1n42b+yqKhp4InvjiUwQBcEE5GuS+Huog+yCvjX1nwWXjKMpNhwT5cjInJGCncXVNc18PDb2xnRL5wFUxM9XY6ISJsU7i5YvHoP+aXV/GxWMiFB6jIR6fqUVG0oOF7NHzNymJ7cj8lD+3i6HBERlyjc2/Dkyp00OCz3z9SblUTEeyjcz2D7wVL+/kUe35+SoGvHiIhXUbifwRMrdxLRPZgfXTLM06WIiJwVhXsr1u8pYvWuQu5IT6RXaLCnyxEROSsK9xZYa3lq1U5ie3Xj5gsSPF2OiMhZU7i34NNdhWzMLWbhJUn6AA4R8UoK91NYa/n1ql0M6t2d69IGtX2AiEgXpHA/xcdfH+Grg6XceUmS3rAkIl5L6dWMtZbff5TN4N49uCo1ztPliIi0m8K9mU93FbI1r5Q70hP1IRwi4tWUYE7WWn73UTZxkd35j/EDPV2OiMg5Ubg7rdldxJf7S7g9PVFz7SLi9ZRiTn9anUPf8G5ck6ZRu4h4P4U7kF1QxmfZR7n5gni6BWldu4h4P4U78NLnuXQLCuD6SYM9XYqIiFv4fbiXVNbyjy/yuColjj49u3m6HBERt/D7cH99wwGq6xx8/8IET5ciIuI2fh3udQ0OXl2by7cS+zCiXy9PlyMi4jZ+He4rMw9zqLSaH0wZ4ulSRETcyq/D/aU1ucT36cElI/p6uhQREbfy23DfeqCEzfuKmfetBAICjKfLERFxK78N97+t30+PkEDmTNCblkTE9/hluFfU1PPOtnyuGNufcH2Enoj4IJfC3Rgz3Riz0xiz2xhz3xn2m2OMscaYNPeV6H7vbjtERW0D103Uh3GIiG9qM9yNMYHAs8AMYBRwvTFmVAv7hQN3AuvdXaS7Ldm4n8SYMMYPjvJ0KSIiHcKVkfskYLe1do+1thZYAsxuYb+fA08A1W6sz+12Hynji/0lXDdxEMbohVQR8U2uhHsccKDZ7TzntibGmFRgkLX2HTfW1iHe2HiAoACja7aLiE8LcmGfloa3tulOYwKA3wLz2jyRMfOB+QCxsbFkZGS4VOSpysvL23VsvcOyZH0l42IC2b5pbbseu6tobx/4Cn9vP6gP/L39bXEl3POA5q88DgTym90OB0YDGc5pjn7AcmPMLGvtpuYnstY+DzwPkJaWZtPT09tVdEZGBu059uOvCyir3cSCy1NJHxXbrsfuKtrbB77C39sP6gN/b39bXJmW2QgkGWOGGGNCgLnA8hN3WmtLrbXR1toEa20CsA44Ldi7gn9+mU9Uj2CmDo/xdCkiIh2qzXC31tYDC4GVwA5gqbU20xjzqDFmVkcX6C5l1XWsyjzMFWMH6GP0RMTnuTItg7V2BbDilG0PtbJv+rmX5X4rMwuoqXdwVWpc2zuLiHg5vxnCvvXlQQb37sH4wZGeLkVEpMP5RbgXHK9mTc5RrkqN09p2EfELfhHuy7fkYy1crSkZEfETfhHu73x1iDFxEQyJDvN0KSIincLnwz2vuJKtB0qYOaa/p0sREek0Ph/u728/DMCM0f08XImISOfx+XB/b/thRvbvRYKmZETEj/h0uB8urWbzvmJmatQuIn7Gp8N9ZaZzSkbz7SLiZ3w63Fd8dYjhsT0Z1renp0sREelUPhvuhWU1bMg9xozRGrWLiP/x2XBfmXkYa9ESSBHxSz4b7u9tP8TQ6DCGx2pKRkT8j0+Ge1F5Dev2HGPGmH66loyI+CWfDPcPdxTQ4LCabxcRv+WT4Z6xs5B+vUJJHtDL06
WIiHiEz4V7fYODf+8+ysXDozUlIyJ+y+fCfWteCWXV9Vysz0kVET/mc+H+6a6jBBi4cFi0p0sREfEYnwv3tTlHGTMwksgeIZ4uRUTEY3wq3GvrHWzNK2VifJSnSxER8SifCvesQ8eprXcwXuEuIn7Op8L9i33FAExQuIuIn/OpcN+8v5i4yO7E9gr1dCkiIh7lU+H+5b5iUgdHeroMERGP85lwLyyrIb+0mpRBCncREZ8J9x2HjgMwSpccEBHxwXDvr3AXEfGZcM86dJwBEaF685KICD4U7jsOHWekRu0iIoCPhHt1XQM5hRWabxcRcfKJcM8uKKfBYTVyFxFx8olwzzpUCqBwFxFx8olw33GojB4hgcT37uHpUkREugSXwt0YM90Ys9MYs9sYc18L9//YGJNljNlmjPnIGBPv/lJbl5V/nBH9wgkI0CcviYiAC+FujAkEngVmAKOA640xo07Z7UsgzVo7FngTeMLdhbbGWsuOw1opIyLSnCsj90nAbmvtHmttLbAEmN18B2vtJ9baSufNdcBA95bZurziKsqq67VSRkSkGVfCPQ440Ox2nnNba34IvHcuRZ2NLOc7UzVyFxH5RpAL+7Q0kW1b3NGYm4A0YGor988H5gPExsaSkZHhWpWnKC8vbzr2vd21GKBw1xYy9vjPnHvzPvBH/t5+UB/4e/vb4kq45wGDmt0eCOSfupMx5tvAA8BUa21NSyey1j4PPA+QlpZm09PTz7ZeADIyMjhx7NKDm4nvc5zLvz2tXefyVs37wB/5e/tBfeDv7W+LK9MyG4EkY8wQY0wIMBdY3nwHY0wq8CdglrX2iPvLbF12QTnD+oZ35kOKiHR5bYa7tbYeWAisBHYAS621mcaYR40xs5y7PQn0BJYZY7YYY5a3cjq3qmtwsPdoBUmxPTvj4UREvIYr0zJYa1cAK07Z9lCzn7/t5rpcsq+oknqHZViMwl1EpDmvfofq7iNlABq5i4icwqvDPbugHIBEjdxFRE7i1eG+u7CcuMjuhHVzaXZJRMRveHW4N66U0ahdRORUXhvuDQ5LTmE5SQp3EZHTeG24HyyuoqbeoRdTRURa4LXhnu1cKaNpGRGR03lxuDeulBkWo3enioicymuXmew+Uk7f8G5E9Aj2dCnSTnV1deTl5VFdXX3Wx0ZERLBjx44OqMp7+Hsf+Hr7Q0NDGThwIMHB7cs4rw337CNaKePt8vLyCA8PJyEhAWPO7oqeZWVlhIf7919t/t4Hvtx+ay1FRUXk5eUxZMiQdp3DK6dlrLXkHNFKGW9XXV1Nnz59zjrYRXydMYY+ffq066/aE7wy3ItrLOU19QyL9c3/tf2Jgl2kZef6u+GV4X64ovGzQhKjwzxciXi7w4cPM3fuXBITExk1ahQzZ85k165dHfqY6enpbNq06Yz7PP3001RWVjbdnjlzJiUlJW6t4+WXX2bhwoUALFq0iFdffbXVfTMyMvj8889bvX/58uU8/vjjAMybN48333zzrGr55S9/edLtb33rW2d1/Nn6+uuvSUlJITU1lZycnJPuKy8v5/bbbycxMZHU1FQmTJjA4sWLO7SejuCl4e4AIEHhLufAWsvVV19Neno6OTk5ZGVl8ctf/pKCggJPl3ZauK9YsYLIyMgOe7wFCxZw8803t3r/mcK9vr6eWbNmcd9997X78U8N9zP9R+IOb731FrNnz+bLL78kMTHxpPtuueUWoqKiyM7O5ssvv+T999/n2LFjHVpPR/DKcD9S6aBbUAD9eoV6uhTxYp988gnBwcEsWLCgaVtKSgoXXXQRGRkZXHHFFU3bFy5cyMsvvwxAQkICP/3pT7ngggtIS0vjiy++4PLLLycxMZFFixYBnPH45m6//XbS0tJITk7m4YcfBuD3v/89+fn5TJs2jWnTpjU95tGjR7n33nv54x//2HT8I488wq9//WsAnnzySSZOnMjYsWObznWql156ieHDhzN16lTWrFlz0nmeeuqppscfNW
oUY8eOZe7cueTm5rJo0SJ++9vfkpKSwmeffca8efP48Y9/zLRp07j33ntP+isA4MMPP+Siiy5i+PDhvPPOOwCn7XPFFVeQkZHBfffdR1VVFSkpKdx4440A9OzZ+HqatZZ77rmH0aNHM2bMGN54442m/p05cyZz5sxhxIgR3HjjjVh7+qd/btmyhfPPP5+xY8dy9dVXU1xczIoVK3j66ad54YUXmvr3hJycHDZs2MAvfvELAgIa4zEmJoZ77723aZ+W+jk3N5eRI0dy6623kpyczGWXXUZVVVXTOadPn86ECRO46KKL+PrrrwFYtmwZo0ePZty4cVx88cUt/nudC69cLVNQaYnv04OAAM3X+oqf/SuTrPzjLu/f0NBAYGDgGfcZNaAXD1+Z3Or927dvZ8KECS4/ZnODBg1i7dq13H333cybN481a9ZQXV1NcnLySf9ZtOWxxx6jd+/eNDQ0cOmll7Jt2zbuvPNOfvOb3/DJJ58QHR190v5z587lrrvu4o477gBg6dKlvP/++6xatYrs7Gw2bNiAtZZZs2axevXqk0Lj0KFDPPzww2zevJmIiAimTZtGamrqaTU9/vjj7N27l27dulFSUkJkZCQLFiygZ8+e/OQnPwHgz3/+M7t27eLDDz8kMDDwtP+4cnNz+fTTT8nJyWHatGns3r271T54/PHH+cMf/sCWLVtOu+8f//gHW7ZsYevWrRw9epSJEyc2tWnbtm1kZmYyYMAApkyZwpo1a7jwwgtPOv7mm2/mmWeeYerUqTz00EP87Gc/4+mnnz6tPSdkZmYybty4pmA/VWv9PHjwYLKzs3n99ddZvHgx1157LX//+9+56aabmD9/PosWLSIpKYn169dzxx138PHHH/Poo4+ycuVK4uLi3D7lBl46ci+ocJDQR1My4jmzZjV+CNmYMWOYPHky4eHhxMTEEBoaela/qEuXLmX8+PGkpqaSmZlJVlbWGfdPTU3lyJEj5Ofn89VXXxEVFcXgwYNZtWoVq1atIjU1lfHjx/P111+TnZ190rHr168nPT2dmJgYQkJCuO6661p8jLFjx3LjjTfy2muvERTU+vjvmmuuafU/2GuvvZaAgACSkpIYOnRo02j1bP373//m+uuvJzDe2LXlAAAIlUlEQVQwkNjYWKZOncrGjRsBmDBhAgMHDiQgIICUlBRyc3NPOra0tJSSkhKmTp0KwPe+9z1Wr159Vo//2GOPkZKSwoABAwDO2M9DhgwhJSWlqbbc3FzKy8v5/PPPueaaa0hJSeG2227j0KFDAEyZMoV58+axePFiGhoa2tU/Z+J1I/cGh+VIpWWI5tt9yplG2C1xxxrn5OTkVl/4CwoKwuFwNN0+dUlat27dAAgICGj6+cTt+vr6No8H2Lt3L0899RQbN24kKiqKefPmubT0bc6cObz55pvs37+fuXPnAo3TF/fffz+33XbbGY91ZQXGu+++y+rVq1m+fDk///nPyczMbHG/sLDWfwdPfRxjjEt9cqqWplpOCAkJafo5MDCQ+vr6Ns/XllGjRrF161YcDgcBAQE88MADPPDAAydNE7XUz7m5uSc9DwIDA6mqqsLhcBAZGdniXyWLFi1i/fr1vPvuu6SkpLBlyxb69Olzzm04wetG7odKq6i3EK+Ru5yjSy65hJqampNWQmzcuJFPP/2U+Ph4srKyqKmpobS0lI8++uiszu3K8cePHycsLIyIiAgKCgp47733mu4LDw+nrKysxXPPnTuXJUuW8NZbbzFnzhwALr/8cl588UXKyxsvy3Hw4EGOHDn5s+onT55MRkYGRUVF1NXVsWzZstPO7XA4OHDgANOmTeOJJ56gpKSE8vLyM9bTkmXLluFwOMjJyWHPnj2cd955JCQksGXLlqbH2LBhQ9P+wcHB1NXVnXaeiy++mDfeeIOGhgYKCwtZvXo1kyZNcqmGiIgIoqKi+OyzzwD4y1/+0jSKb82wYcNIS0vjwQcfbBpNV1dXN/0n40o/N9
erVy+GDBnS1NfWWrZu3Qo0zsVPnjyZRx99lOjoaA4cOOBSu1zldSP33KONKwgSont4uBLxdsYY/vnPf3LXXXfx+OOPExoaSkJCAk8//TSDBg3i2muvZezYsSQlJbU4N30mrhw/btw4UlNTSU5OZujQoUyZMqXpvvnz5zNjxgz69+/PJ598ctJxycnJlJWVMWDAAPr37w/AZZddxo4dO7jggguAxhckX3vtNfr27dt0XP/+/XnkkUe44IIL6N+/P+PHjz9tOqChoYGbbrqJ0tJSrLXcfffdREZGcuWVVzJnzhzefvttnnnmmTbbf9555zF16lQKCgpYtGgRoaGhTJkyhSFDhjBmzBhGjx7N+PHjT2rv2LFjGT9+PH/961+btl999dWsXbuWcePGYYzhiSeeoF+/fi5P87zyyissWLCAyspKhg4dyksvvdTmMS+88AL33HMPw4YNo3fv3nTv3p1f/epXQOv9fKbXf/76179y++2384tf/IK6ujrmzp3LuHHjuOeee8jOzsZay6WXXsq4ceNcapOrzJn+7OlIaWlptq21vi15bd0+HnxrO2vvv4T+Ed07oDLvkJGRQXp6uqfLOCc7duxg5MiR7TrWl9967ip/7wN/aH9LvyPGmM3W2rS2jvW6aZm+4d1I7RtIbLiWQYqItMbrpmUuS+5HSGGolkGKiJyB143cRUSkbQp38ShPveYj0tWd6++Gwl08JjQ0lKKiIgW8yClOXM89NLT9ry163Zy7+I6BAweSl5dHYWHhWR9bXV19Tk98X+DvfeDr7T/xSUztpXAXjwkODm73p8xkZGSc9dpzX+PvfeDv7W+LpmVERHyQwl1ExAcp3EVEfJDHLj9gjCkE9rXz8GjgqBvL8Ub+3gf+3n5QH/hr++OttTFt7eSxcD8XxphNrlxbwZf5ex/4e/tBfeDv7W+LpmVERHyQwl1ExAd5a7g/7+kCugB/7wN/bz+oD/y9/WfklXPuIiJyZt46chcRkTPwunA3xkw3xuw0xuw2xtzn6Xo6gzEm1xjzlTFmizFmk3Nbb2PMB8aYbOf3KE/X6U7GmBeNMUeMMdubbWuxzabR753PiW3GmPGtn9k7tNL+R4wxB53Pgy3GmJnN7rvf2f6dxpjLPVO1exljBhljPjHG7DDGZBpj/tu53W+eB+fCq8LdGBMIPAvMAEYB1xtjRnm2qk4zzVqb0mzp133AR9baJOAj521f8jIw/ZRtrbV5BpDk/JoPPNdJNXaklzm9/QC/dT4PUqy1KwCcvwNzgWTnMX90/q54u3rgf6y1I4HzgR852+pPz4N286pwByYBu621e6y1tcASYLaHa/KU2cArzp9fAa7yYC1uZ61dDRw7ZXNrbZ4NvGobrQMijTH9O6fSjtFK+1szG1hira2x1u4FdtP4u+LVrLWHrLVfOH8uA3YAcfjR8+BceFu4xwEHmt3Oc27zdRZYZYzZbIyZ79wWa609BI2/BEDfVo/2Ha212Z+eFwudUw4vNpuK8/n2G2MSgFRgPXoeuMTbwr2lD071h+U+U6y142n8s/NHxpiLPV1QF+Mvz4vngEQgBTgE/Nq53afbb4zpCfwduMtae/xMu7awzWf64Wx5W7jnAYOa3R4I5Huolk5jrc13fj8C/JPGP7kLTvzJ6fx+xHMVdprW2uwXzwtrbYG1tsFa6wAW883Ui8+23xgTTGOw/9Va+w/nZr9+HrjK28J9I5BkjBlijAmh8UWk5R6uqUMZY8KMMeEnfgYuA7bT2O7vOXf7HvC2ZyrsVK21eTlws3O1xPlA6Yk/233JKfPHV9P4PIDG9s81xnQzxgyh8QXFDZ1dn7sZYwzwZ2CHtfY3ze7y6+eBy6y1XvUFzAR2ATnAA56upxPaOxTY6vzKPNFmoA+NKwWynd97e7pWN7f7dRqnHupoHJH9sLU20/jn+LPO58RXQJqn6++g9v/F2b5tNAZZ/2b7P+Bs/05ghqfrd1MfXEjjtMo2YIvza6Y/PQ/O5U
vvUBUR8UHeNi0jIiIuULiLiPgghbuIiA9SuIuI+CCFu4iID1K4i4j4IIW7iIgPUriLiPig/w8CRdhSkkDnbQAAAABJRU5ErkJggg==\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"s = sum(unique_genes.values);\n",
"h = unique_genes.values/s;\n",
"c = np.cumsum(h)\n",
"plt.plot(c,label='Cumulative distribution of Genes')\n",
"plt.grid()\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, now we need to convert these categorical variable to appropirate format which my machine learning algorithm will be able to take as an input.\n",
"\n",
"So we have 2 techniques to deal with it. \n",
"\n",
"<ol><li>\n",
" ***One-hot encoding*** </li>\n",
" <li> ***Response Encoding*** </li>\n",
"</ol>\n",
"\n",
"Let's use both of them to see which one work the best. So lets start encoding using one hot encoder"
]
},
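{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before applying them to the real data, here is a toy sketch of what the two encodings produce. It rests on assumptions: the data frame `toy` is made up, and response encoding is taken here to mean replacing each category with the class proportions observed for it in the training data. The notebook's own implementation may differ, for example by smoothing those proportions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# toy data (hypothetical): a categorical feature and a class label\n",
"toy = pd.DataFrame({'Gene': ['A', 'A', 'B', 'B', 'B'],\n",
"                    'Class': [1, 2, 1, 1, 2]})\n",
"# one-hot encoding: one indicator column per distinct gene\n",
"one_hot = pd.get_dummies(toy['Gene'])\n",
"# response encoding: one column per class, holding the share of rows\n",
"# of that gene that belong to the class (no smoothing in this sketch)\n",
"response = pd.crosstab(toy['Gene'], toy['Class'], normalize='index')\n",
"print(one_hot)\n",
"print(response)"
]
},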
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [],
"source": [
"# one-hot encoding of Gene feature.\n",
"gene_vectorizer = CountVectorizer()\n",
"train_gene_feature_onehotCoding = gene_vectorizer.fit_transform(train_df['Gene'])\n",
"test_gene_feature_onehotCoding = gene_vectorizer.transform(test_df['Gene'])\n",
"cv_gene_feature_onehotCoding = gene_vectorizer.transform(cv_df['Gene'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's check the number of column generated after one hot encoding. One hot encoding will always return higher number of column."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2124, 236)"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_gene_feature_onehotCoding.shape"
]
},
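{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the training data has 237 unique genes but only 236 columns. A likely explanation is that `CountVectorizer` tokenises its input: with default settings it lower-cases the text, splits on non-word characters such as hyphens, and drops single-character tokens, so hyphenated names can collapse into a single token. The sketch below illustrates this on a few gene names chosen for illustration. It is not a check against the actual dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"# example gene names chosen for illustration\n",
"example_genes = ['BRCA1', 'HLA-A', 'HLA-B', 'NKX2-1']\n",
"example_vectorizer = CountVectorizer()\n",
"example_vectorizer.fit(example_genes)\n",
"# compare the number of distinct names with the learned vocabulary\n",
"print(len(set(example_genes)), 'names ->', sorted(example_vectorizer.vocabulary_))"
]
},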
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['abl1',\n",
" 'acvr1',\n",
" 'ago2',\n",
" 'akt1',\n",
" 'akt2',\n",
" 'akt3',\n",
" 'alk',\n",
" 'apc',\n",
" 'ar',\n",
" 'araf',\n",
" 'arid1a',\n",
" 'arid1b',\n",
" 'arid2',\n",
" 'arid5b',\n",
" 'asxl2',\n",
" 'atm',\n",
" 'atr',\n",
" 'aurka',\n",
" 'axin1',\n",
" 'b2m',\n",
" 'bap1',\n",
" 'bcl10',\n",
" 'bcl2',\n",
" 'bcl2l11',\n",
" 'bcor',\n",
" 'braf',\n",
" 'brca1',\n",
" 'brca2',\n",
" 'brd4',\n",
" 'brip1',\n",
" 'btk',\n",
" 'card11',\n",
" 'carm1',\n",
" 'casp8',\n",
" 'cbl',\n",
" 'ccnd1',\n",
" 'ccnd2',\n",
" 'ccnd3',\n",
" 'cdh1',\n",
" 'cdk12',\n",
" 'cdk4',\n",
" 'cdk6',\n",
" 'cdk8',\n",
" 'cdkn1a',\n",
" 'cdkn1b',\n",
" 'cdkn2a',\n",
" 'cdkn2b',\n",
" 'cebpa',\n",
" 'chek2',\n",
" 'cic',\n",
" 'crebbp',\n",
" 'ctcf',\n",
" 'ctla4',\n",
" 'ctnnb1',\n",
" 'ddr2',\n",
" 'dicer1',\n",
" 'dnmt3a',\n",
" 'dusp4',\n",
" 'egfr',\n",
" 'eif1ax',\n",
" 'elf3',\n",
" 'ep300',\n",
" 'epas1',\n",
" 'erbb2',\n",
" 'erbb3',\n",
" 'erbb4',\n",
" 'ercc2',\n",
" 'ercc3',\n",
" 'ercc4',\n",
" 'erg',\n",
" 'esr1',\n",
" 'etv1',\n",
" 'etv6',\n",
" 'ewsr1',\n",
" 'ezh2',\n",
" 'fam58a',\n",
" 'fanca',\n",
" 'fancc',\n",
" 'fat1',\n",
" 'fbxw7',\n",
" 'fgf19',\n",
" 'fgfr1',\n",
" 'fgfr2',\n",
" 'fgfr3',\n",
" 'fgfr4',\n",
" 'flt1',\n",
" 'flt3',\n",
" 'foxa1',\n",
" 'foxl2',\n",
" 'foxo1',\n",
" 'foxp1',\n",
" 'fubp1',\n",
" 'gata3',\n",
" 'gnaq',\n",
" 'gnas',\n",
" 'h3f3a',\n",
" 'hist1h1c',\n",
" 'hla',\n",
" 'hnf1a',\n",
" 'hras',\n",
" 'idh1',\n",
" 'idh2',\n",
" 'igf1r',\n",
" 'ikbke',\n",
" 'ikzf1',\n",
" 'il7r',\n",
" 'inpp4b',\n",
" 'jak1',\n",
" 'jak2',\n",
" 'jun',\n",
" 'kdm5a',\n",
" 'kdm5c',\n",
" 'kdm6a',\n",
" 'kdr',\n",
" 'keap1',\n",
" 'kit',\n",
" 'klf4',\n",
" 'kmt2a',\n",
" 'kmt2c',\n",
" 'kmt2d',\n",
" 'knstrn',\n",
" 'kras',\n",
" 'map2k1',\n",
" 'map2k2',\n",
" 'map2k4',\n",
" 'map3k1',\n",
" 'mapk1',\n",
" 'mdm4',\n",
" 'med12',\n",
" 'mef2b',\n",
" 'met',\n",
" 'mga',\n",
" 'mlh1',\n",
" 'mpl',\n",
" 'msh2',\n",
" 'msh6',\n",
" 'mtor',\n",
" 'myc',\n",
" 'mycn',\n",
" 'myd88',\n",
" 'ncor1',\n",
" 'nf1',\n",
" 'nf2',\n",
" 'nfe2l2',\n",
" 'nfkbia',\n",
" 'nkx2',\n",
" 'notch1',\n",
" 'npm1',\n",
" 'nras',\n",
" 'nsd1',\n",
" 'ntrk1',\n",
" 'ntrk2',\n",
" 'ntrk3',\n",
" 'nup93',\n",
" 'pak1',\n",
" 'pax8',\n",
" 'pbrm1',\n",
" 'pdgfra',\n",
" 'pdgfrb',\n",
" 'pik3ca',\n",
" 'pik3cb',\n",
" 'pik3cd',\n",
" 'pik3r1',\n",
" 'pik3r2',\n",
" 'pik3r3',\n",
" 'pim1',\n",
" 'pms1',\n",
" 'pms2',\n",
" 'pole',\n",
" 'ppm1d',\n",
" 'ppp2r1a',\n",
" 'ppp6c',\n",
" 'prdm1',\n",
" 'ptch1',\n",
" 'pten',\n",
" 'ptpn11',\n",
" 'ptprd',\n",
" 'ptprt',\n",
" 'rac1',\n",
" 'rad21',\n",
" 'rad50',\n",
" 'rad51b',\n",
" 'rad51c',\n",
" 'rad51d',\n",
" 'rad54l',\n",
" 'raf1',\n",
" 'rara',\n",
" 'rasa1',\n",
" 'rb1',\n",
" 'rbm10',\n",
" 'ret',\n",
" 'rheb',\n",
" 'rhoa',\n",
" 'rictor',\n",
" 'rit1',\n",
" 'rnf43',\n",
" 'ros1',\n",
" 'rras2',\n",
" 'runx1',\n",
" 'rxra',\n",
" 'sdhb',\n",
" 'setd2',\n",
" 'sf3b1',\n",
" 'shoc2',\n",
" 'shq1',\n",
" 'smad2',\n",
" 'smad3',\n",
" 'smad4',\n",
" 'smarca4',\n",
" 'smarcb1',\n",
" 'smo',\n",
" 'sos1',\n",
" 'sox9',\n",
" 'spop',\n",
" 'src',\n",
" 'srsf2',\n",
" 'stag2',\n",
" 'stat3',\n",
" 'stk11',\n",
" 'tert',\n",
" 'tet1',\n",
" 'tet2',\n",
" 'tgfbr1',\n",
" 'tgfbr2',\n",
" 'tmprss2',\n",
" 'tp53',\n",
" 'tp53bp1',\n",
" 'tsc1',\n",
" 'tsc2',\n",
" 'u2af1',\n",
" 'vhl',\n",
" 'whsc1',\n",
" 'whsc1l1',\n",
" 'xpo1',\n",
" 'xrcc2',\n",
" 'yap1']"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#column names after one-hot encoding for Gene column\n",
"gene_vectorizer.get_feature_names()"
]
},
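{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's also create response-encoding columns for the Gene column. Response encoding replaces each gene with a 9-element vector of Laplace-smoothed class probabilities estimated from the training data."
]
},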
{
"cell_type": "code",
"execution_count": 61,
"metadata": {},
"outputs": [],
"source": [
"def get_gv_fea_dict(alpha, feature, df):\n",
"    # Class probabilities are always estimated from train_df (never from df),\n",
"    # so no label information leaks from the CV or test sets.\n",
"    value_count = train_df[feature].value_counts()\n",
"    gv_dict = dict()\n",
"    for i, denominator in value_count.items():\n",
"        # vec[k-1] = Laplace-smoothed P(Class == k | feature == i)\n",
"        vec = []\n",
"        for k in range(1,10):\n",
"            cls_cnt = train_df.loc[(train_df['Class']==k) & (train_df[feature]==i)]\n",
"            vec.append((cls_cnt.shape[0] + alpha*10)/ (denominator + 90*alpha))\n",
"        gv_dict[i]=vec\n",
"    return gv_dict\n",
"\n",
"# Get Gene variation feature\n",
"def get_gv_feature(alpha, feature, df):\n",
"    gv_dict = get_gv_fea_dict(alpha, feature, df)\n",
"    gv_fea = []\n",
"    for index, row in df.iterrows():\n",
"        if row[feature] in gv_dict:\n",
"            gv_fea.append(gv_dict[row[feature]])\n",
"        else:\n",
"            # value never seen in train_df: fall back to a uniform distribution\n",
"            gv_fea.append([1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9,1/9])\n",
"    return gv_fea"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [],
"source": [
"#response-coding of the Gene feature\n",
"# alpha is used for laplace smoothing\n",
"alpha = 1\n",
"# train gene feature\n",
"train_gene_feature_responseCoding = np.array(get_gv_feature(alpha, \"Gene\", train_df))\n",
"# test gene feature\n",
"test_gene_feature_responseCoding = np.array(get_gv_feature(alpha, \"Gene\", test_df))\n",
"# cross validation gene feature\n",
"cv_gene_feature_responseCoding = np.array(get_gv_feature(alpha, \"Gene\", cv_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the columns after applying response encoding. We should have 9 columns for the Gene column, one per class."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2124, 9)"
]
},
"execution_count": 63,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_gene_feature_responseCoding.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, the question is how well the Gene feature predicts the 9 classes. One idea is to build a simple model, such as logistic regression, using only the one-hot encoded Gene column. If its log loss is better than that of a random model, then this feature is useful."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"# Candidate values of the regularization hyperparameter alpha for the SGD classifier.\n",
"alpha = [10 ** x for x in range(-5, 1)]"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of alpha = 1e-05 The log loss is: 1.4401833474173025\n",
"For values of alpha = 0.0001 The log loss is: 1.2188767582959434\n",
"For values of alpha = 0.001 The log loss is: 1.229607393327489\n",
"For values of alpha = 0.01 The log loss is: 1.342428066028162\n",
"For values of alpha = 0.1 The log loss is: 1.438846650223064\n",
"For values of alpha = 1 The log loss is: 1.4745241197299315\n"
]
}
],
"source": [
"# Fit an SGD classifier with logistic loss for each alpha, calibrate it,\n",
"# and record the log loss on the cross-validation set\n",
"cv_log_error_array=[]\n",
"for i in alpha:\n",
"    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)\n",
"    clf.fit(train_gene_feature_onehotCoding, y_train)\n",
"    sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"    sig_clf.fit(train_gene_feature_onehotCoding, y_train)\n",
"    predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)\n",
"    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"    print('For values of alpha = ', i, \"The log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot the CV log loss for each alpha to pick the best value\n",
"fig, ax = plt.subplots()\n",
"ax.plot(alpha, cv_log_error_array,c='g')\n",
"for i, txt in enumerate(np.round(cv_log_error_array,3)):\n",
"    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))\n",
"plt.grid()\n",
"plt.title(\"Cross Validation Error for each alpha\")\n",
"plt.xlabel(\"Alpha i's\")\n",
"plt.ylabel(\"Error measure\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of best alpha = 0.0001 The train log loss is: 1.0577218419057752\n",
"For values of best alpha = 0.0001 The cross validation log loss is: 1.2188767582959434\n",
"For values of best alpha = 0.0001 The test log loss is: 1.1850964820703247\n"
]
}
],
"source": [
"# Lets use best alpha value as we can see from above graph and compute log loss\n",
"best_alpha = np.argmin(cv_log_error_array)\n",
"clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)\n",
"clf.fit(train_gene_feature_onehotCoding, y_train)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_gene_feature_onehotCoding, y_train)\n",
"\n",
"predict_y = sig_clf.predict_proba(train_gene_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The train log loss is:\",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(cv_gene_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The cross validation log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(test_gene_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The test log loss is:\",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's check how many rows in the test and cross-validation sets have a Gene value that also appears in the training set."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [],
"source": [
"test_coverage=test_df[test_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]\n",
"cv_coverage=cv_df[cv_df['Gene'].isin(list(set(train_df['Gene'])))].shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. In test data 645 out of 665 : 96.99248120300751\n",
"2. In cross validation data 520 out of 532 : 97.74436090225564\n"
]
}
],
"source": [
"print('1. In test data',test_coverage, 'out of',test_df.shape[0], \":\",(test_coverage/test_df.shape[0])*100)\n",
"print('2. In cross validation data',cv_coverage, 'out of ',cv_df.shape[0],\":\" ,(cv_coverage/cv_df.shape[0])*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating Variation column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Variation is also a categorical variable, so we handle it the same way as the ***Gene*** column. We will again build both the one-hot encoding and the response encoding for the Variation column."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of Unique Variations : 1932\n",
"Truncating_Mutations 62\n",
"Deletion 47\n",
"Amplification 44\n",
"Fusions 23\n",
"T58I 3\n",
"Overexpression 3\n",
"G12V 3\n",
"T73I 2\n",
"G12C 2\n",
"Q61R 2\n",
"Name: Variation, dtype: int64\n"
]
}
],
"source": [
"unique_variations = train_df['Variation'].value_counts()\n",
"print('Number of Unique Variations :', unique_variations.shape[0])\n",
"# the 10 variations that occurred most often\n",
"print(unique_variations.head(10))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the cumulative distribution of unique ***Variation*** values"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0.02919021 0.05131827 0.0720339 ... 0.99905838 0.99952919 1. ]\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# fraction of training rows covered by each variation, most frequent first\n",
"s = sum(unique_variations.values)\n",
"h = unique_variations.values/s\n",
"c = np.cumsum(h)\n",
"print(c)\n",
"plt.plot(c,label='Cumulative distribution of Variations')\n",
"plt.grid()\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's convert the Variation column using one-hot encoding"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [],
"source": [
"# one-hot encoding of variation feature.\n",
"variation_vectorizer = CountVectorizer()\n",
"train_variation_feature_onehotCoding = variation_vectorizer.fit_transform(train_df['Variation'])\n",
"test_variation_feature_onehotCoding = variation_vectorizer.transform(test_df['Variation'])\n",
"cv_variation_feature_onehotCoding = variation_vectorizer.transform(cv_df['Variation'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the shape of the one-hot encoded Variation feature"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2124, 1965)"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_variation_feature_onehotCoding.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also generate the response encoding for the Variation column, as we did for Gene."
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {},
"outputs": [],
"source": [
"# alpha is used for Laplace smoothing\n",
"alpha = 1\n",
"# train variation feature\n",
"train_variation_feature_responseCoding = np.array(get_gv_feature(alpha, \"Variation\", train_df))\n",
"# test variation feature\n",
"test_variation_feature_responseCoding = np.array(get_gv_feature(alpha, \"Variation\", test_df))\n",
"# cross validation variation feature\n",
"cv_variation_feature_responseCoding = np.array(get_gv_feature(alpha, \"Variation\", cv_df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the shape of this response-encoding result"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2124, 9)"
]
},
"execution_count": 75,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_variation_feature_responseCoding.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's again build a model using only the ***Variation*** column"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {},
"outputs": [],
"source": [
"# Candidate values of the regularization hyperparameter alpha for the SGD classifier.\n",
"alpha = [10 ** x for x in range(-5, 1)]"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of alpha = 1e-05 The log loss is: 1.70313977051604\n",
"For values of alpha = 0.0001 The log loss is: 1.700413046723949\n",
"For values of alpha = 0.001 The log loss is: 1.7061607868280722\n",
"For values of alpha = 0.01 The log loss is: 1.7116356597023152\n",
"For values of alpha = 0.1 The log loss is: 1.7142090078997774\n",
"For values of alpha = 1 The log loss is: 1.7160127673699193\n"
]
}
],
"source": [
"# Fit an SGD classifier with logistic loss for each alpha, calibrate it,\n",
"# and record the log loss on the cross-validation set\n",
"cv_log_error_array=[]\n",
"for i in alpha:\n",
"    clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)\n",
"    clf.fit(train_variation_feature_onehotCoding, y_train)\n",
"\n",
"    sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"    sig_clf.fit(train_variation_feature_onehotCoding, y_train)\n",
"    predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)\n",
"\n",
"    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"    print('For values of alpha = ', i, \"The log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# Plot the CV log loss for each alpha to pick the best value\n",
"fig, ax = plt.subplots()\n",
"ax.plot(alpha, cv_log_error_array,c='g')\n",
"for i, txt in enumerate(np.round(cv_log_error_array,3)):\n",
"    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))\n",
"plt.grid()\n",
"plt.title(\"Cross Validation Error for each alpha\")\n",
"plt.xlabel(\"Alpha i's\")\n",
"plt.ylabel(\"Error measure\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of best alpha = 0.0001 The train log loss is: 0.6986288720503351\n",
"For values of best alpha = 0.0001 The cross validation log loss is: 1.700413046723949\n",
"For values of best alpha = 0.0001 The test log loss is: 1.7200311395645975\n"
]
}
],
"source": [
"best_alpha = np.argmin(cv_log_error_array)\n",
"clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)\n",
"clf.fit(train_variation_feature_onehotCoding, y_train)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_variation_feature_onehotCoding, y_train)\n",
"\n",
"predict_y = sig_clf.predict_proba(train_variation_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The train log loss is:\",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(cv_variation_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The cross validation log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(test_variation_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The test log loss is:\",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {},
"outputs": [],
"source": [
"test_coverage=test_df[test_df['Variation'].isin(list(set(train_df['Variation'])))].shape[0]\n",
"cv_coverage=cv_df[cv_df['Variation'].isin(list(set(train_df['Variation'])))].shape[0]"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1. In test data 69 out of 665 : 10.37593984962406\n",
"2. In cross validation data 58 out of 532 : 10.902255639097744\n"
]
}
],
"source": [
"print('1. In test data',test_coverage, 'out of',test_df.shape[0], \":\",(test_coverage/test_df.shape[0])*100)\n",
"print('2. In cross validation data',cv_coverage, 'out of ',cv_df.shape[0],\":\" ,(cv_coverage/cv_df.shape[0])*100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating Text column"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {},
"outputs": [],
"source": [
"def extract_dictionary_paddle(cls_text):\n",
"    # Count how often each whitespace-separated word occurs in the TEXT column\n",
"    dictionary = defaultdict(int)\n",
"    for index, row in cls_text.iterrows():\n",
"        for word in row['TEXT'].split():\n",
"            dictionary[word] +=1\n",
"    return dictionary"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"def get_text_responsecoding(df):\n",
"    # One column per class. Each entry is the geometric mean, over the words of a\n",
"    # document, of the smoothed ratio (count in class + 10) / (total count + 90).\n",
"    text_feature_responseCoding = np.zeros((df.shape[0],9))\n",
"    for i in range(0,9):\n",
"        row_index = 0\n",
"        for index, row in df.iterrows():\n",
"            sum_prob = 0\n",
"            for word in row['TEXT'].split():\n",
"                sum_prob += math.log(((dict_list[i].get(word,0)+10 )/(total_dict.get(word,0)+90)))\n",
"            text_feature_responseCoding[row_index][i] = math.exp(sum_prob/len(row['TEXT'].split()))\n",
"            row_index += 1\n",
"    return text_feature_responseCoding"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of unique words in train data : 54911\n"
]
}
],
"source": [
"# build a CountVectorizer over all words that occur in at least 3 training documents (min_df=3)\n",
"text_vectorizer = CountVectorizer(min_df=3)\n",
"train_text_feature_onehotCoding = text_vectorizer.fit_transform(train_df['TEXT'])\n",
"# getting all the feature names (words)\n",
"train_text_features= text_vectorizer.get_feature_names()\n",
"\n",
"# total number of occurrences of each word across the training documents\n",
"train_text_fea_counts = train_text_feature_onehotCoding.sum(axis=0).A1\n",
"\n",
"text_fea_dict = dict(zip(list(train_text_features),train_text_fea_counts))\n",
"\n",
"\n",
"print(\"Total number of unique words in train data :\", len(train_text_features))"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {},
"outputs": [],
"source": [
"# dict_list holds 9 dictionaries of word counts, one per class\n",
"dict_list = []\n",
"for i in range(1,10):\n",
"    cls_text = train_df[train_df['Class']==i]\n",
"    dict_list.append(extract_dictionary_paddle(cls_text))\n",
"\n",
"# word counts over the whole training set\n",
"total_dict = extract_dictionary_paddle(train_df)\n",
"\n",
"\n",
"# confuse_array[w][j]: smoothed share of word w's occurrences that fall in class j+1\n",
"confuse_array = []\n",
"for i in train_text_features:\n",
"    ratios = []\n",
"    for j in range(0,9):\n",
"        ratios.append((dict_list[j][i]+10 )/(total_dict[i]+90))\n",
"    confuse_array.append(ratios)\n",
"confuse_array = np.array(confuse_array)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"#response coding of text features\n",
"train_text_feature_responseCoding = get_text_responsecoding(train_df)\n",
"test_text_feature_responseCoding = get_text_responsecoding(test_df)\n",
"cv_text_feature_responseCoding = get_text_responsecoding(cv_df)"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [],
"source": [
"# we convert each row values such that they sum to 1 \n",
"train_text_feature_responseCoding = (train_text_feature_responseCoding.T/train_text_feature_responseCoding.sum(axis=1)).T\n",
"test_text_feature_responseCoding = (test_text_feature_responseCoding.T/test_text_feature_responseCoding.sum(axis=1)).T\n",
"cv_text_feature_responseCoding = (cv_text_feature_responseCoding.T/cv_text_feature_responseCoding.sum(axis=1)).T"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [],
"source": [
"# don't forget to normalize every feature\n",
"train_text_feature_onehotCoding = normalize(train_text_feature_onehotCoding, axis=0)\n",
"\n",
"# we use the same vectorizer that was trained on train data\n",
"test_text_feature_onehotCoding = text_vectorizer.transform(test_df['TEXT'])\n",
"test_text_feature_onehotCoding = normalize(test_text_feature_onehotCoding, axis=0)\n",
"\n",
"# we use the same vectorizer that was trained on train data\n",
"cv_text_feature_onehotCoding = text_vectorizer.transform(cv_df['TEXT'])\n",
"cv_text_feature_onehotCoding = normalize(cv_text_feature_onehotCoding, axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [],
"source": [
"sorted_text_fea_dict = dict(sorted(text_fea_dict.items(), key=lambda x: x[1] , reverse=True))\n",
"sorted_text_occur = np.array(list(sorted_text_fea_dict.values()))"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({3: 5437, 4: 4242, 5: 3096, 6: 3060, 8: 2118, 7: 1911, 9: 1821, 12: 1310, 11: 1299, 10: 1269, 14: 1024, 15: 1016, 13: 978, 16: 932, 18: 660, 20: 658, 17: 623, 21: 540, 24: 508, 19: 491, 22: 470, 23: 431, 26: 405, 28: 375, 30: 369, 27: 366, 25: 359, 41: 327, 48: 318, 33: 305, 29: 278, 40: 276, 36: 276, 32: 260, 35: 259, 37: 251, 31: 240, 34: 230, 39: 212, 42: 204, 38: 203, 45: 176, 51: 175, 44: 173, 43: 158, 56: 155, 52: 155, 49: 149, 58: 148, 46: 146, 54: 145, 47: 143, 60: 139, 50: 139, 57: 136, 53: 134, 61: 124, 66: 121, 63: 121, 70: 119, 55: 119, 65: 114, 69: 111, 62: 111, 64: 104, 68: 99, 72: 96, 59: 95, 73: 91, 77: 87, 67: 87, 84: 86, 74: 86, 75: 83, 78: 81, 80: 80, 76: 78, 83: 77, 71: 76, 93: 74, 96: 73, 87: 73, 92: 71, 79: 71, 90: 70, 85: 70, 102: 68, 82: 68, 120: 66, 88: 65, 105: 64, 99: 62, 95: 62, 91: 62, 86: 62, 81: 60, 100: 56, 97: 56, 119: 54, 112: 54, 89: 54, 124: 52, 108: 52, 98: 52, 113: 51, 94: 51, 110: 50, 104: 49, 128: 47, 106: 47, 103: 47, 107: 46, 126: 44, 101: 43, 144: 42, 116: 42, 141: 41, 131: 41, 122: 41, 111: 41, 109: 41, 152: 40, 125: 40, 121: 40, 158: 38, 154: 38, 138: 38, 135: 38, 130: 38, 143: 37, 117: 37, 140: 36, 123: 36, 132: 35, 114: 35, 115: 34, 173: 33, 157: 33, 151: 33, 137: 33, 156: 32, 150: 32, 136: 32, 170: 31, 155: 31, 149: 31, 146: 31, 145: 31, 133: 31, 127: 31, 168: 30, 134: 30, 118: 30, 187: 29, 177: 29, 166: 29, 191: 28, 181: 28, 176: 28, 164: 28, 160: 28, 192: 27, 142: 27, 139: 27, 129: 27, 148: 26, 199: 25, 186: 25, 179: 25, 169: 25, 198: 24, 184: 24, 162: 24, 153: 24, 147: 24, 205: 23, 197: 23, 178: 23, 165: 23, 253: 22, 228: 22, 203: 22, 195: 22, 163: 22, 243: 21, 222: 21, 221: 21, 207: 21, 200: 21, 196: 21, 190: 21, 183: 21, 180: 21, 167: 21, 159: 21, 245: 20, 244: 20, 219: 20, 215: 20, 209: 20, 202: 20, 194: 20, 188: 20, 172: 20, 276: 19, 218: 19, 204: 19, 344: 18, 261: 18, 216: 18, 214: 18, 212: 18, 189: 18, 175: 18, 174: 18, 161: 18, 383: 17, 309: 17, 288: 17, 268: 17, 264: 17, 252: 17, 250: 17, 237: 17, 
220: 17, 217: 17, 206: 17, 182: 17, 372: 16, 298: 16, 267: 16, 256: 16, 240: 16, 227: 16, 211: 16, 208: 16, 185: 16, 171: 16, 294: 15, 289: 15, 280: 15, 270: 15, 265: 15, 260: 15, 238: 15, 234: 15, 230: 15, 226: 15, 223: 15, 201: 15, 326: 14, 300: 14, 299: 14, 297: 14, 263: 14, 258: 14, 251: 14, 246: 14, 229: 14, 225: 14, 224: 14, 213: 14, 210: 14, 358: 13, 342: 13, 335: 13, 317: 13, 314: 13, 304: 13, 285: 13, 283: 13, 277: 13, 273: 13, 266: 13, 255: 13, 248: 13, 242: 13, 474: 12, 341: 12, 332: 12, 328: 12, 322: 12, 312: 12, 308: 12, 302: 12, 286: 12, 282: 12, 279: 12, 272: 12, 254: 12, 239: 12, 231: 12, 387: 11, 370: 11, 331: 11, 323: 11, 287: 11, 284: 11, 274: 11, 241: 11, 233: 11, 193: 11, 463: 10, 453: 10, 425: 10, 404: 10, 393: 10, 376: 10, 373: 10, 371: 10, 347: 10, 340: 10, 336: 10, 330: 10, 318: 10, 316: 10, 311: 10, 306: 10, 301: 10, 292: 10, 291: 10, 278: 10, 269: 10, 257: 10, 249: 10, 247: 10, 232: 10, 528: 9, 475: 9, 457: 9, 455: 9, 435: 9, 428: 9, 414: 9, 384: 9, 381: 9, 356: 9, 353: 9, 351: 9, 345: 9, 338: 9, 333: 9, 321: 9, 310: 9, 307: 9, 305: 9, 281: 9, 262: 9, 236: 9, 835: 8, 653: 8, 542: 8, 518: 8, 462: 8, 458: 8, 444: 8, 437: 8, 427: 8, 413: 8, 412: 8, 400: 8, 396: 8, 391: 8, 382: 8, 375: 8, 369: 8, 363: 8, 360: 8, 355: 8, 354: 8, 349: 8, 348: 8, 324: 8, 319: 8, 315: 8, 295: 8, 290: 8, 271: 8, 932: 7, 829: 7, 771: 7, 680: 7, 676: 7, 649: 7, 624: 7, 606: 7, 598: 7, 563: 7, 558: 7, 543: 7, 523: 7, 512: 7, 493: 7, 492: 7, 491: 7, 484: 7, 476: 7, 473: 7, 467: 7, 465: 7, 459: 7, 448: 7, 447: 7, 446: 7, 440: 7, 439: 7, 431: 7, 409: 7, 407: 7, 401: 7, 397: 7, 395: 7, 388: 7, 374: 7, 365: 7, 364: 7, 350: 7, 337: 7, 327: 7, 296: 7, 293: 7, 235: 7, 820: 6, 708: 6, 697: 6, 686: 6, 678: 6, 675: 6, 617: 6, 610: 6, 601: 6, 597: 6, 589: 6, 587: 6, 570: 6, 568: 6, 567: 6, 566: 6, 564: 6, 549: 6, 539: 6, 536: 6, 530: 6, 524: 6, 521: 6, 509: 6, 506: 6, 488: 6, 485: 6, 479: 6, 470: 6, 460: 6, 456: 6, 441: 6, 432: 6, 430: 6, 426: 6, 424: 6, 423: 6, 422: 6, 403: 6, 
398: 6, 390: 6, 389: 6, 379: 6, 378: 6, 366: 6, 361: 6, 357: 6, 343: 6, 339: 6, 334: 6, 329: 6, 325: 6, 313: 6, 259: 6, 1366: 5, 978: 5, 954: 5, 919: 5, 897: 5, 825: 5, 793: 5, 772: 5, 763: 5, 702: 5, 699: 5, 685: 5, 683: 5, 665: 5, 656: 5, 651: 5, 647: 5, 636: 5, 630: 5, 621: 5, 615: 5, 582: 5, 580: 5, 575: 5, 565: 5, 562: 5, 561: 5, 557: 5, 553: 5, 551: 5, 541: 5, 531: 5, 529: 5, 526: 5, 514: 5, 507: 5, 500: 5, 499: 5, 489: 5, 487: 5, 480: 5, 464: 5, 454: 5, 452: 5, 438: 5, 418: 5, 417: 5, 416: 5, 410: 5, 406: 5, 405: 5, 392: 5, 385: 5, 380: 5, 367: 5, 362: 5, 346: 5, 303: 5, 275: 5, 3606: 4, 1711: 4, 1642: 4, 1612: 4, 1603: 4, 1403: 4, 1320: 4, 1291: 4, 1245: 4, 1240: 4, 1209: 4, 1192: 4, 1168: 4, 1153: 4, 1142: 4, 1115: 4, 1101: 4, 1070: 4, 1059: 4, 1055: 4, 1000: 4, 994: 4, 981: 4, 972: 4, 963: 4, 950: 4, 948: 4, 937: 4, 929: 4, 912: 4, 911: 4, 909: 4, 907: 4, 903: 4, 899: 4, 888: 4, 887: 4, 882: 4, 864: 4, 861: 4, 854: 4, 848: 4, 818: 4, 817: 4, 806: 4, 794: 4, 786: 4, 782: 4, 768: 4, 766: 4, 765: 4, 753: 4, 741: 4, 731: 4, 729: 4, 726: 4, 724: 4, 720: 4, 709: 4, 704: 4, 687: 4, 681: 4, 679: 4, 670: 4, 667: 4, 650: 4, 646: 4, 642: 4, 638: 4, 635: 4, 634: 4, 632: 4, 631: 4, 622: 4, 604: 4, 602: 4, 596: 4, 593: 4, 591: 4, 588: 4, 579: 4, 576: 4, 573: 4, 560: 4, 554: 4, 550: 4, 548: 4, 546: 4, 544: 4, 540: 4, 538: 4, 537: 4, 534: 4, 520: 4, 513: 4, 511: 4, 510: 4, 503: 4, 502: 4, 497: 4, 496: 4, 490: 4, 483: 4, 482: 4, 472: 4, 468: 4, 451: 4, 450: 4, 449: 4, 445: 4, 442: 4, 436: 4, 434: 4, 433: 4, 421: 4, 415: 4, 411: 4, 399: 4, 386: 4, 3443: 3, 3268: 3, 3227: 3, 2651: 3, 2440: 3, 2413: 3, 2240: 3, 2238: 3, 2186: 3, 2089: 3, 2073: 3, 2030: 3, 2006: 3, 2003: 3, 1973: 3, 1920: 3, 1919: 3, 1814: 3, 1811: 3, 1788: 3, 1746: 3, 1707: 3, 1702: 3, 1701: 3, 1671: 3, 1632: 3, 1624: 3, 1620: 3, 1615: 3, 1597: 3, 1584: 3, 1582: 3, 1508: 3, 1470: 3, 1466: 3, 1443: 3, 1425: 3, 1363: 3, 1357: 3, 1330: 3, 1324: 3, 1323: 3, 1319: 3, 1295: 3, 1294: 3, 1285: 3, 1281: 3, 1278: 3, 
1259: 3, 1235: 3, 1227: 3, 1214: 3, 1213: 3, 1201: 3, 1164: 3, 1163: 3, 1158: 3, 1157: 3, 1145: 3, 1105: 3, 1095: 3, 1088: 3, 1065: 3, 1061: 3, 1056: 3, 1035: 3, 1031: 3, 1021: 3, 1009: 3, 998: 3, 992: 3, 959: 3, 958: 3, 957: 3, 943: 3, 938: 3, 906: 3, 904: 3, 893: 3, 889: 3, 884: 3, 879: 3, 874: 3, 868: 3, 867: 3, 863: 3, 862: 3, 859: 3, 857: 3, 852: 3, 849: 3, 846: 3, 845: 3, 841: 3, 836: 3, 814: 3, 805: 3, 800: 3, 792: 3, 789: 3, 787: 3, 783: 3, 781: 3, 780: 3, 779: 3, 776: 3, 756: 3, 755: 3, 750: 3, 749: 3, 748: 3, 747: 3, 745: 3, 743: 3, 742: 3, 736: 3, 734: 3, 727: 3, 723: 3, 719: 3, 717: 3, 715: 3, 711: 3, 707: 3, 705: 3, 694: 3, 692: 3, 691: 3, 689: 3, 682: 3, 668: 3, 664: 3, 662: 3, 661: 3, 657: 3, 655: 3, 648: 3, 645: 3, 639: 3, 633: 3, 627: 3, 626: 3, 619: 3, 618: 3, 616: 3, 613: 3, 609: 3, 608: 3, 607: 3, 605: 3, 600: 3, 592: 3, 584: 3, 581: 3, 578: 3, 574: 3, 571: 3, 559: 3, 555: 3, 552: 3, 532: 3, 527: 3, 525: 3, 522: 3, 516: 3, 515: 3, 495: 3, 494: 3, 481: 3, 471: 3, 469: 3, 466: 3, 461: 3, 429: 3, 420: 3, 419: 3, 394: 3, 368: 3, 359: 3, 352: 3, 320: 3, 13784: 2, 12501: 2, 12156: 2, 10059: 2, 7190: 2, 7084: 2, 6695: 2, 6039: 2, 5985: 2, 5466: 2, 5062: 2, 4901: 2, 4401: 2, 4298: 2, 4153: 2, 3933: 2, 3831: 2, 3704: 2, 3664: 2, 3645: 2, 3574: 2, 3547: 2, 3539: 2, 3496: 2, 3489: 2, 3405: 2, 3274: 2, 3220: 2, 3211: 2, 3203: 2, 3192: 2, 3179: 2, 3093: 2, 3021: 2, 3014: 2, 3011: 2, 2947: 2, 2907: 2, 2892: 2, 2840: 2, 2831: 2, 2769: 2, 2747: 2, 2740: 2, 2733: 2, 2705: 2, 2688: 2, 2686: 2, 2663: 2, 2639: 2, 2617: 2, 2609: 2, 2607: 2, 2585: 2, 2533: 2, 2507: 2, 2464: 2, 2446: 2, 2435: 2, 2430: 2, 2426: 2, 2410: 2, 2393: 2, 2386: 2, 2362: 2, 2350: 2, 2349: 2, 2339: 2, 2314: 2, 2307: 2, 2306: 2, 2270: 2, 2269: 2, 2267: 2, 2256: 2, 2255: 2, 2232: 2, 2212: 2, 2195: 2, 2185: 2, 2166: 2, 2163: 2, 2162: 2, 2155: 2, 2144: 2, 2141: 2, 2128: 2, 2107: 2, 2098: 2, 2097: 2, 2077: 2, 2075: 2, 2068: 2, 2061: 2, 2060: 2, 2031: 2, 2021: 2, 2020: 2, 2005: 2, 1965: 2, 1958: 2, 
1955: 2, 1937: 2, 1934: 2, 1928: 2, 1926: 2, 1917: 2, 1908: 2, 1905: 2, 1890: 2, 1878: 2, 1874: 2, 1872: 2, 1868: 2, 1851: 2, 1848: 2, 1845: 2, 1836: 2, 1835: 2, 1820: 2, 1808: 2, 1807: 2, 1802: 2, 1801: 2, 1797: 2, 1796: 2, 1792: 2, 1779: 2, 1774: 2, 1767: 2, 1744: 2, 1713: 2, 1693: 2, 1684: 2, 1682: 2, 1673: 2, 1653: 2, 1647: 2, 1645: 2, 1644: 2, 1634: 2, 1616: 2, 1609: 2, 1602: 2, 1600: 2, 1583: 2, 1580: 2, 1576: 2, 1569: 2, 1568: 2, 1567: 2, 1563: 2, 1555: 2, 1551: 2, 1550: 2, 1546: 2, 1543: 2, 1537: 2, 1536: 2, 1530: 2, 1521: 2, 1518: 2, 1516: 2, 1515: 2, 1510: 2, 1492: 2, 1482: 2, 1473: 2, 1469: 2, 1461: 2, 1459: 2, 1452: 2, 1450: 2, 1449: 2, 1437: 2, 1436: 2, 1432: 2, 1422: 2, 1404: 2, 1401: 2, 1396: 2, 1394: 2, 1391: 2, 1390: 2, 1383: 2, 1378: 2, 1377: 2, 1376: 2, 1373: 2, 1372: 2, 1367: 2, 1364: 2, 1359: 2, 1355: 2, 1353: 2, 1351: 2, 1345: 2, 1344: 2, 1335: 2, 1329: 2, 1321: 2, 1317: 2, 1316: 2, 1314: 2, 1303: 2, 1301: 2, 1300: 2, 1298: 2, 1289: 2, 1284: 2, 1283: 2, 1277: 2, 1275: 2, 1274: 2, 1273: 2, 1271: 2, 1268: 2, 1265: 2, 1264: 2, 1263: 2, 1262: 2, 1261: 2, 1254: 2, 1251: 2, 1250: 2, 1249: 2, 1241: 2, 1236: 2, 1232: 2, 1231: 2, 1230: 2, 1229: 2, 1228: 2, 1223: 2, 1212: 2, 1200: 2, 1198: 2, 1196: 2, 1195: 2, 1190: 2, 1183: 2, 1182: 2, 1181: 2, 1180: 2, 1179: 2, 1174: 2, 1172: 2, 1171: 2, 1170: 2, 1162: 2, 1159: 2, 1152: 2, 1150: 2, 1138: 2, 1136: 2, 1135: 2, 1132: 2, 1129: 2, 1128: 2, 1125: 2, 1116: 2, 1114: 2, 1112: 2, 1110: 2, 1109: 2, 1104: 2, 1100: 2, 1098: 2, 1097: 2, 1094: 2, 1087: 2, 1086: 2, 1081: 2, 1080: 2, 1073: 2, 1064: 2, 1058: 2, 1057: 2, 1051: 2, 1049: 2, 1048: 2, 1045: 2, 1043: 2, 1042: 2, 1034: 2, 1032: 2, 1030: 2, 1028: 2, 1027: 2, 1025: 2, 1023: 2, 1020: 2, 1019: 2, 1016: 2, 1012: 2, 1005: 2, 996: 2, 995: 2, 993: 2, 991: 2, 989: 2, 988: 2, 986: 2, 983: 2, 979: 2, 976: 2, 975: 2, 967: 2, 962: 2, 961: 2, 960: 2, 956: 2, 955: 2, 952: 2, 951: 2, 946: 2, 945: 2, 944: 2, 939: 2, 934: 2, 933: 2, 931: 2, 928: 2, 920: 2, 917: 2, 915: 2, 905: 
2, 900: 2, 896: 2, 892: 2, 891: 2, 890: 2, 886: 2, 883: 2, 881: 2, 880: 2, 878: 2, 873: 2, 870: 2, 869: 2, 866: 2, 858: 2, 856: 2, 851: 2, 847: 2, 840: 2, 828: 2, 826: 2, 824: 2, 823: 2, 822: 2, 819: 2, 812: 2, 810: 2, 801: 2, 796: 2, 795: 2, 790: 2, 785: 2, 777: 2, 774: 2, 773: 2, 770: 2, 764: 2, 762: 2, 758: 2, 757: 2, 754: 2, 746: 2, 739: 2, 735: 2, 728: 2, 713: 2, 712: 2, 703: 2, 698: 2, 696: 2, 695: 2, 693: 2, 690: 2, 688: 2, 677: 2, 673: 2, 672: 2, 660: 2, 659: 2, 654: 2, 652: 2, 643: 2, 640: 2, 637: 2, 629: 2, 623: 2, 614: 2, 612: 2, 611: 2, 599: 2, 595: 2, 586: 2, 583: 2, 577: 2, 572: 2, 556: 2, 547: 2, 533: 2, 519: 2, 517: 2, 508: 2, 498: 2, 486: 2, 478: 2, 443: 2, 408: 2, 402: 2, 377: 2, 149668: 1, 118498: 1, 82148: 1, 68062: 1, 67485: 1, 67361: 1, 66291: 1, 63301: 1, 63068: 1, 55028: 1, 54851: 1, 49776: 1, 48029: 1, 46759: 1, 46214: 1, 43205: 1, 43134: 1, 42525: 1, 41701: 1, 41060: 1, 40777: 1, 40041: 1, 39895: 1, 38880: 1, 38810: 1, 38030: 1, 36608: 1, 36490: 1, 35928: 1, 33558: 1, 33505: 1, 33412: 1, 33156: 1, 32247: 1, 32038: 1, 31203: 1, 29399: 1, 29211: 1, 27970: 1, 26943: 1, 26410: 1, 26063: 1, 25824: 1, 25568: 1, 25291: 1, 24725: 1, 24441: 1, 24347: 1, 24300: 1, 23880: 1, 23760: 1, 23511: 1, 22778: 1, 22314: 1, 22118: 1, 21931: 1, 21896: 1, 21836: 1, 21558: 1, 21163: 1, 20885: 1, 20211: 1, 20131: 1, 20058: 1, 19498: 1, 19447: 1, 19238: 1, 19060: 1, 18979: 1, 18966: 1, 18924: 1, 18898: 1, 18838: 1, 18834: 1, 18652: 1, 18230: 1, 18194: 1, 18090: 1, 18077: 1, 18013: 1, 17931: 1, 17742: 1, 17664: 1, 17516: 1, 17510: 1, 17438: 1, 17428: 1, 17126: 1, 17074: 1, 16993: 1, 16986: 1, 16923: 1, 16899: 1, 16832: 1, 16571: 1, 16488: 1, 16116: 1, 15995: 1, 15941: 1, 15930: 1, 15795: 1, 15792: 1, 15597: 1, 15506: 1, 15457: 1, 15375: 1, 15278: 1, 15276: 1, 15243: 1, 15095: 1, 15057: 1, 15027: 1, 14967: 1, 14912: 1, 14757: 1, 14688: 1, 14618: 1, 14576: 1, 14074: 1, 14016: 1, 13917: 1, 13769: 1, 13620: 1, 13455: 1, 13405: 1, 13330: 1, 13279: 1, 13229: 1, 13091: 1, 
13022: 1, 13001: 1, 12967: 1, 12889: 1, 12876: 1, 12832: 1, 12791: 1, 12743: 1, 12734: 1, 12688: 1, 12658: 1, 12653: 1, 12632: 1, 12600: 1, 12538: 1, 12527: 1, 12458: 1, 12452: 1, 12448: 1, 12368: 1, 12361: 1, 12353: 1, 12311: 1, 12286: 1, 12275: 1, 12203: 1, 12178: 1, 12097: 1, 12056: 1, 12038: 1, 11957: 1, 11918: 1, 11900: 1, 11894: 1, 11779: 1, 11723: 1, 11688: 1, 11685: 1, 11582: 1, 11581: 1, 11525: 1, 11519: 1, 11303: 1, 11211: 1, 11205: 1, 11029: 1, 10995: 1, 10900: 1, 10894: 1, 10810: 1, 10784: 1, 10783: 1, 10659: 1, 10650: 1, 10614: 1, 10476: 1, 10474: 1, 10360: 1, 10319: 1, 10247: 1, 10171: 1, 10098: 1, 10056: 1, 10033: 1, 9985: 1, 9976: 1, 9954: 1, 9948: 1, 9945: 1, 9922: 1, 9907: 1, 9814: 1, 9802: 1, 9742: 1, 9687: 1, 9612: 1, 9601: 1, 9528: 1, 9522: 1, 9433: 1, 9428: 1, 9423: 1, 9402: 1, 9396: 1, 9364: 1, 9343: 1, 9285: 1, 9266: 1, 9194: 1, 9179: 1, 9178: 1, 9170: 1, 9166: 1, 9160: 1, 9116: 1, 9114: 1, 9081: 1, 9058: 1, 9013: 1, 8893: 1, 8885: 1, 8881: 1, 8873: 1, 8830: 1, 8825: 1, 8817: 1, 8773: 1, 8746: 1, 8720: 1, 8691: 1, 8610: 1, 8523: 1, 8476: 1, 8456: 1, 8440: 1, 8433: 1, 8423: 1, 8415: 1, 8389: 1, 8379: 1, 8378: 1, 8365: 1, 8360: 1, 8312: 1, 8300: 1, 8261: 1, 8239: 1, 8234: 1, 8162: 1, 8142: 1, 8113: 1, 8109: 1, 8060: 1, 8057: 1, 8046: 1, 8045: 1, 8001: 1, 7984: 1, 7981: 1, 7965: 1, 7945: 1, 7940: 1, 7895: 1, 7894: 1, 7874: 1, 7834: 1, 7767: 1, 7713: 1, 7676: 1, 7675: 1, 7666: 1, 7658: 1, 7646: 1, 7612: 1, 7571: 1, 7567: 1, 7565: 1, 7532: 1, 7521: 1, 7514: 1, 7502: 1, 7491: 1, 7362: 1, 7344: 1, 7340: 1, 7325: 1, 7322: 1, 7319: 1, 7306: 1, 7289: 1, 7286: 1, 7283: 1, 7276: 1, 7217: 1, 7196: 1, 7191: 1, 7182: 1, 7179: 1, 7173: 1, 7140: 1, 7121: 1, 7113: 1, 7101: 1, 7098: 1, 7073: 1, 7014: 1, 7005: 1, 6980: 1, 6967: 1, 6938: 1, 6932: 1, 6893: 1, 6891: 1, 6875: 1, 6870: 1, 6859: 1, 6858: 1, 6818: 1, 6815: 1, 6795: 1, 6758: 1, 6741: 1, 6724: 1, 6721: 1, 6720: 1, 6703: 1, 6696: 1, 6682: 1, 6676: 1, 6650: 1, 6633: 1, 6627: 1, 6608: 1, 6584: 1, 6583: 1, 
6557: 1, 6521: 1, 6499: 1, 6474: 1, 6473: 1, 6440: 1, 6438: 1, 6427: 1, 6421: 1, 6406: 1, 6400: 1, 6385: 1, 6373: 1, 6369: 1, 6365: 1, 6354: 1, 6341: 1, 6332: 1, 6324: 1, 6317: 1, 6288: 1, 6279: 1, 6278: 1, 6247: 1, 6239: 1, 6236: 1, 6214: 1, 6211: 1, 6206: 1, 6160: 1, 6147: 1, 6144: 1, 6142: 1, 6128: 1, 6120: 1, 6080: 1, 6048: 1, 6030: 1, 5983: 1, 5925: 1, 5924: 1, 5915: 1, 5905: 1, 5898: 1, 5890: 1, 5887: 1, 5886: 1, 5859: 1, 5855: 1, 5832: 1, 5805: 1, 5794: 1, 5786: 1, 5784: 1, 5770: 1, 5763: 1, 5748: 1, 5742: 1, 5719: 1, 5709: 1, 5704: 1, 5698: 1, 5686: 1, 5662: 1, 5641: 1, 5639: 1, 5635: 1, 5629: 1, 5620: 1, 5596: 1, 5573: 1, 5568: 1, 5564: 1, 5556: 1, 5547: 1, 5546: 1, 5530: 1, 5504: 1, 5478: 1, 5462: 1, 5454: 1, 5445: 1, 5433: 1, 5410: 1, 5397: 1, 5391: 1, 5379: 1, 5360: 1, 5354: 1, 5347: 1, 5335: 1, 5334: 1, 5320: 1, 5315: 1, 5313: 1, 5308: 1, 5307: 1, 5303: 1, 5281: 1, 5271: 1, 5266: 1, 5238: 1, 5233: 1, 5221: 1, 5209: 1, 5206: 1, 5192: 1, 5173: 1, 5158: 1, 5133: 1, 5124: 1, 5122: 1, 5089: 1, 5088: 1, 5082: 1, 5076: 1, 5073: 1, 5061: 1, 5059: 1, 5058: 1, 5040: 1, 5033: 1, 5009: 1, 5005: 1, 4985: 1, 4973: 1, 4962: 1, 4955: 1, 4953: 1, 4949: 1, 4941: 1, 4938: 1, 4930: 1, 4919: 1, 4913: 1, 4899: 1, 4877: 1, 4869: 1, 4862: 1, 4847: 1, 4825: 1, 4817: 1, 4812: 1, 4803: 1, 4798: 1, 4791: 1, 4787: 1, 4778: 1, 4774: 1, 4766: 1, 4761: 1, 4743: 1, 4742: 1, 4739: 1, 4737: 1, 4724: 1, 4710: 1, 4708: 1, 4694: 1, 4686: 1, 4683: 1, 4682: 1, 4671: 1, 4668: 1, 4649: 1, 4640: 1, 4637: 1, 4633: 1, 4625: 1, 4600: 1, 4592: 1, 4559: 1, 4549: 1, 4546: 1, 4543: 1, 4540: 1, 4538: 1, 4515: 1, 4511: 1, 4503: 1, 4494: 1, 4493: 1, 4491: 1, 4490: 1, 4462: 1, 4449: 1, 4440: 1, 4432: 1, 4430: 1, 4428: 1, 4426: 1, 4424: 1, 4410: 1, 4396: 1, 4393: 1, 4390: 1, 4384: 1, 4376: 1, 4359: 1, 4353: 1, 4347: 1, 4344: 1, 4342: 1, 4323: 1, 4304: 1, 4301: 1, 4295: 1, 4293: 1, 4292: 1, 4288: 1, 4277: 1, 4276: 1, 4273: 1, 4248: 1, 4247: 1, 4244: 1, 4242: 1, 4241: 1, 4234: 1, 4227: 1, 4206: 1, 4205: 1, 
4200: 1, 4198: 1, 4196: 1, 4193: 1, 4185: 1, 4176: 1, 4172: 1, 4171: 1, 4167: 1, 4162: 1, 4155: 1, 4151: 1, 4141: 1, 4137: 1, 4134: 1, 4129: 1, 4115: 1, 4113: 1, 4101: 1, 4098: 1, 4095: 1, 4090: 1, 4089: 1, 4087: 1, 4086: 1, 4082: 1, 4079: 1, 4078: 1, 4072: 1, 4065: 1, 4060: 1, 4053: 1, 4043: 1, 4041: 1, 4033: 1, 4029: 1, 4025: 1, 4006: 1, 4004: 1, 3998: 1, 3997: 1, 3995: 1, 3991: 1, 3990: 1, 3981: 1, 3972: 1, 3969: 1, 3965: 1, 3961: 1, 3946: 1, 3942: 1, 3931: 1, 3929: 1, 3919: 1, 3911: 1, 3907: 1, 3902: 1, 3894: 1, 3893: 1, 3889: 1, 3876: 1, 3875: 1, 3861: 1, 3859: 1, 3856: 1, 3847: 1, 3835: 1, 3829: 1, 3826: 1, 3819: 1, 3818: 1, 3813: 1, 3810: 1, 3803: 1, 3802: 1, 3801: 1, 3792: 1, 3775: 1, 3767: 1, 3765: 1, 3755: 1, 3749: 1, 3745: 1, 3742: 1, 3738: 1, 3737: 1, 3729: 1, 3724: 1, 3720: 1, 3709: 1, 3707: 1, 3699: 1, 3692: 1, 3688: 1, 3668: 1, 3663: 1, 3653: 1, 3641: 1, 3640: 1, 3636: 1, 3633: 1, 3630: 1, 3620: 1, 3608: 1, 3598: 1, 3596: 1, 3594: 1, 3580: 1, 3576: 1, 3575: 1, 3559: 1, 3558: 1, 3555: 1, 3554: 1, 3553: 1, 3550: 1, 3536: 1, 3535: 1, 3533: 1, 3523: 1, 3518: 1, 3513: 1, 3511: 1, 3505: 1, 3499: 1, 3497: 1, 3494: 1, 3492: 1, 3488: 1, 3479: 1, 3474: 1, 3470: 1, 3465: 1, 3459: 1, 3456: 1, 3453: 1, 3451: 1, 3450: 1, 3442: 1, 3435: 1, 3426: 1, 3424: 1, 3423: 1, 3413: 1, 3406: 1, 3403: 1, 3402: 1, 3397: 1, 3392: 1, 3389: 1, 3380: 1, 3377: 1, 3373: 1, 3368: 1, 3366: 1, 3365: 1, 3360: 1, 3346: 1, 3342: 1, 3337: 1, 3336: 1, 3335: 1, 3328: 1, 3324: 1, 3322: 1, 3321: 1, 3317: 1, 3313: 1, 3308: 1, 3306: 1, 3303: 1, 3301: 1, 3300: 1, 3299: 1, 3296: 1, 3290: 1, 3282: 1, 3281: 1, 3280: 1, 3275: 1, 3273: 1, 3270: 1, 3265: 1, 3255: 1, 3252: 1, 3241: 1, 3239: 1, 3233: 1, 3225: 1, 3222: 1, 3218: 1, 3215: 1, 3214: 1, 3207: 1, 3201: 1, 3193: 1, 3191: 1, 3181: 1, 3180: 1, 3177: 1, 3165: 1, 3160: 1, 3150: 1, 3148: 1, 3141: 1, 3140: 1, 3132: 1, 3130: 1, 3116: 1, 3105: 1, 3094: 1, 3092: 1, 3084: 1, 3083: 1, 3074: 1, 3066: 1, 3060: 1, 3059: 1, 3057: 1, 3056: 1, 3052: 1, 3050: 1, 
3049: 1, 3044: 1, 3039: 1, 3037: 1, 3032: 1, 3030: 1, 3029: 1, 3026: 1, 3025: 1, 3020: 1, 3019: 1, 2995: 1, 2992: 1, 2985: 1, 2983: 1, 2975: 1, 2970: 1, 2964: 1, 2956: 1, 2953: 1, 2949: 1, 2948: 1, 2945: 1, 2943: 1, 2942: 1, 2941: 1, 2932: 1, 2930: 1, 2911: 1, 2904: 1, 2903: 1, 2902: 1, 2898: 1, 2895: 1, 2888: 1, 2886: 1, 2883: 1, 2877: 1, 2869: 1, 2867: 1, 2857: 1, 2850: 1, 2836: 1, 2834: 1, 2828: 1, 2826: 1, 2823: 1, 2818: 1, 2815: 1, 2814: 1, 2813: 1, 2811: 1, 2804: 1, 2787: 1, 2786: 1, 2782: 1, 2777: 1, 2760: 1, 2756: 1, 2748: 1, 2745: 1, 2727: 1, 2725: 1, 2723: 1, 2721: 1, 2720: 1, 2715: 1, 2713: 1, 2699: 1, 2694: 1, 2692: 1, 2691: 1, 2690: 1, 2683: 1, 2682: 1, 2681: 1, 2679: 1, 2677: 1, 2671: 1, 2668: 1, 2658: 1, 2650: 1, 2643: 1, 2641: 1, 2638: 1, 2637: 1, 2634: 1, 2632: 1, 2630: 1, 2628: 1, 2621: 1, 2619: 1, 2605: 1, 2600: 1, 2596: 1, 2590: 1, 2589: 1, 2588: 1, 2584: 1, 2583: 1, 2581: 1, 2579: 1, 2577: 1, 2573: 1, 2567: 1, 2566: 1, 2563: 1, 2562: 1, 2554: 1, 2550: 1, 2548: 1, 2547: 1, 2546: 1, 2541: 1, 2537: 1, 2526: 1, 2515: 1, 2514: 1, 2511: 1, 2503: 1, 2500: 1, 2499: 1, 2498: 1, 2497: 1, 2495: 1, 2492: 1, 2491: 1, 2490: 1, 2488: 1, 2486: 1, 2482: 1, 2473: 1, 2472: 1, 2469: 1, 2468: 1, 2466: 1, 2463: 1, 2462: 1, 2461: 1, 2458: 1, 2457: 1, 2454: 1, 2453: 1, 2452: 1, 2451: 1, 2438: 1, 2436: 1, 2424: 1, 2423: 1, 2421: 1, 2420: 1, 2408: 1, 2404: 1, 2402: 1, 2398: 1, 2396: 1, 2395: 1, 2391: 1, 2381: 1, 2374: 1, 2372: 1, 2369: 1, 2367: 1, 2366: 1, 2364: 1, 2363: 1, 2358: 1, 2356: 1, 2352: 1, 2351: 1, 2345: 1, 2342: 1, 2341: 1, 2337: 1, 2336: 1, 2334: 1, 2331: 1, 2325: 1, 2323: 1, 2318: 1, 2309: 1, 2300: 1, 2296: 1, 2294: 1, 2292: 1, 2287: 1, 2283: 1, 2279: 1, 2273: 1, 2265: 1, 2264: 1, 2262: 1, 2261: 1, 2242: 1, 2241: 1, 2236: 1, 2227: 1, 2223: 1, 2219: 1, 2218: 1, 2208: 1, 2206: 1, 2205: 1, 2198: 1, 2194: 1, 2183: 1, 2180: 1, 2179: 1, 2178: 1, 2176: 1, 2175: 1, 2172: 1, 2171: 1, 2169: 1, 2167: 1, 2161: 1, 2159: 1, 2148: 1, 2143: 1, 2142: 1, 2140: 1, 2139: 1, 
2129: 1, 2126: 1, 2123: 1, 2119: 1, 2113: 1, 2112: 1, 2109: 1, 2108: 1, 2106: 1, 2105: 1, 2102: 1, 2099: 1, 2095: 1, 2094: 1, 2092: 1, 2091: 1, 2081: 1, 2078: 1, 2076: 1, 2063: 1, 2062: 1, 2058: 1, 2056: 1, 2054: 1, 2052: 1, 2047: 1, 2046: 1, 2045: 1, 2043: 1, 2039: 1, 2024: 1, 2018: 1, 2017: 1, 2015: 1, 2014: 1, 2012: 1, 2009: 1, 1999: 1, 1997: 1, 1990: 1, 1984: 1, 1983: 1, 1980: 1, 1976: 1, 1975: 1, 1969: 1, 1968: 1, 1966: 1, 1959: 1, 1956: 1, 1954: 1, 1952: 1, 1949: 1, 1947: 1, 1946: 1, 1945: 1, 1944: 1, 1942: 1, 1941: 1, 1939: 1, 1935: 1, 1932: 1, 1930: 1, 1929: 1, 1925: 1, 1913: 1, 1907: 1, 1906: 1, 1904: 1, 1895: 1, 1893: 1, 1892: 1, 1884: 1, 1881: 1, 1879: 1, 1877: 1, 1875: 1, 1873: 1, 1870: 1, 1867: 1, 1861: 1, 1860: 1, 1856: 1, 1853: 1, 1849: 1, 1847: 1, 1846: 1, 1840: 1, 1834: 1, 1833: 1, 1829: 1, 1828: 1, 1826: 1, 1825: 1, 1824: 1, 1821: 1, 1819: 1, 1813: 1, 1812: 1, 1805: 1, 1799: 1, 1798: 1, 1795: 1, 1790: 1, 1787: 1, 1786: 1, 1785: 1, 1781: 1, 1780: 1, 1778: 1, 1773: 1, 1771: 1, 1768: 1, 1766: 1, 1765: 1, 1764: 1, 1763: 1, 1760: 1, 1755: 1, 1753: 1, 1749: 1, 1748: 1, 1747: 1, 1743: 1, 1742: 1, 1741: 1, 1740: 1, 1737: 1, 1736: 1, 1734: 1, 1731: 1, 1728: 1, 1727: 1, 1721: 1, 1719: 1, 1718: 1, 1716: 1, 1714: 1, 1710: 1, 1706: 1, 1699: 1, 1692: 1, 1691: 1, 1690: 1, 1687: 1, 1683: 1, 1679: 1, 1677: 1, 1672: 1, 1670: 1, 1668: 1, 1663: 1, 1659: 1, 1658: 1, 1656: 1, 1654: 1, 1651: 1, 1648: 1, 1643: 1, 1641: 1, 1637: 1, 1635: 1, 1631: 1, 1627: 1, 1622: 1, 1614: 1, 1613: 1, 1611: 1, 1610: 1, 1608: 1, 1605: 1, 1604: 1, 1601: 1, 1595: 1, 1594: 1, 1593: 1, 1589: 1, 1587: 1, 1585: 1, 1581: 1, 1577: 1, 1574: 1, 1573: 1, 1572: 1, 1566: 1, 1565: 1, 1562: 1, 1561: 1, 1557: 1, 1554: 1, 1552: 1, 1549: 1, 1548: 1, 1545: 1, 1544: 1, 1542: 1, 1538: 1, 1534: 1, 1533: 1, 1531: 1, 1528: 1, 1526: 1, 1524: 1, 1523: 1, 1517: 1, 1514: 1, 1507: 1, 1506: 1, 1504: 1, 1503: 1, 1502: 1, 1500: 1, 1499: 1, 1498: 1, 1497: 1, 1494: 1, 1491: 1, 1490: 1, 1484: 1, 1480: 1, 1477: 1, 1476: 1, 
1474: 1, 1472: 1, 1467: 1, 1465: 1, 1463: 1, 1462: 1, 1458: 1, 1455: 1, 1453: 1, 1446: 1, 1444: 1, 1442: 1, 1440: 1, 1433: 1, 1430: 1, 1427: 1, 1423: 1, 1420: 1, 1419: 1, 1418: 1, 1417: 1, 1416: 1, 1413: 1, 1412: 1, 1407: 1, 1406: 1, 1402: 1, 1397: 1, 1395: 1, 1393: 1, 1388: 1, 1387: 1, 1386: 1, 1385: 1, 1384: 1, 1382: 1, 1375: 1, 1374: 1, 1371: 1, 1370: 1, 1362: 1, 1361: 1, 1360: 1, 1352: 1, 1350: 1, 1348: 1, 1346: 1, 1339: 1, 1331: 1, 1327: 1, 1326: 1, 1325: 1, 1315: 1, 1313: 1, 1311: 1, 1310: 1, 1309: 1, 1308: 1, 1307: 1, 1302: 1, 1299: 1, 1292: 1, 1290: 1, 1288: 1, 1287: 1, 1286: 1, 1282: 1, 1270: 1, 1266: 1, 1258: 1, 1256: 1, 1248: 1, 1247: 1, 1246: 1, 1244: 1, 1242: 1, 1237: 1, 1234: 1, 1233: 1, 1226: 1, 1225: 1, 1224: 1, 1222: 1, 1221: 1, 1220: 1, 1219: 1, 1218: 1, 1216: 1, 1215: 1, 1208: 1, 1206: 1, 1203: 1, 1202: 1, 1194: 1, 1193: 1, 1189: 1, 1188: 1, 1187: 1, 1184: 1, 1175: 1, 1169: 1, 1167: 1, 1166: 1, 1165: 1, 1160: 1, 1155: 1, 1146: 1, 1144: 1, 1143: 1, 1139: 1, 1137: 1, 1134: 1, 1133: 1, 1131: 1, 1130: 1, 1126: 1, 1121: 1, 1120: 1, 1111: 1, 1108: 1, 1106: 1, 1103: 1, 1102: 1, 1096: 1, 1093: 1, 1092: 1, 1084: 1, 1083: 1, 1082: 1, 1079: 1, 1078: 1, 1077: 1, 1076: 1, 1075: 1, 1074: 1, 1072: 1, 1068: 1, 1067: 1, 1066: 1, 1063: 1, 1062: 1, 1060: 1, 1054: 1, 1052: 1, 1050: 1, 1047: 1, 1046: 1, 1040: 1, 1037: 1, 1033: 1, 1029: 1, 1026: 1, 1024: 1, 1018: 1, 1017: 1, 1015: 1, 1014: 1, 1013: 1, 1010: 1, 1008: 1, 1007: 1, 1004: 1, 1001: 1, 997: 1, 990: 1, 980: 1, 977: 1, 971: 1, 969: 1, 968: 1, 966: 1, 965: 1, 964: 1, 953: 1, 949: 1, 947: 1, 942: 1, 935: 1, 927: 1, 926: 1, 925: 1, 924: 1, 922: 1, 921: 1, 918: 1, 916: 1, 913: 1, 910: 1, 902: 1, 898: 1, 895: 1, 894: 1, 885: 1, 877: 1, 876: 1, 875: 1, 871: 1, 865: 1, 855: 1, 853: 1, 850: 1, 844: 1, 843: 1, 842: 1, 839: 1, 838: 1, 833: 1, 832: 1, 831: 1, 830: 1, 827: 1, 821: 1, 816: 1, 809: 1, 808: 1, 804: 1, 798: 1, 797: 1, 791: 1, 788: 1, 784: 1, 778: 1, 775: 1, 769: 1, 767: 1, 761: 1, 752: 1, 744: 1, 740: 1, 737: 
1, 733: 1, 730: 1, 725: 1, 722: 1, 721: 1, 718: 1, 716: 1, 710: 1, 706: 1, 701: 1, 674: 1, 671: 1, 666: 1, 663: 1, 658: 1, 644: 1, 641: 1, 620: 1, 603: 1, 594: 1, 590: 1, 569: 1, 545: 1, 535: 1, 505: 1, 504: 1, 501: 1, 477: 1})\n"
]
}
],
"source": [
"# Number of words for a given frequency.\n",
"print(Counter(sorted_text_occur))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets build the model with only ***text*** column"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of alpha = 1e-05 The log loss is: 1.41844647730872\n",
"For values of alpha = 0.0001 The log loss is: 1.4032643829006644\n",
"For values of alpha = 0.001 The log loss is: 1.2472606448726806\n",
"For values of alpha = 0.01 The log loss is: 1.373599291953443\n",
"For values of alpha = 0.1 The log loss is: 1.4989877718907707\n",
"For values of alpha = 1 The log loss is: 1.670050102338605\n"
]
}
],
"source": [
"cv_log_error_array=[]\n",
"for i in alpha:\n",
" clf = SGDClassifier(alpha=i, penalty='l2', loss='log', random_state=42)\n",
" clf.fit(train_text_feature_onehotCoding, y_train)\n",
" \n",
" sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
" sig_clf.fit(train_text_feature_onehotCoding, y_train)\n",
" predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)\n",
" cv_log_error_array.append(log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
" print('For values of alpha = ', i, \"The log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, ax = plt.subplots()\n",
"ax.plot(alpha, cv_log_error_array,c='g')\n",
"for i, txt in enumerate(np.round(cv_log_error_array,3)):\n",
" ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))\n",
"plt.grid()\n",
"plt.title(\"Cross Validation Error for each alpha\")\n",
"plt.xlabel(\"Alpha i's\")\n",
"plt.ylabel(\"Error measure\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of best alpha = 0.001 The train log loss is: 0.7930034992633752\n",
"For values of best alpha = 0.001 The cross validation log loss is: 1.2472606448726806\n",
"For values of best alpha = 0.001 The test log loss is: 1.1086428559963621\n"
]
}
],
"source": [
"best_alpha = np.argmin(cv_log_error_array)\n",
"clf = SGDClassifier(alpha=alpha[best_alpha], penalty='l2', loss='log', random_state=42)\n",
"clf.fit(train_text_feature_onehotCoding, y_train)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_text_feature_onehotCoding, y_train)\n",
"\n",
"predict_y = sig_clf.predict_proba(train_text_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The train log loss is:\",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(cv_text_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The cross validation log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(test_text_feature_onehotCoding)\n",
"print('For values of best alpha = ', alpha[best_alpha], \"The test log loss is:\",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets check the overlap of text data"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [],
"source": [
"def get_intersec_text(df):\n",
" df_text_vec = CountVectorizer(min_df=3)\n",
" df_text_fea = df_text_vec.fit_transform(df['TEXT'])\n",
" df_text_features = df_text_vec.get_feature_names()\n",
"\n",
" df_text_fea_counts = df_text_fea.sum(axis=0).A1\n",
" df_text_fea_dict = dict(zip(list(df_text_features),df_text_fea_counts))\n",
" len1 = len(set(df_text_features))\n",
" len2 = len(set(train_text_features) & set(df_text_features))\n",
" return len1,len2"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"98.087 % of word of test data appeared in train data\n",
"97.497 % of word of Cross Validation appeared in train data\n"
]
}
],
"source": [
"len1,len2 = get_intersec_text(test_df)\n",
"print(np.round((len2/len1)*100, 3), \"% of word of test data appeared in train data\")\n",
"len1,len2 = get_intersec_text(cv_df)\n",
"print(np.round((len2/len1)*100, 3), \"% of word of Cross Validation appeared in train data\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, all 3 columns are going to be important.\n",
"\n",
"## Data prepration for Machine Learning models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets create few functions which we will be using later"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [],
"source": [
"def report_log_loss(train_x, train_y, test_x, test_y, clf):\n",
" clf.fit(train_x, train_y)\n",
" sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
" sig_clf.fit(train_x, train_y)\n",
" sig_clf_probs = sig_clf.predict_proba(test_x)\n",
" return log_loss(test_y, sig_clf_probs, eps=1e-15)"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {},
"outputs": [],
"source": [
"# This function plots the confusion matrices given y_i, y_i_hat.\n",
"def plot_confusion_matrix(test_y, predict_y):\n",
" C = confusion_matrix(test_y, predict_y)\n",
" \n",
" A =(((C.T)/(C.sum(axis=1))).T)\n",
" \n",
" B =(C/C.sum(axis=0)) \n",
" labels = [1,2,3,4,5,6,7,8,9]\n",
" # representing A in heatmap format\n",
" print(\"-\"*20, \"Confusion matrix\", \"-\"*20)\n",
" plt.figure(figsize=(20,7))\n",
" sns.heatmap(C, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
" plt.xlabel('Predicted Class')\n",
" plt.ylabel('Original Class')\n",
" plt.show()\n",
"\n",
" print(\"-\"*20, \"Precision matrix (Columm Sum=1)\", \"-\"*20)\n",
" plt.figure(figsize=(20,7))\n",
" sns.heatmap(B, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
" plt.xlabel('Predicted Class')\n",
" plt.ylabel('Original Class')\n",
" plt.show()\n",
" \n",
" # representing B in heatmap format\n",
" print(\"-\"*20, \"Recall matrix (Row sum=1)\", \"-\"*20)\n",
" plt.figure(figsize=(20,7))\n",
" sns.heatmap(A, annot=True, cmap=\"YlGnBu\", fmt=\".3f\", xticklabels=labels, yticklabels=labels)\n",
" plt.xlabel('Predicted Class')\n",
" plt.ylabel('Original Class')\n",
" plt.show()\n",
"\n",
"\n",
"def predict_and_plot_confusion_matrix(train_x, train_y,test_x, test_y, clf):\n",
" clf.fit(train_x, train_y)\n",
" sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
" sig_clf.fit(train_x, train_y)\n",
" pred_y = sig_clf.predict(test_x)\n",
"\n",
" # for calculating log_loss we willl provide the array of probabilities belongs to each class\n",
" print(\"Log loss :\",log_loss(test_y, sig_clf.predict_proba(test_x)))\n",
" # calculating the number of data points that are misclassified\n",
" print(\"Number of mis-classified points :\", np.count_nonzero((pred_y- test_y))/test_y.shape[0])\n",
" plot_confusion_matrix(test_y, pred_y)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [],
"source": [
"# this function will be used just for naive bayes\n",
"# for the given indices, we will print the name of the features\n",
"# and we will check whether the feature present in the test point text or not\n",
"def get_impfeature_names(indices, text, gene, var, no_features):\n",
" gene_count_vec = CountVectorizer()\n",
" var_count_vec = CountVectorizer()\n",
" text_count_vec = CountVectorizer(min_df=3)\n",
" \n",
" gene_vec = gene_count_vec.fit(train_df['Gene'])\n",
" var_vec = var_count_vec.fit(train_df['Variation'])\n",
" text_vec = text_count_vec.fit(train_df['TEXT'])\n",
" \n",
" fea1_len = len(gene_vec.get_feature_names())\n",
" fea2_len = len(var_count_vec.get_feature_names())\n",
" \n",
" word_present = 0\n",
" for i,v in enumerate(indices):\n",
" if (v < fea1_len):\n",
" word = gene_vec.get_feature_names()[v]\n",
" yes_no = True if word == gene else False\n",
" if yes_no:\n",
" word_present += 1\n",
" print(i, \"Gene feature [{}] present in test data point [{}]\".format(word,yes_no))\n",
" elif (v < fea1_len+fea2_len):\n",
" word = var_vec.get_feature_names()[v-(fea1_len)]\n",
" yes_no = True if word == var else False\n",
" if yes_no:\n",
" word_present += 1\n",
" print(i, \"variation feature [{}] present in test data point [{}]\".format(word,yes_no))\n",
" else:\n",
" word = text_vec.get_feature_names()[v-(fea1_len+fea2_len)]\n",
" yes_no = True if word in text.split() else False\n",
" if yes_no:\n",
" word_present += 1\n",
" print(i, \"Text feature [{}] present in test data point [{}]\".format(word,yes_no))\n",
"\n",
" print(\"Out of the top \",no_features,\" features \", word_present, \"are present in query point\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Combining all 3 features together"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [],
"source": [
"# merging gene, variance and text features\n",
"\n",
"train_gene_var_onehotCoding = hstack((train_gene_feature_onehotCoding,train_variation_feature_onehotCoding))\n",
"test_gene_var_onehotCoding = hstack((test_gene_feature_onehotCoding,test_variation_feature_onehotCoding))\n",
"cv_gene_var_onehotCoding = hstack((cv_gene_feature_onehotCoding,cv_variation_feature_onehotCoding))\n",
"\n",
"train_x_onehotCoding = hstack((train_gene_var_onehotCoding, train_text_feature_onehotCoding)).tocsr()\n",
"train_y = np.array(list(train_df['Class']))\n",
"\n",
"test_x_onehotCoding = hstack((test_gene_var_onehotCoding, test_text_feature_onehotCoding)).tocsr()\n",
"test_y = np.array(list(test_df['Class']))\n",
"\n",
"cv_x_onehotCoding = hstack((cv_gene_var_onehotCoding, cv_text_feature_onehotCoding)).tocsr()\n",
"cv_y = np.array(list(cv_df['Class']))\n",
"\n",
"\n",
"train_gene_var_responseCoding = np.hstack((train_gene_feature_responseCoding,train_variation_feature_responseCoding))\n",
"test_gene_var_responseCoding = np.hstack((test_gene_feature_responseCoding,test_variation_feature_responseCoding))\n",
"cv_gene_var_responseCoding = np.hstack((cv_gene_feature_responseCoding,cv_variation_feature_responseCoding))\n",
"\n",
"train_x_responseCoding = np.hstack((train_gene_var_responseCoding, train_text_feature_responseCoding))\n",
"test_x_responseCoding = np.hstack((test_gene_var_responseCoding, test_text_feature_responseCoding))\n",
"cv_x_responseCoding = np.hstack((cv_gene_var_responseCoding, cv_text_feature_responseCoding))\n"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"One hot encoding features :\n",
"(number of data points * number of features) in train data = (2124, 57112)\n",
"(number of data points * number of features) in test data = (665, 57112)\n",
"(number of data points * number of features) in cross validation data = (532, 57112)\n"
]
}
],
"source": [
"print(\"One hot encoding features :\")\n",
"print(\"(number of data points * number of features) in train data = \", train_x_onehotCoding.shape)\n",
"print(\"(number of data points * number of features) in test data = \", test_x_onehotCoding.shape)\n",
"print(\"(number of data points * number of features) in cross validation data =\", cv_x_onehotCoding.shape)"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Response encoding features :\n",
"(number of data points * number of features) in train data = (2124, 27)\n",
"(number of data points * number of features) in test data = (665, 27)\n",
"(number of data points * number of features) in cross validation data = (532, 27)\n"
]
}
],
"source": [
"print(\" Response encoding features :\")\n",
"print(\"(number of data points * number of features) in train data = \", train_x_responseCoding.shape)\n",
"print(\"(number of data points * number of features) in test data = \", test_x_responseCoding.shape)\n",
"print(\"(number of data points * number of features) in cross validation data =\", cv_x_responseCoding.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building Machine Learning model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets start the first model which is most suitable when we have lot of text column data. So, we will start with Naive Bayes."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forest Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model with One hot encoder"
]
},
{
"cell_type": "code",
"execution_count": 105,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"for n_estimators = 100 and max depth = 5\n",
"Log Loss : 1.2572236746892875\n",
"for n_estimators = 100 and max depth = 10\n",
"Log Loss : 1.1861346638575951\n",
"for n_estimators = 200 and max depth = 5\n",
"Log Loss : 1.2488332258871315\n",
"for n_estimators = 200 and max depth = 10\n",
"Log Loss : 1.1781094364470721\n",
"for n_estimators = 500 and max depth = 5\n",
"Log Loss : 1.2432449644078098\n",
"for n_estimators = 500 and max depth = 10\n",
"Log Loss : 1.175471149755794\n",
"for n_estimators = 1000 and max depth = 5\n",
"Log Loss : 1.2439282130693075\n",
"for n_estimators = 1000 and max depth = 10\n",
"Log Loss : 1.174651527267681\n",
"for n_estimators = 2000 and max depth = 5\n",
"Log Loss : 1.242552498227296\n",
"for n_estimators = 2000 and max depth = 10\n",
"Log Loss : 1.1710021382121305\n"
]
}
],
"source": [
"alpha = [100,200,500,1000,2000]\n",
"max_depth = [5, 10]\n",
"cv_log_error_array = []\n",
"for i in alpha:\n",
" for j in max_depth:\n",
" print(\"for n_estimators =\", i,\"and max depth = \", j)\n",
" clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, random_state=42, n_jobs=-1)\n",
" clf.fit(train_x_onehotCoding, train_y)\n",
" sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
" sig_clf.fit(train_x_onehotCoding, train_y)\n",
" sig_clf_probs = sig_clf.predict_proba(cv_x_onehotCoding)\n",
" cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))\n",
" print(\"Log Loss :\",log_loss(cv_y, sig_clf_probs)) \n"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"For values of best estimator = 2000 The train log loss is: 0.6919423088243477\n",
"For values of best estimator = 2000 The cross validation log loss is: 1.1710021382121305\n",
"For values of best estimator = 2000 The test log loss is: 1.0918848509484709\n"
]
}
],
"source": [
"best_alpha = np.argmin(cv_log_error_array)\n",
"clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)\n",
"clf.fit(train_x_onehotCoding, train_y)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_x_onehotCoding, train_y)\n",
"\n",
"predict_y = sig_clf.predict_proba(train_x_onehotCoding)\n",
"print('For values of best estimator = ', alpha[int(best_alpha/2)], \"The train log loss is:\",log_loss(y_train, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(cv_x_onehotCoding)\n",
"print('For values of best estimator = ', alpha[int(best_alpha/2)], \"The cross validation log loss is:\",log_loss(y_cv, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(test_x_onehotCoding)\n",
"print('For values of best estimator = ', alpha[int(best_alpha/2)], \"The test log loss is:\",log_loss(y_test, predict_y, labels=clf.classes_, eps=1e-15))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets test it on testing data using best hyper param"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Log loss : 1.1710021382121305\n",
"Number of mis-classified points : 0.39285714285714285\n",
"-------------------- Confusion matrix --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------------- Precision matrix (Columm Sum=1) --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------------- Recall matrix (Row sum=1) --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)\n",
"predict_and_plot_confusion_matrix(train_x_onehotCoding, train_y,cv_x_onehotCoding,cv_y, clf)"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted Class : 4\n",
"Predicted Class Probabilities: [[0.2876 0.0616 0.0163 0.4507 0.0513 0.0497 0.0647 0.0107 0.0074]]\n",
"Actual Class : 4\n",
"--------------------------------------------------\n",
"3 Text feature [inhibitors] present in test data point [True]\n",
"4 Text feature [activation] present in test data point [True]\n",
"6 Text feature [missense] present in test data point [True]\n",
"8 Text feature [suppressor] present in test data point [True]\n",
"12 Text feature [nonsense] present in test data point [True]\n",
"15 Text feature [function] present in test data point [True]\n",
"16 Text feature [growth] present in test data point [True]\n",
"20 Text feature [functional] present in test data point [True]\n",
"23 Text feature [loss] present in test data point [True]\n",
"24 Text feature [cells] present in test data point [True]\n",
"26 Text feature [downstream] present in test data point [True]\n",
"31 Text feature [patients] present in test data point [True]\n",
"35 Text feature [protein] present in test data point [True]\n",
"37 Text feature [variants] present in test data point [True]\n",
"39 Text feature [frameshift] present in test data point [True]\n",
"53 Text feature [clinical] present in test data point [True]\n",
"54 Text feature [transformation] present in test data point [True]\n",
"63 Text feature [cell] present in test data point [True]\n",
"70 Text feature [deleterious] present in test data point [True]\n",
"79 Text feature [ligand] present in test data point [True]\n",
"81 Text feature [lines] present in test data point [True]\n",
"82 Text feature [patient] present in test data point [True]\n",
"85 Text feature [functions] present in test data point [True]\n",
"86 Text feature [likelihood] present in test data point [True]\n",
"87 Text feature [proteins] present in test data point [True]\n",
"88 Text feature [expression] present in test data point [True]\n",
"94 Text feature [assays] present in test data point [True]\n",
"95 Text feature [nuclear] present in test data point [True]\n",
"99 Text feature [amplification] present in test data point [True]\n",
"Out of the top 100 features 29 are present in query point\n"
]
}
],
"source": [
"# test_point_index = 10\n",
"clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/2)], criterion='gini', max_depth=max_depth[int(best_alpha%2)], random_state=42, n_jobs=-1)\n",
"clf.fit(train_x_onehotCoding, train_y)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_x_onehotCoding, train_y)\n",
"\n",
"test_point_index = 1\n",
"no_feature = 100\n",
"predicted_cls = sig_clf.predict(test_x_onehotCoding[test_point_index])\n",
"print(\"Predicted Class :\", predicted_cls[0])\n",
"print(\"Predicted Class Probabilities:\", np.round(sig_clf.predict_proba(test_x_onehotCoding[test_point_index]),4))\n",
"print(\"Actual Class :\", test_y[test_point_index])\n",
"indices = np.argsort(-clf.feature_importances_)\n",
"print(\"-\"*50)\n",
"get_impfeature_names(indices[:no_feature], test_df['TEXT'].iloc[test_point_index],test_df['Gene'].iloc[test_point_index],test_df['Variation'].iloc[test_point_index], no_feature)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RF with Response Coding"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"for n_estimators = 10 and max depth = 2\n",
"Log Loss : 2.3303043307247706\n",
"for n_estimators = 10 and max depth = 3\n",
"Log Loss : 1.759233082123313\n",
"for n_estimators = 10 and max depth = 5\n",
"Log Loss : 1.7222729656680074\n",
"for n_estimators = 10 and max depth = 10\n",
"Log Loss : 2.1879986740331585\n",
"for n_estimators = 50 and max depth = 2\n",
"Log Loss : 1.8102751305264946\n",
"for n_estimators = 50 and max depth = 3\n",
"Log Loss : 1.472457555108808\n",
"for n_estimators = 50 and max depth = 5\n",
"Log Loss : 1.3606410245246492\n",
"for n_estimators = 50 and max depth = 10\n",
"Log Loss : 1.735251061454875\n",
"for n_estimators = 100 and max depth = 2\n",
"Log Loss : 1.6397489083700045\n",
"for n_estimators = 100 and max depth = 3\n",
"Log Loss : 1.5150260265978719\n",
"for n_estimators = 100 and max depth = 5\n",
"Log Loss : 1.2914915334146437\n",
"for n_estimators = 100 and max depth = 10\n",
"Log Loss : 1.7844750814929151\n",
"for n_estimators = 200 and max depth = 2\n",
"Log Loss : 1.6697036744271037\n",
"for n_estimators = 200 and max depth = 3\n",
"Log Loss : 1.5108466525843012\n",
"for n_estimators = 200 and max depth = 5\n",
"Log Loss : 1.3845537584276042\n",
"for n_estimators = 200 and max depth = 10\n",
"Log Loss : 1.7551355678052738\n",
"for n_estimators = 500 and max depth = 2\n",
"Log Loss : 1.7301101308042508\n",
"for n_estimators = 500 and max depth = 3\n",
"Log Loss : 1.623695643294688\n",
"for n_estimators = 500 and max depth = 5\n",
"Log Loss : 1.3931988161051954\n",
"for n_estimators = 500 and max depth = 10\n",
"Log Loss : 1.7923450427705567\n",
"for n_estimators = 1000 and max depth = 2\n",
"Log Loss : 1.719548851364818\n",
"for n_estimators = 1000 and max depth = 3\n",
"Log Loss : 1.6161705513069253\n",
"for n_estimators = 1000 and max depth = 5\n",
"Log Loss : 1.3833366060183812\n",
"for n_estimators = 1000 and max depth = 10\n",
"Log Loss : 1.7841833756180365\n",
"For values of best alpha = 100 The train log loss is: 0.054288718396378756\n",
"For values of best alpha = 100 The cross validation log loss is: 1.2914915334146437\n",
"For values of best alpha = 100 The test log loss is: 1.2489713410389691\n"
]
}
],
"source": [
"# Hyperparameter grid; `alpha` holds the candidate n_estimators values\n",
"alpha = [10,50,100,200,500,1000]\n",
"max_depth = [2,3,5,10]\n",
"cv_log_error_array = []\n",
"for i in alpha:\n",
"    for j in max_depth:\n",
"        print(\"for n_estimators =\", i,\"and max depth = \", j)\n",
"        clf = RandomForestClassifier(n_estimators=i, criterion='gini', max_depth=j, random_state=42, n_jobs=-1)\n",
"        clf.fit(train_x_responseCoding, train_y)\n",
"        # Calibrate the forest's class probabilities with sigmoid (Platt) scaling\n",
"        sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"        sig_clf.fit(train_x_responseCoding, train_y)\n",
"        sig_clf_probs = sig_clf.predict_proba(cv_x_responseCoding)\n",
"        cv_log_error_array.append(log_loss(cv_y, sig_clf_probs, labels=clf.classes_, eps=1e-15))\n",
"        print(\"Log Loss :\", cv_log_error_array[-1])\n",
"\n",
"# Flat index of the best (n_estimators, max_depth) pair; the inner loop runs over max_depth\n",
"best_alpha = np.argmin(cv_log_error_array)\n",
"best_n_estimators = alpha[best_alpha // len(max_depth)]\n",
"best_max_depth = max_depth[best_alpha % len(max_depth)]\n",
"clf = RandomForestClassifier(n_estimators=best_n_estimators, criterion='gini', max_depth=best_max_depth, random_state=42, n_jobs=-1)\n",
"clf.fit(train_x_responseCoding, train_y)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_x_responseCoding, train_y)\n",
"\n",
"# Report train, cross-validation and test log loss for the selected model\n",
"predict_y = sig_clf.predict_proba(train_x_responseCoding)\n",
"print('For values of best alpha = ', best_n_estimators, \"The train log loss is:\",log_loss(train_y, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(cv_x_responseCoding)\n",
"print('For values of best alpha = ', best_n_estimators, \"The cross validation log loss is:\",log_loss(cv_y, predict_y, labels=clf.classes_, eps=1e-15))\n",
"predict_y = sig_clf.predict_proba(test_x_responseCoding)\n",
"print('For values of best alpha = ', best_n_estimators, \"The test log loss is:\",log_loss(test_y, predict_y, labels=clf.classes_, eps=1e-15))"
]
},
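{
"cell_type": "markdown",
"metadata": {},
"source": [
"The grid search above appends one cross-validation log loss per `(n_estimators, max_depth)` pair to a flat list and then recovers the pair by dividing the best index by 4 (the length of `max_depth`) and taking the remainder. A minimal sketch of the same decoding with `np.unravel_index`, which avoids the hard-coded 4, is given below. The loss values are the ones printed above, rounded to four decimals; the names `n_estimators_grid`, `max_depth_grid`, `cv_losses`, `row` and `col` are illustrative and not part of the notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"n_estimators_grid = [10, 50, 100, 200, 500, 1000]\n",
"max_depth_grid = [2, 3, 5, 10]\n",
"\n",
"# CV log losses in loop order: outer loop over n_estimators, inner loop over max_depth\n",
"cv_losses = np.array([\n",
"    2.3303, 1.7592, 1.7223, 2.1880,\n",
"    1.8103, 1.4725, 1.3606, 1.7353,\n",
"    1.6397, 1.5150, 1.2915, 1.7845,\n",
"    1.6697, 1.5108, 1.3846, 1.7551,\n",
"    1.7301, 1.6237, 1.3932, 1.7923,\n",
"    1.7195, 1.6162, 1.3833, 1.7842,\n",
"])\n",
"\n",
"# Convert the flat argmin into (row, col) coordinates of the 6 x 4 grid\n",
"grid_shape = (len(n_estimators_grid), len(max_depth_grid))\n",
"row, col = np.unravel_index(np.argmin(cv_losses), grid_shape)\n",
"best_n_estimators = n_estimators_grid[row]\n",
"best_max_depth = max_depth_grid[col]\n",
"print(best_n_estimators, best_max_depth)  # 100 5\n",
"```\n",
"\n",
"The decoded pair agrees with the `n_estimators = 100` reported as best above; the matching depth is 5."
]
},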
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Testing the model with the best hyperparameters"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Log loss : 1.2914915334146437\n",
"Number of mis-classified points : 0.4755639097744361\n",
"-------------------- Confusion matrix --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------------- Precision matrix (Columm Sum=1) --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"-------------------- Recall matrix (Row sum=1) --------------------\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 1440x504 with 2 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"clf = RandomForestClassifier(max_depth=max_depth[int(best_alpha%4)], n_estimators=alpha[int(best_alpha/4)], criterion='gini', max_features='sqrt',random_state=42)\n",
"predict_and_plot_confusion_matrix(train_x_responseCoding, train_y,cv_x_responseCoding,cv_y, clf)"
]
},
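{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three heatmaps above show the raw confusion matrix, a precision matrix whose columns sum to 1, and a recall matrix whose rows sum to 1. The notebook's `predict_and_plot_confusion_matrix` helper is defined earlier and is not reproduced here; the sketch below only illustrates the two normalizations on a small made-up 3-class example. The labels, the helper name `confusion_counts` and the NumPy-only construction are assumptions for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def confusion_counts(y_true, y_pred, n_classes):\n",
"    # C[i, j] = number of points with true class i that were predicted as class j\n",
"    C = np.zeros((n_classes, n_classes), dtype=int)\n",
"    np.add.at(C, (y_true, y_pred), 1)\n",
"    return C\n",
"\n",
"# Made-up labels for illustration only (classes 0, 1, 2)\n",
"y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2])\n",
"y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])\n",
"\n",
"C = confusion_counts(y_true, y_pred, n_classes=3)\n",
"precision_matrix = C / C.sum(axis=0, keepdims=True)  # divide by column totals\n",
"recall_matrix = C / C.sum(axis=1, keepdims=True)     # divide by row totals\n",
"\n",
"print(C)\n",
"print(np.round(precision_matrix, 2))\n",
"print(np.round(recall_matrix, 2))\n",
"```\n",
"\n",
"Note that a class that is never predicted has a zero column total, so its precision column is undefined; the toy labels above avoid that case."
]
},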
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Query a classified test point"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted Class : 4\n",
"Predicted Class Probabilities: [[0.2367 0.0189 0.0993 0.4894 0.0333 0.0447 0.0085 0.0305 0.0385]]\n",
"Actual Class : 4\n",
"--------------------------------------------------\n",
"Variation is important feature\n",
"Variation is important feature\n",
"Variation is important feature\n",
"Variation is important feature\n",
"Variation is important feature\n",
"Gene is important feature\n",
"Variation is important feature\n",
"Text is important feature\n",
"Text is important feature\n",
"Gene is important feature\n",
"Text is important feature\n",
"Text is important feature\n",
"Text is important feature\n",
"Gene is important feature\n",
"Variation is important feature\n",
"Gene is important feature\n",
"Text is important feature\n",
"Gene is important feature\n",
"Variation is important feature\n",
"Gene is important feature\n",
"Text is important feature\n",
"Text is important feature\n",
"Variation is important feature\n",
"Text is important feature\n",
"Gene is important feature\n",
"Gene is important feature\n",
"Gene is important feature\n"
]
}
],
"source": [
"clf = RandomForestClassifier(n_estimators=alpha[int(best_alpha/4)], criterion='gini', max_depth=max_depth[int(best_alpha%4)], random_state=42, n_jobs=-1)\n",
"clf.fit(train_x_responseCoding, train_y)\n",
"sig_clf = CalibratedClassifierCV(clf, method=\"sigmoid\")\n",
"sig_clf.fit(train_x_responseCoding, train_y)\n",
"\n",
"\n",
"test_point_index = 1\n",
"no_feature = 27\n",
"# Reshape the single test row to (1, n_features) before predicting\n",
"predicted_cls = sig_clf.predict(test_x_responseCoding[test_point_index].reshape(1,-1))\n",
"print(\"Predicted Class :\", predicted_cls[0])\n",
"print(\"Predicted Class Probabilities:\", np.round(sig_clf.predict_proba(test_x_responseCoding[test_point_index].reshape(1,-1)),4))\n",
"print(\"Actual Class :\", test_y[test_point_index])\n",
"# Feature indices sorted by decreasing importance\n",
"indices = np.argsort(-clf.feature_importances_)\n",
"print(\"-\"*50)\n",
"# Columns 0-8 come from Gene, 9-17 from Variation, 18-26 from Text\n",
"for i in indices[:no_feature]:\n",
"    if i<9:\n",
"        print(\"Gene is important feature\")\n",
"    elif i<18:\n",
"        print(\"Variation is important feature\")\n",
"    else:\n",
"        print(\"Text is important feature\")"
]
}
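,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above attributes each response-coded column to its source by index: columns 0-8 come from Gene, 9-17 from Variation and 18-26 from Text, that is, nine class-probability columns per source. A minimal sketch of the same mapping as a small function, restricted to the top `k` features, follows. `feature_source`, `top_k` and the random importances are hypothetical stand-ins; in the notebook the importances would come from `clf.feature_importances_`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"N_CLASSES = 9  # response coding yields one column per class for each source\n",
"SOURCES = ('Gene', 'Variation', 'Text')  # column order assumed from the loop above\n",
"\n",
"def feature_source(index, n_classes=N_CLASSES):\n",
"    # Map a column index of the stacked response-coded matrix to its source\n",
"    if not 0 <= index < n_classes * len(SOURCES):\n",
"        raise ValueError('feature index out of range')\n",
"    return SOURCES[index // n_classes]\n",
"\n",
"# Stand-in importances; replace with clf.feature_importances_ in the notebook\n",
"rng = np.random.default_rng(0)\n",
"importances = rng.random(N_CLASSES * len(SOURCES))\n",
"\n",
"top_k = 5\n",
"for rank, idx in enumerate(np.argsort(-importances)[:top_k], start=1):\n",
"    print(rank, idx, feature_source(idx))\n",
"```\n",
"\n",
"Deriving the source from `index // n_classes` keeps the mapping correct if the number of classes changes, whereas the thresholds 9 and 18 would need to be edited by hand."
]
}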
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}