{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER)\n",
"## <font style=\"font-weight:bold;color:gray\">Basic Models</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"[1. Initialise](#1.-Initialise)\n",
"<br/>\n",
"[2. Generate Features](#2.-Generate-Features)\n",
"<br/>\n",
"[3. Read Data](#3.-Read-Data)\n",
"<br/>\n",
"[4. Filter Features](#4.-Filter-Features)\n",
"<br/>\n",
"[5. Set Samples & Target Features](#5.-Set-Samples-&-Target-Features)\n",
"<br/>\n",
"[6. Recategorise & Transform](#6.-Recategorise-&-Transform)\n",
"<br/>\n",
"[7. Rank & Select Features](#7.-Rank-&-Select-Features)\n",
"<br/>\n",
"[8. Model](#8.-Model)\n",
"<br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This Jupyter Notebook applies the Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER) approach.\n",
"\n",
"It extracts aggregated features from the MySQL database, then pre-processes them and configures & applies several modelling approaches.\n",
"\n",
"The pre-processing framework & modelling algorithms in this Notebook were developed as part of the Integrated Care project at the <a href=\"http://www.healthcareanalytics.co.uk/\">Health & Social Care Modelling Group (HSCMG)</a>, the <a href=\"http://www.westminster.ac.uk\">University of Westminster</a>.\n",
"\n",
"Note that some of the scripts are optional or subject to pre-configuration. Please refer to the comments & the project documentation for further details."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<hr/>\n",
"<font size=\"1\" color=\"gray\">Copyright 2017 The Project Authors. All Rights Reserved.\n",
"\n",
"It is licensed under the Apache License, Version 2.0; you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
"\n",
" <a href=\"http://www.apache.org/licenses/LICENSE-2.0\">http://www.apache.org/licenses/LICENSE-2.0</a>\n",
"\n",
"Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.</font>\n",
"<hr/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 1. Initialise"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Reload modules"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Reload modules \n",
"# This is an optional step. It is useful when external Python modules are being modified\n",
"# It reloads all modules (except those excluded by %aimport) every time before executing the typed Python code\n",
"# Note: it may conflict with serialisation when external modules are being modified\n",
"\n",
"# %load_ext autoreload \n",
"# %autoreload 2"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Import libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Import Python libraries\n",
"import logging\n",
"import os\n",
"import sys\n",
"import gc\n",
"import pandas as pd\n",
"from IPython.display import display, HTML\n",
"from collections import OrderedDict\n",
"import numpy as np\n",
"import statistics\n",
"from scipy import stats"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Import local Python modules\n",
"from Configs.CONSTANTS import CONSTANTS\n",
"from Configs.Logger import Logger\n",
"from Features.Variables import Variables\n",
"from ReadersWriters.ReadersWriters import ReadersWriters\n",
"from Stats.PreProcess import PreProcess\n",
"from Stats.FeatureSelection import FeatureSelection\n",
"from Stats.TrainingMethod import TrainingMethod\n",
"from Stats.Plots import Plots"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Check the interpreter\n",
"print(\"\\nMake sure the correct Python interpreter is used!\")\n",
"print(sys.version)\n",
"print(\"\\nMake sure sys.path of the Python interpreter is correct!\")\n",
"print(os.getcwd())"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 1.1. Initialise General Settings"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Main configuration Settings: </font>\n",
"- Specify the full path of the configuration file \n",
"<br/>	 → config_path\n",
"- Specify the full path of the output folder \n",
"<br/>	 → io_path\n",
"- Specify the application name (the suffix of the outputs file name) \n",
"<br/>	 → app_name\n",
"- Specify the sub-model name, to locate the related feature configuration, based on the \"Table_Reference_Name\" column in the configuration file\n",
"<br/>	 → submodel_name\n",
"- Specify the file name of the sub-model's input (excluding the CSV extension)\n",
"<br/>	 → submodel_input_name\n",
"<br/>\n",
"<br/>\n",
"\n",
"<font style=\"font-weight:bold;color:red\">External Configuration Files: </font>\n",
"- The MySQL database configuration settings & other configuration metadata\n",
"<br/>	 → <i>Inputs/CONFIGURATIONS_1.ini</i>\n",
"- The input features' configuration file (Note: only the CSV export of the XLSX will be used by this Notebook)\n",
"<br/>	 → <i>Inputs/config_features_path.xlsx</i>\n",
"<br/>	 → <i>Inputs/config_features_path.csv</i>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"config_path = os.path.abspath(\"ConfigInputs/CONFIGURATIONS.ini\")\n",
"io_path = os.path.abspath(\"../../tmp/TCARER/Basic_prototype\")\n",
"app_name = \"T-CARER\"\n",
"submodel_name = \"hesIp\"\n",
"submodel_input_name = \"tcarer_model_features_ip\"\n",
"\n",
"print(\"\\n The full path of the configuration file: \\n\\t\", config_path,\n",
" \"\\n The full path of the output folder: \\n\\t\", io_path,\n",
" \"\\n The application name (the suffix of the outputs file name): \\n\\t\", app_name,\n",
" \"\\n The sub-model name, to locate the related feature configuration: \\n\\t\", submodel_name,\n",
" \"\\n The file name of the sub-model's input: \\n\\t\", submodel_input_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Initialise logs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"if not os.path.exists(io_path):\n",
" os.makedirs(io_path, exist_ok=True)\n",
"\n",
"logger = Logger(path=io_path, app_name=app_name, ext=\"log\")\n",
"logger = logging.getLogger(app_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Initialise constants and some of the classes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"# Initialise constants \n",
"CONSTANTS.set(io_path, app_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Initialise other classes\n",
"readers_writers = ReadersWriters()\n",
"preprocess = PreProcess(io_path)\n",
"feature_selection = FeatureSelection()\n",
"plts = Plots()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Set print settings\n",
"pd.set_option('display.width', 1600, 'display.max_colwidth', 800)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 1.2. Initialise Features Metadata"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Read the input features' configuration file & store the features metadata"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"# variables settings\n",
"features_metadata = dict()\n",
"\n",
"features_metadata_all = readers_writers.load_csv(path=CONSTANTS.io_path, title=CONSTANTS.config_features_path, dataframing=True)\n",
"features_metadata = features_metadata_all.loc[(features_metadata_all[\"Selected\"] == 1) & \n",
" (features_metadata_all[\"Table_Reference_Name\"] == submodel_name)]\n",
"features_metadata = features_metadata.reset_index(drop=True)\n",
" \n",
"# print\n",
"display(features_metadata)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Set input features' metadata dictionaries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Dictionary of features types, dtypes, & max-states\n",
"features_types = dict()\n",
"features_dtypes = dict()\n",
"features_states_values = dict()\n",
"features_names_group = dict()\n",
"\n",
"for _, row in features_metadata.iterrows():\n",
" if not pd.isnull(row[\"Variable_Max_States\"]):\n",
" states_values = str(row[\"Variable_Max_States\"]).split(',') \n",
" states_values = list(map(int, states_values))\n",
" else: \n",
" states_values = None\n",
" \n",
" if not pd.isnull(row[\"Variable_Aggregation\"]):\n",
" postfixes = row[\"Variable_Aggregation\"].replace(' ', '').split(',')\n",
" f_types = row[\"Variable_Type\"].replace(' ', '').split(',')\n",
" f_dtypes = row[\"Variable_dType\"].replace(' ', '').split(',')\n",
" for p in range(len(postfixes)):\n",
" features_types[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = f_types[p]\n",
" features_dtypes[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = pd.Series(dtype=f_dtypes[p])\n",
" features_states_values[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = states_values\n",
" features_names_group[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = row[\"Variable_Name\"] + \"_\" + postfixes[p]\n",
" else:\n",
" features_types[row[\"Variable_Name\"]] = row[\"Variable_Type\"]\n",
" features_dtypes[row[\"Variable_Name\"]] = row[\"Variable_dType\"]\n",
" features_states_values[row[\"Variable_Name\"]] = states_values\n",
" features_names_group[row[\"Variable_Name\"]] = row[\"Variable_Name\"]\n",
" if states_values is not None:\n",
" for postfix in states_values:\n",
" features_names_group[row[\"Variable_Name\"] + \"_\" + str(postfix)] = row[\"Variable_Name\"]\n",
" \n",
"features_dtypes = pd.DataFrame(features_dtypes).dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Dictionary of features groups\n",
"features_types_group = OrderedDict()\n",
"\n",
"f_types = set([f_type for f_type in features_types.values()])\n",
"features_types_group = OrderedDict(zip(list(f_types), [set() for _ in range(len(f_types))]))\n",
"for f_name, f_type in features_types.items():\n",
" features_types_group[f_type].add(f_name)\n",
" \n",
"print(\"Available features types: \" + ','.join(f_types))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## <font style=\"font-weight:bold;color:red\">2. Generate Features</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Notes:</font>\n",
"- It generates the final spell-wise & temporal features from the MySQL table(s), & converts them into CSV(s);\n",
"- It generates the CSV(s) based on the configuration file of the features (Note: only the CSV export of the XLSX will be used by this Notebook)\n",
"<br/>	 → <i>Inputs/config_features_path.xlsx</i>\n",
"<br/>	 → <i>Inputs/config_features_path.csv</i>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"skip = True\n",
"\n",
"# settings\n",
"csv_schema = [\"my_db_schema\"]\n",
"csv_input_tables = [\"tcarer_features\"]\n",
"csv_history_tables = [\"hesIp\"]\n",
"csv_column_index = \"localID\"\n",
"csv_output_table = \"tcarer_model_features_ip\"\n",
"csv_query_batch_size = 100000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"if skip is False:\n",
" # generate the csv file\n",
" variables = Variables(submodel_name,\n",
" CONSTANTS.io_path,\n",
" CONSTANTS.io_path,\n",
" CONSTANTS.config_features_path,\n",
" csv_output_table)\n",
" variables.set(csv_schema, csv_input_tables, csv_history_tables, csv_column_index, csv_query_batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 3. Read Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Read the input features from the CSV input file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"features_input = readers_writers.load_csv(path=CONSTANTS.io_path, title=submodel_input_name, dataframing=True)\n",
"features_input = features_input.astype(dtype=features_dtypes)\n",
"\n",
"print(\"Number of columns: \", len(features_input.columns), \"; Total records: \", len(features_input.index))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(features_input.head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"## 4. Filter Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 4.1. Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Produce a descriptive statistics report of 'Categorical', 'Continuous', & 'TARGET' features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"file_name = \"Step_04_Data_ColumnNames\"\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, data=list(features_input.columns.values), append=False)\n",
"file_name = \"Step_04_Stats_Categorical\"\n",
"o_stats = preprocess.stats_discrete_df(df=features_input, includes=features_types_group[\"CATEGORICAL\"],\n",
" file_name=file_name)\n",
"file_name = \"Step_04_Stats_Continuous\"\n",
"o_stats = preprocess.stats_continuous_df(df=features_input, includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)\n",
"file_name = \"Step_04_Stats_Target\"\n",
"o_stats = preprocess.stats_discrete_df(df=features_input, includes=features_types_group[\"TARGET\"], \n",
" file_name=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 4.2. Selected Population"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 4.2.1. Remove Excluded Population, Remove Unused Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<i>Nothing to do!</i> \n",
"<br/>\n",
"<font style=\"font-weight:bold;color:red\">Notes: </font> \n",
"- Ideally, the features must be configured before generating the CSV feature file, as it is very inefficient to derive new features at this stage\n",
"- This step is not necessary if all the features are generated prior to the generation of the CSV feature file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Exclusion of unused features\n",
"# excluded = [name for name in features_input.columns if name not in features_names_group.keys()]\n",
"# features_input = features_input.drop(excluded, axis=1)\n",
"\n",
"# print(\"Number of columns: \", len(features_input.columns), \"; Total records: \", len(features_input.index))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 5. Set Samples & Target Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"### 5.1. Set Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 5.1.1. Train & Test Samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Set the samples"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"frac_train = 0.50\n",
"replace = False\n",
"random_state = 100\n",
"\n",
"nrows = len(features_input.index)\n",
"features = {\"train\": dict(), \"test\": dict()}\n",
"features[\"train\"] = features_input.sample(frac=frac_train, replace=replace, random_state=random_state)\n",
"features[\"test\"] = features_input.drop(features[\"train\"].index)\n",
"\n",
"features[\"train\"] = features[\"train\"].reset_index(drop=True)\n",
"features[\"test\"] = features[\"test\"].reset_index(drop=True)"
]
},
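{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A quick sanity check, sketched on a toy frame (not the project data): `sample(frac=0.5, replace=False)` followed by `drop` on the sampled index yields a disjoint 50/50 partition."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Sketch: sample(frac)/drop(index) yield a disjoint partition (toy data, not the project frames)\n",
"_toy = pd.DataFrame({\"x\": range(10)})\n",
"_tr = _toy.sample(frac=0.5, replace=False, random_state=100)\n",
"_te = _toy.drop(_tr.index)\n",
"assert len(_tr) + len(_te) == len(_toy)\n",
"assert set(_tr.index).isdisjoint(_te.index)"
]
},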
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(features_input.head())"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"features_input = None\n",
"gc.collect()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 5.1.2. Independent & Target Variables"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Set independent, target & ID features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"target_labels = list(features_types_group[\"TARGET\"])\n",
"target_id = [\"patientID\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"features[\"train_indep\"] = dict()\n",
"features[\"train_target\"] = dict()\n",
"features[\"train_id\"] = dict()\n",
"features[\"test_indep\"] = dict()\n",
"features[\"test_target\"] = dict()\n",
"features[\"test_id\"] = dict()\n",
"\n",
"# Independent and target features\n",
"def set_features_indep_target(df):\n",
"    df_targets = df[target_labels].copy()\n",
"    df_indep = df.drop(target_labels + target_id, axis=1)\n",
"    df_id = pd.DataFrame({target_id[0]: df[target_id[0]]})\n",
"    \n",
"    return df_indep, df_targets, df_id"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# train & test sets\n",
"features[\"train_indep\"], features[\"train_target\"], features[\"train_id\"] = set_features_indep_target(features[\"train\"])\n",
"features[\"test_indep\"], features[\"test_target\"], features[\"test_id\"] = set_features_indep_target(features[\"test\"])\n",
"\n",
"# print \n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"del features[\"train\"]\n",
"del features[\"test\"]\n",
"gc.collect()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 5.5. Save Samples"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Serialise & save the samples before any feature transformation. \n",
"<br/>This snapshot of the samples may be used for the population profiling"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"file_name = \"Step_05_Features\"\n",
"readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=features)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns), \n",
" \"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 5.2. Remove Near-Zero Variance\n",
"In order to reduce sparseness and invalid features, highly stationary ones were withdrawn. Features with constant counts less than or equal to a threshold were filtered out, to exclude highly constant and near-zero-variance features.\n",
"\n",
"The near-zero-variance rules are presented below:\n",
"- Frequency ratio: the frequency of the most prevalent value over the second most frequent value must be greater than a threshold;\n",
"- Percent of unique values: the number of unique values divided by the total number of samples must be less than a threshold\n",
"\n",
"<font style=\"font-weight:bold;color:red\">Configure:</font> the function\n",
"- The cutoff for the percentage of distinct values out of the number of total samples (upper limit), e.g. 10 * 100 / 100\n",
"<br/>	 → thresh_unique_cut\n",
"- The cutoff for the ratio of the most common value to the second most common value (lower limit), e.g. 95/5\n",
"<br/>	 → thresh_freq_cut"
]
},
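{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The two rules can be sketched on a toy column (an illustration only; the pipeline applies them via `preprocess.near_zero_var_df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Sketch of the two near-zero-variance rules on a toy column (illustration only)\n",
"_col = pd.Series([0] * 98 + [1, 2])\n",
"_freqs = _col.value_counts()\n",
"freq_ratio = _freqs.iloc[0] / _freqs.iloc[1]  # most common / second most common: 98.0\n",
"percent_unique = 100 * _col.nunique() / len(_col)  # unique values per 100 samples: 3.0\n",
"# a high freq_ratio & a low percent_unique indicate a near-zero-variance feature"
]
},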
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"thresh_unique_cut = 100\n",
"thresh_freq_cut = 1000\n",
"\n",
"excludes = []\n",
"file_name = \"Step_05_Preprocess_NZV_config\"\n",
"features[\"train_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"train_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_unique_cut=thresh_unique_cut, \n",
" thresh_freq_cut=thresh_freq_cut,\n",
" to_search=True)\n",
"\n",
"file_name = \"Step_05_Preprocess_NZV\"\n",
"readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
"\n",
"file_name = \"Step_05_Preprocess_NZV_config\"\n",
"features[\"test_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"test_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_unique_cut=thresh_unique_cut, \n",
" thresh_freq_cut=thresh_freq_cut,\n",
" to_search=False)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 5.3. Remove Highly Linearly Correlated\n",
"\n",
"In this step, features that were highly linearly correlated were excluded. \n",
"\n",
"<font style=\"font-weight:bold;color:red\">Configure:</font> the function\n",
"- A numeric value for the pair-wise absolute correlation cutoff. e.g. 0.95\n",
"<br/>	 → thresh_corr_cut"
]
},
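{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The cutoff can be sketched with pandas' correlation matrix (an illustration only; the pipeline applies it via `preprocess.high_linear_correlation_df`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Sketch of pair-wise absolute-correlation filtering on toy columns (illustration only)\n",
"_df = pd.DataFrame({\"a\": [1, 2, 3, 4], \"b\": [2, 4, 6, 8], \"c\": [4, 3, 1, 1]})\n",
"_upper = _df.corr().abs().where(np.triu(np.ones((3, 3)), k=1).astype(bool))\n",
"_to_drop = [c for c in _upper.columns if any(_upper[c] > 0.95)]\n",
"# \"b\" duplicates \"a\" linearly, so it exceeds the 0.95 cutoff and is dropped"
]
},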
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"thresh_corr_cut = 0.95\n",
"\n",
"excludes = list(features_types_group[\"CATEGORICAL\"])\n",
"file_name = \"Step_05_Preprocess_Corr_config\"\n",
"features[\"train_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"train_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_corr_cut=thresh_corr_cut,\n",
" to_search=True)\n",
"\n",
"file_name = \"Step_05_Preprocess_Corr\"\n",
"readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
"\n",
"file_name = \"Step_05_Preprocess_Corr_config\"\n",
"features[\"test_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"test_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_corr_cut=thresh_corr_cut,\n",
" to_search=False)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 5.4. Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Produce a descriptive statistics report of 'Categorical', 'Continuous', & 'TARGET' features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# columns\n",
"file_name = \"Step_05_Data_ColumnNames_Train\"\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
" data=list(features[\"train_indep\"].columns.values), append=False)\n",
"\n",
"# Sample - Train\n",
"file_name = \"Step_05_Stats_Categorical_Train\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
" file_name=file_name)\n",
"file_name = \"Step_05_Stats_Continuous_Train\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)\n",
"\n",
"# Sample - Test\n",
"file_name = \"Step_05_Stats_Categorical_Test\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
" file_name=file_name)\n",
"file_name = \"Step_05_Stats_Continuous_Test\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 6. Recategorise & Transform"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 6.1. Recategorise"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Define the factorisation function to generate dummy features for the categorical features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"def factorise_settings(max_categories_frac, min_categories_num, exclude_zero):\n",
" categories_dic = dict()\n",
" labels_dic = dict()\n",
" dtypes_dic = dict()\n",
" dummies = []\n",
" \n",
" for f_name in features_types_group[\"CATEGORICAL\"]:\n",
" if f_name in features[\"train_indep\"]:\n",
" # find top & valid states\n",
" values, freqs = np.unique(features[\"train_indep\"][f_name], return_counts=True)  # replaces the deprecated stats.itemfreq\n",
" summaries = pd.DataFrame({\"value\": values, \"freq\": freqs})\n",
" summaries[\"value\"] = list(map(int, summaries[\"value\"]))\n",
" summaries = summaries.sort_values(\"freq\", ascending=False)\n",
" summaries = list(summaries[\"value\"])\n",
"\n",
" # exclude zero state\n",
" if exclude_zero is True and len(summaries) > 1:\n",
" summaries = [s for s in summaries if s != 0]\n",
" \n",
" # if included in the states\n",
" summaries = [v for v in summaries if v in set(features_states_values[f_name])]\n",
"\n",
" # limit number of states\n",
" max_cnt = max(int(len(summaries) * max_categories_frac), min_categories_num)\n",
"\n",
" # set states\n",
" categories_dic[f_name] = summaries[0:max_cnt]\n",
" labels_dic[f_name] = [f_name + \"_\" + str(c) for c in categories_dic[f_name]]\n",
" dtypes_dic = {**dtypes_dic,\n",
" **dict(zip(labels_dic[f_name], [pd.Series(dtype='i') for _ in range(len(categories_dic[f_name]))]))}\n",
" dummies += labels_dic[f_name] \n",
" \n",
" dtypes_dic = pd.DataFrame(dtypes_dic).dtypes\n",
"\n",
" # print \n",
" print(\"Total Categorical Variables : \", len(categories_dic.keys()), \n",
" \"; Total Number of Dummy Variables: \", sum([len(categories_dic[f_name]) for f_name in categories_dic.keys()]))\n",
" return categories_dic, labels_dic, dtypes_dic, dummies"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Select categories: by order of freq., max_categories_frac, & min_categories_num\n",
"\n",
"<br/><font style=\"font-weight:bold;color:red\">Configure:</font> The input arguments are:\n",
"- Specify the maximum fraction of categories a feature can keep\n",
"<br/>	 → max_categories_frac\n",
"- Specify the minimum number of categories a feature can have\n",
"<br/>	 → min_categories_num\n",
"- Specify whether to exclude the state '0' (zero). State zero in our features represents 'any other state', including NULL\n",
"<br/>	 → exclude_zero = False"
]
},
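{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For intuition, a minimal self-contained sketch of how the category cap is derived (toy values; not part of the pipeline):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: how max_categories_frac & min_categories_num cap\n",
"# the number of states kept per feature; the toy state list is hypothetical.\n",
"states = [3, 0, 7, 5, 1]  # states already sorted by descending frequency\n",
"max_categories_frac = 0.90\n",
"min_categories_num = 1\n",
"max_cnt = max(int(len(states) * max_categories_frac), min_categories_num)\n",
"print(states[:max_cnt])  # the top states that become dummy variables"
]
},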
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"max_categories_frac = 0.90\n",
"min_categories_num = 1\n",
"exclude_zero = False # if possible remove state zero\n",
"\n",
"categories_dic, labels_dic, dtypes_dic, features_types_group[\"DUMMIES\"] = \\\n",
" factorise_settings(max_categories_frac, min_categories_num, exclude_zero)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Manually add dummy variables to the dataframe & remove the original Categorical variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"features[\"train_indep_temp\"] = preprocess.factoring_feature_wise(features[\"train_indep\"], categories_dic, labels_dic, dtypes_dic, threaded=False)\n",
"features[\"test_indep_temp\"] = preprocess.factoring_feature_wise(features[\"test_indep\"], categories_dic, labels_dic, dtypes_dic, threaded=False)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep_temp\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep_temp\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Set the factorised features as the working train & test sets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"features[\"train_indep\"] = features[\"train_indep_temp\"].copy(True)\n",
"features[\"test_indep\"] = features[\"test_indep_temp\"].copy(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"del features[\"train_indep_temp\"]\n",
"del features[\"test_indep_temp\"]\n",
"gc.collect()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 6.2. Remove - Near Zero Variance"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Optional: Remove more features with near zero variance, after the factorisation step.\n",
"<font style=\"font-weight:bold;color:red\">Configure:</font> the function"
]
},
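{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For intuition, a minimal sketch of the two near-zero-variance criteria (in the style of R's caret nearZeroVar; this is not the project's near_zero_var_df implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only, on a toy feature.\n",
"import numpy as np\n",
"\n",
"x = np.array([0] * 98 + [1, 2])  # a highly imbalanced toy feature\n",
"vals, counts = np.unique(x, return_counts=True)\n",
"counts = np.sort(counts)[::-1]\n",
"freq_ratio = counts[0] / counts[1]  # most common vs. second most common value\n",
"pct_unique = 100.0 * len(vals) / len(x)  # distinct values as % of samples\n",
"# a feature is flagged when freq_ratio exceeds thresh_freq_cut\n",
"# while pct_unique stays below thresh_unique_cut\n",
"print(freq_ratio, pct_unique)"
]
},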
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# the cutoff for the percentage of distinct values out of the total number of samples (upper limit), e.g. 10 * 100 / 100\n",
"thresh_unique_cut = 100\n",
"# the cutoff for the ratio of the most common value to the second most common value (lower limit), e.g. 95/5\n",
"thresh_freq_cut = 1000\n",
"\n",
"excludes = []\n",
"file_name = \"Step_06_Preprocess_NZV_config\"\n",
"features[\"train_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"train_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_unique_cut=thresh_unique_cut, \n",
" thresh_freq_cut=thresh_freq_cut,\n",
" to_search=True)\n",
"\n",
"file_name = \"Step_06_Preprocess_NZV\"\n",
"readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
"\n",
"file_name = \"Step_06_Preprocess_NZV_config\"\n",
"features[\"test_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"test_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_unique_cut=thresh_unique_cut, \n",
" thresh_freq_cut=thresh_freq_cut,\n",
" to_search=False)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 6.3. Remove Highly Linearly Correlated"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Optional: Remove more features that are highly linearly correlated, after the factorisation step.\n",
"<font style=\"font-weight:bold;color:red\">Configure:</font> the function"
]
},
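{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A minimal sketch of a pair-wise absolute-correlation cutoff on toy data (hypothetical; not the project's high_linear_correlation_df implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.RandomState(0)\n",
"df = pd.DataFrame({\"a\": rng.randn(200)})\n",
"df[\"b\"] = df[\"a\"] * 2.0 + rng.randn(200) * 0.01  # nearly collinear with 'a'\n",
"df[\"c\"] = rng.randn(200)\n",
"corr = df.corr().abs()\n",
"# keep only the upper triangle, then drop any column correlated above the cutoff\n",
"upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))\n",
"to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]\n",
"print(to_drop)"
]
},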
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# A numeric value for the pair-wise absolute correlation cutoff. e.g. 0.95\n",
"thresh_corr_cut = 0.95\n",
"\n",
"excludes = []\n",
"file_name = \"Step_06_Preprocess_Corr_config\"\n",
"features[\"train_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"train_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_corr_cut=thresh_corr_cut,\n",
" to_search=True)\n",
"\n",
"file_name = \"Step_06_Preprocess_Corr\"\n",
"readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
"\n",
"file_name = \"Step_06_Preprocess_Corr_config\"\n",
"features[\"test_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"test_indep\"], \n",
" excludes=excludes, \n",
" file_name=file_name, \n",
" thresh_corr_cut=thresh_corr_cut,\n",
" to_search=False)\n",
"\n",
"# print\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 6.4. Descriptive Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# columns\n",
"file_name = \"Step_06_4_Data_ColumnNames_Train\"\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
" data=list(features[\"train_indep\"].columns.values), append=False)\n",
"\n",
"# Sample - Train\n",
"file_name = \"Step_06_4_Stats_Categorical_Train\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
" file_name=file_name)\n",
"file_name = \"Step_06_4_Stats_Continuous_Train\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)\n",
"\n",
"# Sample - Test\n",
"file_name = \"Step_06_4_Stats_Categorical_Test\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
" file_name=file_name)\n",
"file_name = \"Step_06_4_Stats_Continuous_Test\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"### 6.5. Transformations"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Transformation:</font> scale\n",
"<font style=\"font-weight:bold;color:brown\">Note:</font> It is highly resource intensive"
]
},
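{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A minimal sketch of why the scaling parameters are fitted on the training set and then reused on the test set (mirroring how method_args is passed from the train call to the test call above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only, with toy arrays.\n",
"import numpy as np\n",
"\n",
"train = np.array([1.0, 2.0, 3.0, 4.0])\n",
"test = np.array([2.0, 6.0])\n",
"mu, sigma = train.mean(), train.std()  # fitted on the training set only\n",
"train_scaled = (train - mu) / sigma\n",
"test_scaled = (test - mu) / sigma  # reuse train statistics; never refit on test\n",
"print(train_scaled.mean())  # ~0 by construction"
]
},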
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"transform_type = \"scale\"\n",
"kwargs = {\"with_mean\": True}\n",
"method_args = dict()\n",
"excludes = list(features_types_group[\"CATEGORICAL\"]) + list(features_types_group[\"DUMMIES\"])\n",
"\n",
"features[\"train_indep\"], method_args = preprocess.transform_df(df=features[\"train_indep\"], excludes=excludes, \n",
" transform_type=transform_type, threaded=False, \n",
" method_args=method_args, **kwargs)\n",
"features[\"test_indep\"], _ = preprocess.transform_df(df=features[\"test_indep\"], excludes=excludes, \n",
" transform_type=transform_type, threaded=False, \n",
" method_args=method_args, **kwargs)\n",
"\n",
"# print(\"Method arguments:\", method_args)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Transformation:</font> Yeo-Johnson\n",
"<font style=\"font-weight:bold;color:brown\">Note:</font> It is highly resource intensive"
]
},
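{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For reference, the Yeo-Johnson transform is a piecewise power function; a minimal sketch of its non-negative branch with the lmbda used above (illustrative only, not the project's implementation):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only.\n",
"import numpy as np\n",
"\n",
"def yeo_johnson_nonneg(x, lmbda):\n",
"    # Yeo & Johnson (2000), x >= 0 branch: ((x + 1)**lmbda - 1) / lmbda\n",
"    x = np.asarray(x, dtype=float)\n",
"    if lmbda == 0:\n",
"        return np.log1p(x)\n",
"    return ((x + 1.0) ** lmbda - 1.0) / lmbda\n",
"\n",
"print(yeo_johnson_nonneg([0.0, 3.0], -0.5))  # [0.0, 1.0]"
]
},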
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"transform_type = \"yeo_johnson\"\n",
"kwargs = {\"lmbda\": -0.5, \"derivative\": 0, \"epsilon\": np.finfo(float).eps, \"inverse\": False}\n",
"method_args = dict()\n",
"excludes = list(features_types_group[\"CATEGORICAL\"]) + list(features_types_group[\"DUMMIES\"])\n",
"\n",
"features[\"train_indep\"], method_args = preprocess.transform_df(df=features[\"train_indep\"], excludes=excludes, \n",
" transform_type=transform_type, threaded=False, \n",
" method_args=method_args, **kwargs)\n",
"features[\"test_indep\"], _ = preprocess.transform_df(df=features[\"test_indep\"], excludes=excludes, \n",
" transform_type=transform_type, threaded=False, \n",
" method_args=method_args, **kwargs)\n",
"\n",
"# print(\"Method arguments:\", method_args)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Visual verification"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 6.6. Summary Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Statistics report for 'Categorical', 'Continuous', & 'TARGET' variables\n",
"# columns\n",
"file_name = \"Step_06_6_Data_ColumnNames_Train\"\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
" data=list(features[\"train_indep\"].columns.values), append=False)\n",
"\n",
"# Sample - Train\n",
"file_name = \"Step_06_6_Stats_Categorical_Train\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
" file_name=file_name)\n",
"file_name = \"Step_06_6_Stats_Continuous_Train\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)\n",
"\n",
"# Sample - Test\n",
"file_name = \"Step_06_6_Stats_Categorical_Test\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
" file_name=file_name)\n",
"file_name = \"Step_06_6_Stats_Continuous_Test\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 7. Rank & Select Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Configure:</font> the general settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# select the target variable\n",
"target_feature = \"label365\" # \"label30\", \"label365\"\n",
"\n",
"# number of trials\n",
"num_trials = 1\n",
"\n",
"model_rank = dict()\n",
"o_summaries_df = dict()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.1. Define"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Random forest classifier (Breiman)\n",
"<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
]
},
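{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A minimal illustration of impurity-based feature ranking via feature_importances_ in scikit-learn (toy data and arbitrary hyper-parameters, not the project's settings):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only.\n",
"import numpy as np\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"rng = np.random.RandomState(0)\n",
"X = rng.randn(300, 3)\n",
"y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal\n",
"clf = RandomForestClassifier(n_estimators=50, random_state=0)\n",
"clf.fit(X, y)\n",
"order = np.argsort(clf.feature_importances_)[::-1]\n",
"print(order)  # feature 0 is ranked first"
]
},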
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"def rank_random_forest_brieman(features_indep_arg, features_target_arg, num_trials):\n",
" num_settings = 3\n",
" o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
" model_rank = [None] * (num_trials * num_settings)\n",
"\n",
" # trials \n",
" for i in range(num_trials): \n",
" print(\"Trial: \" + str(i))\n",
" # setting-1\n",
" s_i = i\n",
" model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
" \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": False, \"class_weight\": None})\n",
"\n",
" # setting-2\n",
" s_i = num_trials + i\n",
" model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 50, \"min_samples_leaf\": 25,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
" \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": False, \"class_weight\": None})\n",
"\n",
" # setting-3\n",
" s_i = (num_trials * 2) + i\n",
" model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 40, \"min_samples_leaf\": 20,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
" \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": True, \"class_weight\": None})\n",
"\n",
" for i in range((num_trials * num_settings)):\n",
" o_summaries_df[i]['Importance'] = list(model_rank[i].feature_importances_)\n",
" o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
" o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
" o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
" return model_rank, o_summaries_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Gradient Boosted Regression Trees (GBRT) \n",
"<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"def rank_gbrt(features_indep_arg, features_target_arg, num_trials):\n",
" num_settings = 3\n",
" o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
" model_rank = [None] * (num_trials * num_settings)\n",
"\n",
" # trials \n",
" for i in range(num_trials): \n",
" print(\"Trial: \" + str(i))\n",
" # setting-1\n",
" s_i = i\n",
" model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
" features_indep_arg.values, features_target_arg.values, \n",
" **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 10, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
" \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
" \n",
" # setting-2\n",
" s_i = num_trials + i\n",
" model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 5, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
" \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
"\n",
" # setting-3\n",
" s_i = (num_trials * 2) + i\n",
" model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
" \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 3, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
" \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
"\n",
" for i in range((num_trials * num_settings)):\n",
" o_summaries_df[i]['Importance'] = list(model_rank[i].feature_importances_)\n",
" o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
" o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
" o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
" return model_rank, o_summaries_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Randomized Logistic Regression\n",
"<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"def rank_randLogit(features_indep_arg, features_target_arg, num_trials):\n",
" num_settings = 3\n",
" o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
" model_rank = [None] * (num_trials * num_settings)\n",
"\n",
" # trials \n",
" for i in range(num_trials): \n",
" print(\"Trial: \" + str(i))\n",
" # setting-1\n",
" s_i = i\n",
" model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.75, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
" \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
"\n",
" # setting-2\n",
" s_i = num_trials + i\n",
" model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.50, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
" \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
"\n",
" # setting-3\n",
" s_i = (num_trials * 2) + i\n",
" model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
" features_indep_arg.values, features_target_arg.values,\n",
" **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.90, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
" \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
" \n",
" for i in range((num_trials * num_settings)):\n",
" o_summaries_df[i]['Importance'] = list(model_rank[i].scores_)\n",
" o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
" o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
" o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
" return model_rank, o_summaries_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.2. Run"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Run one or more feature ranking methods and trials"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Random forest classifier (Breiman)\n",
"<font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"rank_model = \"rfc\"\n",
"model_rank[rank_model] = dict() \n",
"o_summaries_df[rank_model] = dict() \n",
"model_rank[rank_model], o_summaries_df[rank_model] = rank_random_forest_brieman(\n",
" features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Gradient Boosted Regression Trees (GBRT)\n",
"<font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"rank_model = \"gbrt\"\n",
"model_rank[rank_model] = dict() \n",
"o_summaries_df[rank_model] = dict() \n",
"model_rank[rank_model], o_summaries_df[rank_model] = rank_gbrt(\n",
" features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:blue\">Ranking Method</font>: Randomized Logistic Regression\n",
"<font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"rank_model = \"randLogit\"\n",
"model_rank[rank_model] = dict() \n",
"o_summaries_df[rank_model] = dict() \n",
"model_rank[rank_model], o_summaries_df[rank_model] = rank_randLogit(\n",
" features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.3. Summaries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# combine scores\n",
"def rank_summarise (features_arg, o_summaries_df_arg):\n",
" summaries_temp = {'Order_avg': [], 'Order_max': [], 'Order_min': [], 'Importance_avg': []}\n",
" summary_order = []\n",
" summary_importance = []\n",
" \n",
" for f_name in list(features_arg.columns.values):\n",
" for i in range(len(o_summaries_df_arg)):\n",
" summary_order.append(o_summaries_df_arg[i][o_summaries_df_arg[i]['Name'] == f_name]['Order'].values)\n",
" summary_importance.append(o_summaries_df_arg[i][o_summaries_df_arg[i]['Name'] == f_name]['Importance'].values)\n",
"\n",
" summaries_temp['Order_avg'].append(statistics.mean(np.concatenate(summary_order)))\n",
" summaries_temp['Order_max'].append(max(np.concatenate(summary_order)))\n",
" summaries_temp['Order_min'].append(min(np.concatenate(summary_order)))\n",
" summaries_temp['Importance_avg'].append(statistics.mean(np.concatenate(summary_importance)))\n",
"\n",
" summaries_df = pd.DataFrame({'Name': list(features_arg.columns.values)})\n",
" summaries_df['Order_avg'] = summaries_temp['Order_avg']\n",
" summaries_df['Order_max'] = summaries_temp['Order_max']\n",
" summaries_df['Order_min'] = summaries_temp['Order_min']\n",
" summaries_df['Importance_avg'] = summaries_temp['Importance_avg']\n",
" summaries_df = summaries_df.sort_values(['Order_avg'], ascending = [1])\n",
" return summaries_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# combine scores\n",
"summaries_df = dict()\n",
"\n",
"for rank_model in o_summaries_df.keys():\n",
" summaries_df[rank_model] = dict()\n",
" summaries_df[rank_model] = rank_summarise(features[\"train_indep\"], o_summaries_df[rank_model])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"for rank_model in model_rank.keys():\n",
" file_name = \"Step_07_Model_Train_model_rank_\" + rank_model\n",
" readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=model_rank[rank_model])\n",
" \n",
" file_name = \"Step_07_Model_Train_model_rank_summaries_\" + rank_model\n",
" readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=o_summaries_df[rank_model])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.4. Select Top Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Configure:</font> the selection method"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"rank_model = \"rfc\"\n",
"file_name = \"Step_07_Top_Features_\" + rank_model\n",
"rank_top_features_max = 400\n",
"rank_top_features_score_min = 0.1 * (10 ** -20)  # note: '^' is bitwise XOR in Python, not exponentiation\n",
"\n",
"# sort features\n",
"features_names_selected = summaries_df[rank_model]['Name'][summaries_df[rank_model]['Importance_avg'] >= rank_top_features_score_min]\n",
"features_names_selected = (features_names_selected[0:rank_top_features_max]).tolist()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Save"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# save to CSV\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, data=features_names_selected, append=False, header=False)\n",
"\n",
"# print \n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")\n",
"print(\"List of sorted features, which can be modified:\\n \" + CONSTANTS.io_path + file_name + \".csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Configure:</font> the selected features manually, if necessary"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"file_name = \"Step_07_Top_Features_rfc_adhoc\" \n",
"\n",
"features_names_selected = readers_writers.load_csv(path=CONSTANTS.io_path, title=file_name, dataframing=False)[0]\n",
"features_names_selected = [f.replace(\"\\n\", \"\") for f in features_names_selected]\n",
"display(pd.DataFrame(features_names_selected))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify the top features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# print \n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns), \n",
" \";\\nNumber of top columns: \", len(features[\"train_indep\"][features_names_selected].columns)) \n",
"print(\"features: {train: \", len(features[\"train_indep\"][features_names_selected]), \", test: \", len(features[\"test_indep\"][features_names_selected]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.5. Summary Statistics"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# columns\n",
"file_name = \"Step_07_Data_ColumnNames_Train\"\n",
"readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
" data=list(features[\"train_indep\"][features_names_selected].columns.values), append=False)\n",
"\n",
"# Sample - Train\n",
"file_name = \"Step_07_Stats_Categorical_Train\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"][features_names_selected], includes=features_types_group[\"CATEGORICAL\"], \n",
" file_name=file_name)\n",
"file_name = \"Step_07_Stats_Continuous_Train\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"][features_names_selected], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)\n",
"\n",
"# Sample - Test\n",
"file_name = \"Step_07_Stats_Categorical_Test\"\n",
"o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"][features_names_selected], includes=features_types_group[\"CATEGORICAL\"],\n",
" file_name=file_name)\n",
"file_name = \"Step_07_Stats_Continuous_Test\"\n",
"o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"][features_names_selected], includes=features_types_group[\"CONTINUOUS\"], \n",
" file_name=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 7.6. Save Features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"file_name = \"Step_07_Features\"\n",
"readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=features)\n",
"\n",
"# print a summary of dimensions\n",
"print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, file_name + \".bz2\")).st_size)\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns))\n",
"print(\"Number of rows: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## 8. Model"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:orange\">Load the Saved Samples & Features Ranking:</font> \n",
"<br/> This step is optional. It loads the serialised & compressed outputs of Step 7."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# open features\n",
"file_name = \"Step_07_Features\"\n",
"features = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)\n",
"\n",
"# print a summary of dimensions\n",
"print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, file_name + \".bz2\")).st_size)\n",
"print(\"Number of columns: \", len(features[\"train_indep\"].columns))\n",
"print(\"Number of rows: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# open scoring model files\n",
"rank_models = [\"rfc\", \"gbrt\", \"randLogit\"]\n",
"model_rank = dict()\n",
"o_summaries_df = dict()\n",
"\n",
"for rank_model in rank_models:\n",
" file_name = \"Step_07_Model_Train_model_rank_\" + rank_model\n",
" if not readers_writers.exists_serialised(path=CONSTANTS.io_path, title=file_name, ext=\"bz2\"):\n",
" continue\n",
"\n",
"    model_rank[rank_model] = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)\n",
"\n",
" file_name = \"Step_07_Model_Train_model_rank_summaries_\" + rank_model\n",
" o_summaries_df[rank_model] = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Verify features visually"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
"display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 8.1. Initialise"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 8.1.1. Algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Configure:</font> the training algorithm. Run only one of the algorithm cells below; the last cell executed sets the `method_name` & `kwargs` used throughout the rest of this section."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 1</font>: Random Forest"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"method_name = \"rfc\"\n",
"kwargs = {\"n_estimators\": 20, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 100,\n",
" \"min_samples_leaf\": 50, \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto',\n",
" \"max_leaf_nodes\": None, \"bootstrap\": True, \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None,\n",
" \"verbose\": 0, \"warm_start\": False, \"class_weight\": \"balanced_subsample\"}"
]
},
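{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The `kwargs` above follow the constructor signature of scikit-learn's `RandomForestClassifier`. As a minimal, self-contained sketch on synthetic data (illustrative only, not part of the TCARER pipeline), such a dictionary can be unpacked straight into the estimator:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: unpack a kwargs dictionary into a scikit-learn estimator\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)\n",
"demo_kwargs = {\"n_estimators\": 20, \"min_samples_split\": 100, \"min_samples_leaf\": 50,\n",
"               \"class_weight\": \"balanced_subsample\", \"n_jobs\": -1}\n",
"demo_model = RandomForestClassifier(**demo_kwargs).fit(X_demo, y_demo)\n",
"print(demo_model.predict_proba(X_demo)[:5, 1])  # probability of the positive class"
]
},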
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 2</font>: Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"method_name = \"lr\"\n",
"kwargs = {\"penalty\": 'l1', \"dual\": False, \"tol\": 0.0001, \"C\": 1, \"fit_intercept\": True, \"intercept_scaling\": 1,\n",
" \"class_weight\": None, \"random_state\": None, \"solver\": 'liblinear', \"max_iter\": 100, \"multi_class\": 'ovr',\n",
" \"verbose\": 0, \"warm_start\": False, \"n_jobs\": -1}"
]
},
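{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"With `penalty: 'l1'` & the `liblinear` solver, the regularisation drives some coefficients to exactly zero, so the fitted model performs an implicit feature selection. A small self-contained illustration on synthetic data (not part of the pipeline):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: L1 regularisation induces sparse coefficients\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)\n",
"demo_model = LogisticRegression(penalty=\"l1\", C=0.1, solver=\"liblinear\").fit(X_demo, y_demo)\n",
"print(\"non-zero coefficients:\", int(np.count_nonzero(demo_model.coef_)), \"of\", demo_model.coef_.size)"
]
},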
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 3</font>: Logistic Regression with Cross-Validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"method_name = \"lr_cv\"\n",
"kwargs = {\"Cs\": 10, \"fit_intercept\": True, \"cv\": None, \"dual\": False, \"penalty\": 'l2', \"scoring\": None, \n",
" \"solver\": 'lbfgs', \"tol\": 0.0001, \"max_iter\": 10, \"class_weight\": None, \"n_jobs\": -1, \"verbose\": 0, \n",
" \"refit\": True, \"intercept_scaling\": 1.0, \"multi_class\": \"ovr\", \"random_state\": None}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 4</font>: Neural Network"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"method_name = \"nn\"\n",
"kwargs = {\"solver\": 'lbfgs', \"alpha\": 1e-5, \"hidden_layer_sizes\": (5, 2), \"random_state\": 1}"
]
},
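{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"These `kwargs` match scikit-learn's `MLPClassifier`. Multi-layer perceptrons are sensitive to feature scaling, so it is generally advisable to standardise the inputs first; a minimal sketch on synthetic data (illustrative only):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: scale inputs before fitting a small MLP\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.neural_network import MLPClassifier\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)\n",
"demo_kwargs = {\"solver\": \"lbfgs\", \"alpha\": 1e-5, \"hidden_layer_sizes\": (5, 2), \"random_state\": 1}\n",
"demo_model = make_pipeline(StandardScaler(), MLPClassifier(**demo_kwargs)).fit(X_demo, y_demo)\n",
"print(\"train accuracy:\", demo_model.score(X_demo, y_demo))"
]
},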
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 5</font>: k-Nearest Neighbours"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"method_name = \"knc\"\n",
"kwargs = {\"n_neighbors\": 5, \"weights\": 'distance', \"algorithm\": 'auto', \"leaf_size\": 30,\n",
" \"p\": 2, \"metric\": 'minkowski', \"metric_params\": None, \"n_jobs\": -1}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 6</font>: Decision Tree"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"method_name = \"dtc\"\n",
"kwargs = {\"criterion\": 'gini', \"splitter\": 'best', \"max_depth\": None, \"min_samples_split\": 30,\n",
" \"min_samples_leaf\": 30, \"min_weight_fraction_leaf\": 0.0, \"max_features\": None,\n",
" \"random_state\": None, \"max_leaf_nodes\": None, \"class_weight\": None, \"presort\": False}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 7</font>: Gradient Boosting Classifier"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"method_name = \"gbc\"\n",
"kwargs = {\"loss\": 'deviance', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 30,\n",
" \"min_samples_leaf\": 30, \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 3, \"init\": None, \"random_state\": None,\n",
" \"max_features\": None, \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": 'auto'}"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:brown\">Algorithm 8</font>: Naive Bayes<br/>\n",
"Note: the features must be non-negative"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"method_name = \"nb\"\n",
"kwargs = {\"alpha\": 1.0, \"fit_prior\": True, \"class_prior\": None}"
]
},
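{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The `kwargs` here (`alpha`, `fit_prior` & `class_prior`) are consistent with scikit-learn's `MultinomialNB`, assuming that is the estimator behind `\"nb\"`; this is why the features must be non-negative. One common workaround, sketched below on toy data, is to rescale each feature into [0, 1] first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: rescale features to be non-negative for MultinomialNB\n",
"import numpy as np\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"\n",
"X_demo = np.array([[-1.0, 2.0], [0.5, -3.0], [2.0, 1.0], [-0.5, 0.0]])\n",
"y_demo = np.array([0, 1, 0, 1])\n",
"X_nonneg = MinMaxScaler().fit_transform(X_demo)  # rescale each feature into [0, 1]\n",
"demo_model = MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None).fit(X_nonneg, y_demo)\n",
"print(demo_model.predict(X_nonneg))"
]
},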
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"#### 8.1.2. Other Settings"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<font style=\"font-weight:bold;color:red\">Configure:</font> other modelling settings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# select the target variable\n",
"target_feature = \"label365\" # \"label30\" , \"label365\" \n",
"\n",
"# file name\n",
"file_name = \"Step_09_Model_\" + method_name + \"_\" + target_feature\n",
"\n",
"# initialise\n",
"training_method = TrainingMethod(method_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 8.2. Features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"sample_train = features[\"train_indep\"][features_names_selected] # features[\"train_indep\"][features_names_selected], features[\"train_indep\"]\n",
"sample_train_target = features[\"train_target\"][target_feature] # features[\"train_target\"][target_feature]\n",
"sample_test = features[\"test_indep\"][features_names_selected] # features[\"test_indep\"][features_names_selected], features[\"test_indep\"]\n",
"sample_test_target = features[\"test_target\"][target_feature] # features[\"test_target\"][target_feature]"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 8.3. Fit"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Fit the model using the training sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": false
},
"outputs": [],
"source": [
"o_summaries = dict()\n",
"# Fit\n",
"model = training_method.train(sample_train, sample_train_target, **kwargs)\n",
"training_method.save_model(path=CONSTANTS.io_path, title=file_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Optionally, load a previously saved model\n",
"# training_method.load(path=CONSTANTS.io_path, title=file_name)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# short summary\n",
"o_summaries = training_method.train_summaries()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Predict & report performance, using the train sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": true
},
"outputs": [],
"source": [
"o_summaries = dict()\n",
"# predict\n",
"predicted = training_method.predict(sample_train, \"train\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# short summary\n",
"o_summaries = training_method.predict_summaries(pd.Series(sample_train_target), \"train\")\n",
"\n",
"# Print the main performance statistics\n",
"for k in o_summaries.keys():\n",
" print(k, o_summaries[k])\n",
"\n",
"# Print a selection of statistics by risk band\n",
"o_summaries = training_method.predict_summaries_risk_bands(pd.Series(sample_train_target), \"train\", np.arange(0, 1.05, 0.05))\n",
"display(o_summaries)"
]
},
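{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A risk-band summary of this kind can be reproduced with `pandas.cut`: bin the predicted probabilities into the 5%-wide bands defined by `np.arange(0, 1.05, 0.05)`, then aggregate the observed outcomes per band. The sketch below uses simulated predictions & is not the pipeline's own implementation (which is inside `TrainingMethod`):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: summarise simulated predictions by 5% risk bands\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"rng = np.random.RandomState(0)\n",
"prob = rng.uniform(0, 1, 1000)                          # simulated predicted risks\n",
"outcome = (rng.uniform(0, 1, 1000) < prob).astype(int)  # simulated observed outcomes\n",
"\n",
"bands = pd.cut(prob, bins=np.arange(0, 1.05, 0.05), include_lowest=True)\n",
"summary = (pd.DataFrame({\"band\": bands, \"outcome\": outcome})\n",
"           .groupby(\"band\", observed=True)[\"outcome\"].agg([\"count\", \"mean\"]))\n",
"display(summary.head())"
]
},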
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 8.4. Predict"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Predict & report performance, using the test sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": false
},
"outputs": [],
"source": [
"o_summaries = dict()\n",
"# predict\n",
"predicted = training_method.predict(sample_test, \"test\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": false
},
"outputs": [],
"source": [
"# short summary\n",
"o_summaries = training_method.predict_summaries(pd.Series(sample_test_target), \"test\")\n",
"\n",
"# Print the main performance statistics\n",
"for k in o_summaries.keys():\n",
" print(k, o_summaries[k])\n",
"\n",
"# Print a selection of statistics by risk band\n",
"o_summaries = training_method.predict_summaries_risk_bands(pd.Series(sample_test_target), \"test\", np.arange(0, 1.05, 0.05))\n",
"display(o_summaries)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"source": [
"### 8.5. Cross-Validation"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Perform k-fold cross-validation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"o_summaries = dict()\n",
"score = training_method.cross_validate(sample_test, sample_test_target, scoring=\"neg_mean_squared_error\", cv=10)"
]
},
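{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For reference, the same scoring & fold settings can be reproduced directly with scikit-learn's `cross_val_score`, shown here on synthetic data (it is assumed, but not shown in this notebook, that `TrainingMethod.cross_validate` delegates to the scikit-learn cross-validation utilities). Note that `neg_mean_squared_error` scores are negated, so values closer to zero are better:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# Illustrative sketch only: 10-fold cross-validation with the same scoring name\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)\n",
"scores = cross_val_score(LogisticRegression(max_iter=200), X_demo, y_demo,\n",
"                         scoring=\"neg_mean_squared_error\", cv=10)\n",
"print(\"mean:\", scores.mean(), \"std:\", scores.std())"
]
},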
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"# short summary\n",
"o_summaries = training_method.cross_validate_summaries()\n",
"print(\"Scores: \", o_summaries)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### 8.6. Save"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Save the trained model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"scrolled": false
},
"outputs": [],
"source": [
"training_method.save_model(path=CONSTANTS.io_path, title=file_name)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<br/><br/>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Fin!"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}