T-CARER / Git / [973ab6] /TCARER

Models:
RaymondKing/
T-CARER
Downloads: 1
[973ab6]: / TCARER_TensorFlow.ipynb
History
Download this file
1701 lines (1700 with data), 44.1 kB

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "# Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER)\n",
    "## <font style=\"font-weight:bold;color:gray\">Wide &amp; Deep Neural Network (WDNN) Model</font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "[1. Initialise](#1.-Initialise)\n",
    "<br\\>\n",
    "[2. Read Data &amp; Store CSV](#2.-Read-Data-&amp;-Store-CSV)\n",
    "<br\\>\n",
    "[3. Set TensorFlow Settings](#3.-Set-TensorFlow-Settings)\n",
    "<br\\>\n",
    "[4. Model](#4.-Model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "This Jupyter iPython Notebook applies the Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER).\n",
    "\n",
    "This Notebook extract aggregated features from the MySQL database, &amp; then pre-process, configure &amp; apply a Wide &amp; Deep Neural Network (WDNN) model. \n",
    "\n",
    "Note that some of the scripts are optional or subject to some pre-configurations. Please refer to the comments &amp; the project documentations for further details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<hr\\>\n",
    "<font size=\"1\" color=\"gray\">Copyright 2017 The Project Authors. All Rights Reserved.\n",
    "\n",
    "It is licensed under the Apache License, Version 2.0. you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "\n",
    "  <a href=\"http://www.apache.org/licenses/LICENSE-2.0\">http://www.apache.org/licenses/LICENSE-2.0</a>\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.</font>\n",
    "<hr\\>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## 1.  Initialise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Reload modules \n",
    "# It is an optional step. It is useful to run when external Python modules are being modified\n",
    "# It is reloading all modules (except those excluded by %aimport) every time before executing the Python code typed.\n",
    "# Note: It may conflict with serialisation, when external modules are being modified\n",
    "\n",
    "# %load_ext autoreload \n",
    "# %autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Import Python libraries\n",
    "import logging\n",
    "import os\n",
    "import sys\n",
    "import gc\n",
    "import pandas as pd\n",
    "from IPython.display import display, HTML\n",
    "from collections import OrderedDict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Import local Python modules\n",
    "from Configs.CONSTANTS import CONSTANTS\n",
    "from Configs.Logger import Logger\n",
    "from Features.Variables import Variables\n",
    "from ReadersWriters.ReadersWriters import ReadersWriters\n",
    "from Stats.PreProcess import PreProcess\n",
    "from Stats.FeatureSelection import FeatureSelection\n",
    "from Stats.TrainingMethod import TrainingMethod\n",
    "from Stats.Plots import Plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "# Check the interpreter\n",
    "print(\"\\nMake sure the correct Python interpreter is used!\")\n",
    "print(sys.version)\n",
    "print(\"\\nMake sure sys.path of the Python interpreter is correct!\")\n",
    "print(os.getcwd())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Import the Tensorflow libraries &amp; check the version &amp local devices"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Tensorflow\n",
    "import tensorflow as tf \n",
    "import tempfile\n",
    "from tensorflow.python.client import device_lib \n",
    "\n",
    "print(tf.__version__)\n",
    "print(device_lib.list_local_devices())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 1.1.  Initialise General Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "config_path = os.path.abspath(\"ConfigInputs/CONFIGURATIONS.ini\")\n",
    "io_path = os.path.abspath(\"../../tmp/TCARER/Basic_prototype\")\n",
    "app_name = \"T-CARER\"\n",
    "submodel_name = \"hesIp\"\n",
    "submodel_input_name = \"tcarer_model_features_ip\"\n",
    "\n",
    "print(\"\\n The full path of the configuration file: \\n\\t\", config_path,\n",
    "      \"\\n The full path of the output folder: \\n\\t\", io_path,\n",
    "      \"\\n The application name (the suffix of the outputs file name): \\n\\t\", app_name,\n",
    "      \"\\n The sub-model name, to locate the related feature configuration: \\n\\t\", submodel_name,\n",
    "      \"\\n The the sub-model's the file name of the input: \\n\\t\", submodel_input_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Initialise the logs\n",
    "if not os.path.exists(io_path):\n",
    "    os.makedirs(io_path, exist_ok=True)\n",
    "\n",
    "logger = Logger(path=io_path, app_name=app_name, ext=\"log\")\n",
    "logger = logging.getLogger(app_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Initialise constants          \n",
    "CONSTANTS.set(io_path, app_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Initialise other classes\n",
    "readers_writers = ReadersWriters()\n",
    "preprocess = PreProcess(CONSTANTS.io_path)\n",
    "feature_selection = FeatureSelection()\n",
    "plots = Plots()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Set print settings\n",
    "pd.set_option('display.width', 1600, 'display.max_colwidth', 800)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 1.2.  Initialise Features Metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Read features metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# variables settings\n",
    "features_metadata = dict()\n",
    "\n",
    "features_metadata_all = readers_writers.load_csv(path=CONSTANTS.io_path, title=CONSTANTS.config_features_path, dataframing=True)\n",
    "features_metadata = features_metadata_all.loc[(features_metadata_all[\"Selected\"] == 1) & \n",
    "                                              (features_metadata_all[\"Table_Reference_Name\"] == submodel_name)]\n",
    "features_metadata.reset_index()\n",
    "    \n",
    "# print\n",
    "display(features_metadata)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Set features' metadata dictionaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Dictionary of features types, dtypes, & max-states\n",
    "features_types = dict()\n",
    "features_dtypes = dict()\n",
    "features_states_values = dict()\n",
    "features_names_group = dict()\n",
    "\n",
    "for _, row in features_metadata.iterrows():\n",
    "    if not pd.isnull(row[\"Variable_Max_States\"]):\n",
    "        states_values = str(row[\"Variable_Max_States\"]).split(',') \n",
    "        states_values = list(map(int, states_values))\n",
    "    else: \n",
    "        states_values = None\n",
    "        \n",
    "    if not pd.isnull(row[\"Variable_Aggregation\"]):\n",
    "        postfixes = row[\"Variable_Aggregation\"].replace(' ', '').split(',')\n",
    "        f_types = row[\"Variable_Type\"].replace(' ', '').split(',')\n",
    "        f_dtypes = row[\"Variable_dType\"].replace(' ', '').split(',')\n",
    "        for p in range(len(postfixes)):\n",
    "            features_types[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = f_types[p]\n",
    "            features_dtypes[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = pd.Series(dtype=f_dtypes[p])\n",
    "            features_states_values[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = states_values\n",
    "            features_names_group[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = row[\"Variable_Name\"] + \"_\" + postfixes[p]\n",
    "    else:\n",
    "        features_types[row[\"Variable_Name\"]] = row[\"Variable_Type\"]\n",
    "        features_dtypes[row[\"Variable_Name\"]] = row[\"Variable_dType\"]\n",
    "        features_states_values[row[\"Variable_Name\"]] = states_values\n",
    "        features_names_group[row[\"Variable_Name\"]] = row[\"Variable_Name\"]\n",
    "        if states_values is not None:\n",
    "            for postfix in states_values:\n",
    "                features_names_group[row[\"Variable_Name\"] + \"_\" + str(postfix)] = row[\"Variable_Name\"]\n",
    "            \n",
    "features_dtypes = pd.DataFrame(features_dtypes).dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Dictionary of features groups\n",
    "features_types_group = OrderedDict()\n",
    "\n",
    "f_types = set([f_type for f_type in features_types.values()])\n",
    "features_types_group = OrderedDict(zip(list(f_types), [set() for _ in range(len(f_types))]))\n",
    "for f_name, f_type in features_types.items():\n",
    "    features_types_group[f_type].add(f_name)\n",
    "    \n",
    "print(\"Features types: \" + ','.join(f_types))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "source": [
    "## <font style=\"font-weight:bold;color:red\"> Prerequisites:</font>\n",
    "- Run this notebook after the feature selection (Section 7) of the TCARER_Basic Notebook!\n",
    "- Make sure the following files are present in the input folder:\n",
    "    - Step\\_05\\_Features.bz2\n",
    "    - Step\\_07\\_Top\\_Features\\_..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 1.3. Load the Top Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<font style=\"font-weight:bold;color:red\">Configure</font>: the selected features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Load the top features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "file_name = \"Step_07_Top_Features_rfc_adhoc\" \n",
    "\n",
    "features_names_selected = readers_writers.load_csv(path=CONSTANTS.io_path, title=file_name, dataframing=False)[0]\n",
    "features_names_selected = [f.replace(\"\\n\", \"\") for f in features_names_selected]\n",
    "display(pd.DataFrame(features_names_selected))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Exclude the encoded categorical & include the raw categorical features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "excludes = set([f for f in features_names_selected for f_cat in features_types_group[\"CATEGORICAL\"] if f.startswith(f_cat)])\n",
    "features_names_selected_raw = [f for f in features_names_selected if f not in excludes]\n",
    "features_names_selected_raw = list(features_types_group[\"CATEGORICAL\"]) + features_names_selected_raw\n",
    "\n",
    "print(\"Exclude encoded categorical: \", excludes)\n",
    "print(\"Include raw categorical: \", features_types_group[\"CATEGORICAL\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Select the top N features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "top_n_features = 300\n",
    "\n",
    "features_names_selected_raw = features_names_selected_raw[0:top_n_features]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## 1.4. Initialise Model Setting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<font style=\"font-weight:bold;color:red\">Configure</font>: the model files"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# select the target variable\n",
    "target_feature = \"label30\" # \"label30\", \"label365\"\n",
    "rank_models = [\"rfc\"] # [\"rfc\", \"gbrt\", \"randLogit\"]\n",
    "\n",
    "features_headers = [target_feature] + features_names_selected_raw\n",
    "train_file_names = [\"tensorflow_feature_train\"]\n",
    "test_file_names = [\"tensorflow_feature_test\"]\n",
    "\n",
    "train_file_names_full = [os.path.join(CONSTANTS.io_path, name + \".csv\") for name in train_file_names]\n",
    "test_file_names_full = [os.path.join(CONSTANTS.io_path, name + \".csv\") for name in test_file_names]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## 2. Read Data & Store CSV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Read"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "file_name = \"Step_05_Features\"\n",
    "features = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)\n",
    "  \n",
    "print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, file_name + \".bz2\")).st_size)\n",
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Visual verification"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Select features and save to CSV"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# save train sample\n",
    "readers_writers.save_csv(\n",
    "    data=pd.concat([features[\"train_target\"].loc[:, [target_feature]], \n",
    "                    features[\"train_indep\"].loc[:, features_names_selected_raw]], axis=1), \n",
    "    path=CONSTANTS.io_path, title=train_file_names[0], append=False)\n",
    "print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, train_file_names[0] + \".csv\")).st_size)\n",
    "\n",
    "# save test sample\n",
    "readers_writers.save_csv(\n",
    "    data=pd.concat([features[\"test_target\"].loc[:, [target_feature]], \n",
    "                    features[\"test_indep\"].loc[:, features_names_selected_raw]], axis=1), \n",
    "    path=CONSTANTS.io_path, title=test_file_names[0], append=False)\n",
    "print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, test_file_names[0] + \".csv\")).st_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<font style=\"font-weight:bold;color:red\">Clean-up</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "features = None\n",
    "gc.collect()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## 3. Set TensorFlow Settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<font style=\"font-weight:bold;color:red\">Configure</font>: the Deep Neural Network nodes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 3.1. Prepare Features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Update features by type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# update features\n",
    "names = [i for i in features_types_group[\"CATEGORICAL\"]]\n",
    "for name in names:\n",
    "    if name not in features_names_selected_raw :\n",
    "        features_types_group[\"CATEGORICAL\"].remove(name)\n",
    "\n",
    "names = [i for i in features_types_group[\"CONTINUOUS\"]]\n",
    "for name in names:\n",
    "    if name not in features_names_selected_raw :\n",
    "        features_types_group[\"CONTINUOUS\"].remove(name)\n",
    "        \n",
    "print(\"Categorical Features: \", features_types_group[\"CATEGORICAL\"]) \n",
    "print(\"Continuous Features: \", features_types_group[\"CONTINUOUS\"])        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Additional variables to convert to discrete"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "names = [i for i in features_types_group[\"CONTINUOUS\"]]\n",
    "features_types_group[\"CATEGORICAL_EXTRA\"] = list()\n",
    "\n",
    "# convet gapDay_..., & epidur_... variables\n",
    "# states = [0, 3, 7, 14, 30, 60]\n",
    "for name in names:\n",
    "    if name[0:7] == \"gapDays_\" or name[0:7] == \"epidur_\":\n",
    "        features_types_group[\"CONTINUOUS\"].remove(name)\n",
    "        features_types_group[\"CATEGORICAL\"].add(name)\n",
    "        features_types_group[\"CATEGORICAL_EXTRA\"].append(name)\n",
    "        features_states_values[name] = [0, 3, 7, 14, 30, 60]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 3.2. Define Base Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "feature_columns = dict()\n",
    "\n",
    "# Categorical base columns.\n",
    "for name in features_types_group[\"CATEGORICAL\"]:\n",
    "    feature_columns[name] = tf.contrib.layers.sparse_column_with_hash_bucket(\n",
    "        name, hash_bucket_size=len(features_states_values[name]), combiner=\"sqrtn\")\n",
    "\n",
    "# Continuous base columns.\n",
    "for name in features_types_group[\"CONTINUOUS\"]:\n",
    "    if features_states_values[name] is not None:\n",
    "        feature_columns[name] = tf.contrib.layers.real_valued_column(name)\n",
    "        feature_columns[name] = tf.contrib.layers.bucketized_column(\n",
    "            feature_columns[name], [int(i) for i in features_states_values[name]])\n",
    "    else:\n",
    "        feature_columns[name] = tf.contrib.layers.real_valued_column(name)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 3.3. The Wide Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "wide_columns = [feature_columns[name] for name in features_types_group[\"CATEGORICAL\"]]\n",
    "\n",
    "wide_columns = wide_columns + \\\n",
    "    [feature_columns[name] for name in features_types_group[\"CONTINUOUS\"] if features_states_values[name] is not None]\n",
    "    \n",
    "wide_columns = wide_columns + [\n",
    "    tf.contrib.layers.crossed_column([feature_columns['ethnos'], feature_columns['gender']], \n",
    "                                     combiner=\"sqrtn\", hash_bucket_size=int(2)),\n",
    "    tf.contrib.layers.crossed_column([feature_columns['imd04rk'], feature_columns['ethnos']], \n",
    "                                     combiner=\"sqrtn\", hash_bucket_size=int(4)),\n",
    "    tf.contrib.layers.crossed_column([feature_columns['imd04rk'], feature_columns['ageTrigger']], \n",
    "                                     combiner=\"sqrtn\", hash_bucket_size=int(10))]\n",
    "\n",
    "# for name in features_types_group[\"CATEGORICAL_EXTRA\"]:\n",
    "#    wide_columns = wide_columns + [\n",
    "#        tf.contrib.layers.crossed_column([feature_columns['ageTrigger'], feature_columns[name]], \n",
    "#                                     combiner=\"sqrtn\", hash_bucket_size=int(6e3))]\n",
    "\n",
    "print(wide_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 3.4. The Deep Model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(features_types_group[\"CATEGORICAL\"])\n",
    "print([name for name in features_types_group[\"CONTINUOUS\"] if features_states_values[name] is not None])\n",
    "print([name for name in features_types_group[\"CONTINUOUS\"] if features_states_values[name] is None])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "deep_columns = [feature_columns[name] for name in features_types_group[\"CONTINUOUS\"]]\n",
    "    \n",
    "deep_columns = deep_columns + \\\n",
    "    [tf.contrib.layers.embedding_column(feature_columns[\"gender\"], dimension=2),\n",
    "     tf.contrib.layers.embedding_column(feature_columns[\"ethnos\"], dimension=3),\n",
    "    tf.contrib.layers.embedding_column(feature_columns[\"imd04rk\"], dimension=5),\n",
    "    tf.contrib.layers.embedding_column(feature_columns[\"ageTrigger\"], dimension=5)]\n",
    "    \n",
    "for name in features_types_group[\"CATEGORICAL_EXTRA\"]:\n",
    "    deep_columns = deep_columns + \\\n",
    "        [tf.contrib.layers.embedding_column(feature_columns[name], dimension=3)]\n",
    "        \n",
    "print(deep_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Set the lists of continous and discrete function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "continuous_features = list(features_types_group[\"CONTINUOUS\"])\n",
    "discrete_features = list(features_types_group[\"CATEGORICAL\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "## 4. Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<font style=\"font-weight:bold;color:brown\"> Restore model if it was interupated</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# model_dir = \"/tmp/tmpn5lud12q\"\n",
    "# train_steps = 3518"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "# Restore variables from disk.\n",
    "# saver = tf.train.Saver()\n",
    "# sess = tf.Session()\n",
    "# saver.restore(sess, model_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 4.1.  Initialise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Configure the size and batches of the Deep Neural Network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train_batch_size = 2000\n",
    "train_steps = 500 # 40000\n",
    "train_num_epochs = None\n",
    "train_randomize_input = True\n",
    "\n",
    "test_batch_size = 2000\n",
    "test_steps = 500 # 300\n",
    "test_num_epochs = None\n",
    "test_randomize_input = False\n",
    "\n",
    "monitor_batch_size = 2000\n",
    "monitor_steps = 200\n",
    "monitor_num_epochs = None\n",
    "monitor_randomize_input = False\n",
    "\n",
    "dnn_hidden_units = [24000, 12000, 6000] # [20000, 16000, 10000, 8000, 7000, 6000, 5000, 4000] # [28000, 14000, 7000]  # [24000, 12000, 6000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Initialise the perfromance statistics output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "summaries = dict()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Set the output directory of the Tensorflow model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "model_dir = tempfile.mkdtemp()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Combining Wide and Deep Models into One"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "config = tf.ConfigProto(allow_soft_placement=True)\n",
    "\n",
    "model_dnn = tf.contrib.learn.DNNLinearCombinedClassifier(\n",
    "                model_dir=model_dir,\n",
    "                linear_feature_columns=wide_columns,\n",
    "                config=None, # tf.contrib.learn.RunConfig(save_checkpoints_secs=600)),\n",
    "                dnn_feature_columns=deep_columns,\n",
    "                dnn_hidden_units=dnn_hidden_units,\n",
    "                dnn_optimizer=None, # tf.train.AdagradOptimizer(...)\n",
    "                linear_optimizer=None, # tf.train.FtrlOptimizer(...)\n",
    "                dnn_activation_fn=tf.nn.relu,\n",
    "                enable_centered_bias=False\n",
    "                #, gradient_clip_norm=1 # helper functions that let you apply L2 norms (tf.clip_by_global_norm)\n",
    "                )\n",
    "\n",
    "print(model_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Set the validation monitor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "validation_mointor = tf.contrib.learn.monitors.ValidationMonitor(\n",
    "    input_fn=lambda: input_fn(test_file_names_full, monitor_batch_size, \n",
    "                              monitor_num_epochs, monitor_randomize_input),\n",
    "    every_n_steps=monitor_steps)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Define a function for reading a sample batch by batch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def read_csv_batches(file_names, batch_size, features_headers, num_epochs, randomize_input):\n",
    "\n",
    "    def parse_fn(record):\n",
    "        record_defaults = [tf.constant([''], dtype=tf.string)] * len(features_headers)\n",
    "        return tf.decode_csv(record, record_defaults)\n",
    "\n",
    "    df = tf.contrib.learn.read_batch_examples(\n",
    "        file_names,\n",
    "        batch_size=batch_size,\n",
    "        reader=tf.TextLineReader,\n",
    "        parse_fn=parse_fn,\n",
    "        num_epochs=num_epochs,\n",
    "        randomize_input=randomize_input)\n",
    "\n",
    "    # Important: convert examples to dict for ease of use in `input_fn`\n",
    "    # Map each header to its respective column (FEATURE_HEADERS order matters!\n",
    "    df_dict = {}\n",
    "    for i, header in enumerate(features_headers):\n",
    "        df_dict[header] = df[:, i]\n",
    "\n",
    "    return df_dict"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Represent the input data as the fundamental unit of TensorFlow computations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def input_fn(file_names, batch_size, num_epochs, randomize_input):\n",
    "    df_dict = read_csv_batches(file_names, batch_size, features_headers, num_epochs, randomize_input)\n",
    "\n",
    "    with tf.Session(config=config) as sess:\n",
    "        # Creates a dictionary mapping from each continuous feature column name (k) to\n",
    "        # the values of that column stored in a constant Tensor.\n",
    "        continuous_cols = {k: tf.string_to_number(df_dict[k], out_type=tf.float32)\n",
    "                           for k in continuous_features}\n",
    "\n",
    "        # Creates a dictionary mapping from each categorical feature column name (k)\n",
    "        # to the values of that column stored in a tf.SparseTensor.\n",
    "        categorical_cols = {\n",
    "            k: tf.SparseTensor(\n",
    "                indices=[[i, 0] for i in range(int(df_dict[k].get_shape()[0]))],\n",
    "                values=df_dict[k],\n",
    "                dense_shape=[int(df_dict[k].get_shape()[0]), 1]) \n",
    "            for k in discrete_features}\n",
    "\n",
    "        # Merges the two dictionaries into one.\n",
    "        feature_cols = {**continuous_cols, **categorical_cols}\n",
    "\n",
    "        # Converts the label column into a constant Tensor.\n",
    "        label = tf.string_to_number(df_dict[target_feature], out_type=tf.int32)\n",
    "   \n",
    "    # Returns the feature columns and the label.\n",
    "    return feature_cols, label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 4.2.  Fit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Train the Deep Neural Network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "train_randomize_input"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "model_dnn.fit(input_fn=lambda: input_fn(train_file_names_full, train_batch_size, \n",
    "                                        train_num_epochs, train_randomize_input), \n",
    "              steps=train_steps) # , monitors=[validation_mointor]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Save the output summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "summaries[\"fit\"] = dict()    \n",
    "summaries[\"fit\"][\"get_variable_names\"] = str(model_dnn.get_variable_names)\n",
    "summaries[\"fit\"][\"get_variable_value\"] = str(model_dnn.get_variable_value)\n",
    "summaries[\"fit\"][\"get_params\"] = str(model_dnn.get_params)\n",
    "summaries[\"fit\"][\"export\"] = str(model_dnn.export)\n",
    "summaries[\"fit\"][\"get_variable_names()\"] = model_dnn.get_variable_names()\n",
    "summaries[\"fit\"][\"params\"] = str(model_dnn.params)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 4.3.  Predict - Train Sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Test the Deep Neural Network, using the train sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "results = model_dnn.evaluate(input_fn=lambda: input_fn(train_file_names_full, test_batch_size, \n",
    "                                                       test_num_epochs, test_randomize_input), \n",
    "                             steps=test_steps)\n",
    "for key in sorted(results):\n",
    "    print(\"%s: %s\" % (key, results[key]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Save the output summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "summaries[\"train\"] = dict()\n",
    "summaries[\"train\"][\"results\"] = results \n",
    "summaries[\"train\"][\"predict_proba\"] = model_dnn.predict_proba(\n",
    "    input_fn=lambda: input_fn(train_file_names_full, test_batch_size, \n",
    "                              test_num_epochs, test_randomize_input))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 4.4.  Predict - Test Sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Test the Deep Neural Network, using the test sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "results = model_dnn.evaluate(input_fn=lambda: input_fn(test_file_names_full, test_batch_size, \n",
    "                                                       test_num_epochs, test_randomize_input), \n",
    "                             steps=test_steps)\n",
    "for key in sorted(results):\n",
    "    print(\"%s: %s\" % (key, results[key]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Save the output summaries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "summaries[\"test\"] = dict()\n",
    "summaries[\"test\"][\"results\"] = results \n",
    "summaries[\"test\"][\"predict_proba\"] = model_dnn.predict_proba(\n",
    "    input_fn=lambda: input_fn(test_file_names_full, test_batch_size, \n",
    "                              test_num_epochs, test_randomize_input))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "### 4.5.  Save"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Save the output summaries, including the predicted probabilities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "#### 4.5.1. Stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "def generator_to_list(generator, max_size):\n",
    "    j = 0\n",
    "    temp = [None] * max_size\n",
    "\n",
    "    for value in generator:\n",
    "        temp[j] = value\n",
    "        j += 1\n",
    "        if j >= max_size:\n",
    "            break\n",
    "\n",
    "    return temp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "summaries[\"train\"][\"predict_proba\"] = \\\n",
    "    generator_to_list(summaries[\"train\"][\"predict_proba\"], test_batch_size  * test_steps)\n",
    "summaries[\"test\"][\"predict_proba\"] = \\\n",
    "    generator_to_list(summaries[\"test\"][\"predict_proba\"], test_batch_size  * test_steps)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "file_name = \"model_tensorflow_summaries_\" + target_feature\n",
    "readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=summaries)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Test the saved file for corruption"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "deletable": true,
    "editable": true
   },
   "outputs": [],
   "source": [
    "print(\"The model temp. directory to back up:\")\n",
    "print(model_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "Fin!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}