Diff of /TCARER_Basic.ipynb [000000] .. [b4a150]

a b/TCARER_Basic.ipynb
1
{
2
 "cells": [
3
  {
4
   "cell_type": "markdown",
5
   "metadata": {
6
    "deletable": true,
7
    "editable": true
8
   },
9
   "source": [
10
    "# Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER)\n",
11
    "## <font style=\"font-weight:bold;color:gray\">Basic Models</font>"
12
   ]
13
  },
14
  {
15
   "cell_type": "markdown",
16
   "metadata": {
17
    "deletable": true,
18
    "editable": true
19
   },
20
   "source": [
21
    "[1. Initialise](#1.-Initialise)\n",
22
    "<br\\>\n",
23
    "[2. Generate Features](#2.-Generate-Features)\n",
24
    "<br\\>\n",
25
    "[3. Read Data](#3.-Read-Data)\n",
26
    "<br\\>\n",
27
    "[4. Filter Features](#4.-Filter-Features)\n",
28
    "<br\\>\n",
29
    "[5. Set Samples &amp; Target Features](#5.-Set-Samples-&amp;-Target-Features)\n",
30
    "<br\\>\n",
31
    "[6. Recategorise &amp; Transform](#6.-Recategorise-&amp;-Transform)\n",
32
    "<br\\>\n",
33
    "[7. Rank &amp; Select Features](#7.-Rank-&amp;-Select-Features)\n",
34
    "<br\\>\n",
35
    "[8. Model](#8.-Model)\n",
36
    "<br\\>"
37
   ]
38
  },
39
  {
40
   "cell_type": "markdown",
41
   "metadata": {
42
    "deletable": true,
43
    "editable": true
44
   },
45
   "source": [
46
    "This Jupyter IPython Notebook applies the Temporal-Comorbidity Adjusted Risk of Emergency Readmission (TCARER).\n",
47
    "\n",
48
    "This Jupyter IPython Notebook extract aggregated features from the MySQL database, &amp; then pre-process, configure &amp; apply several modelling approaches. \n",
49
    "\n",
50
    "The pre-processing framework &amp; modelling algorithms in this Notebook are developed as part of the Integrated Care project at the <a href=\"http://www.healthcareanalytics.co.uk/\">Health &amp; Social Care Modelling Group (HSCMG)</a>, The <a href=\"http://www.westminster.ac.uk\">University of Westminster</a>.\n",
51
    "\n",
52
    "Note that some of the scripts are optional or subject to some pre-configurations. Please refer to the comments &amp; the project documentations for further details."
53
   ]
54
  },
55
  {
56
   "cell_type": "markdown",
57
   "metadata": {
58
    "deletable": true,
59
    "editable": true
60
   },
61
   "source": [
62
    "<hr\\>\n",
63
    "<font size=\"1\" color=\"gray\">Copyright 2017 The Project Authors. All Rights Reserved.\n",
64
    "\n",
65
    "It is licensed under the Apache License, Version 2.0. you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
66
    "\n",
67
    "  <a href=\"http://www.apache.org/licenses/LICENSE-2.0\">http://www.apache.org/licenses/LICENSE-2.0</a>\n",
68
    "\n",
69
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.</font>\n",
70
    "<hr\\>"
71
   ]
72
  },
73
  {
74
   "cell_type": "markdown",
75
   "metadata": {
76
    "deletable": true,
77
    "editable": true
78
   },
79
   "source": [
80
    "## 1. Initialise"
81
   ]
82
  },
83
  {
84
   "cell_type": "markdown",
85
   "metadata": {
86
    "deletable": true,
87
    "editable": true
88
   },
89
   "source": [
90
    "Reload modules"
91
   ]
92
  },
93
  {
94
   "cell_type": "code",
95
   "execution_count": null,
96
   "metadata": {
97
    "collapsed": true,
98
    "deletable": true,
99
    "editable": true,
100
    "scrolled": true
101
   },
102
   "outputs": [],
103
   "source": [
104
    "# Reload modules \n",
105
    "# It is an optional step. It is useful to run when external Python modules are being modified\n",
106
    "# It is reloading all modules (except those excluded by %aimport) every time before executing the Python code typed.\n",
107
    "# Note: It may conflict with serialisation, when external modules are being modified\n",
108
    "\n",
109
    "# %load_ext autoreload \n",
110
    "# %autoreload 2"
111
   ]
112
  },
113
  {
114
   "cell_type": "markdown",
115
   "metadata": {
116
    "deletable": true,
117
    "editable": true
118
   },
119
   "source": [
120
    "Import libraries"
121
   ]
122
  },
123
  {
124
   "cell_type": "code",
125
   "execution_count": null,
126
   "metadata": {
127
    "collapsed": true,
128
    "deletable": true,
129
    "editable": true
130
   },
131
   "outputs": [],
132
   "source": [
133
    "# Import Python libraries\n",
134
    "import logging\n",
135
    "import os\n",
136
    "import sys\n",
137
    "import gc\n",
138
    "import pandas as pd\n",
139
    "from IPython.display import display, HTML\n",
140
    "from collections import OrderedDict\n",
141
    "import numpy as np\n",
142
    "import statistics\n",
143
    "from scipy.stats import stats"
144
   ]
145
  },
146
  {
147
   "cell_type": "code",
148
   "execution_count": null,
149
   "metadata": {
150
    "collapsed": false,
151
    "deletable": true,
152
    "editable": true
153
   },
154
   "outputs": [],
155
   "source": [
156
    "# Import local Python modules\n",
157
    "from Configs.CONSTANTS import CONSTANTS\n",
158
    "from Configs.Logger import Logger\n",
159
    "from Features.Variables import Variables\n",
160
    "from ReadersWriters.ReadersWriters import ReadersWriters\n",
161
    "from Stats.PreProcess import PreProcess\n",
162
    "from Stats.FeatureSelection import FeatureSelection\n",
163
    "from Stats.TrainingMethod import TrainingMethod\n",
164
    "from Stats.Plots import Plots"
165
   ]
166
  },
167
  {
168
   "cell_type": "code",
169
   "execution_count": null,
170
   "metadata": {
171
    "collapsed": false,
172
    "deletable": true,
173
    "editable": true,
174
    "scrolled": true
175
   },
176
   "outputs": [],
177
   "source": [
178
    "# Check the interpreter\n",
179
    "print(\"\\nMake sure the correct Python interpreter is used!\")\n",
180
    "print(sys.version)\n",
181
    "print(\"\\nMake sure sys.path of the Python interpreter is correct!\")\n",
182
    "print(os.getcwd())"
183
   ]
184
  },
185
  {
186
   "cell_type": "markdown",
187
   "metadata": {
188
    "deletable": true,
189
    "editable": true
190
   },
191
   "source": [
192
    "<br/><br/>"
193
   ]
194
  },
195
  {
196
   "cell_type": "markdown",
197
   "metadata": {
198
    "deletable": true,
199
    "editable": true
200
   },
201
   "source": [
202
    "### 1.1.  Initialise General Settings"
203
   ]
204
  },
205
  {
206
   "cell_type": "markdown",
207
   "metadata": {
208
    "deletable": true,
209
    "editable": true
210
   },
211
   "source": [
212
    "<font style=\"font-weight:bold;color:red\">Main configuration Settings: </font>\n",
213
    "- Specify the full path of the configuration file \n",
214
    "<br/>&#9; &#8594; config_path\n",
215
    "- Specify the full path of the output folder \n",
216
    "<br/>&#9; &#8594; io_path\n",
217
    "- Specify the application name (the suffix of the outputs file name) \n",
218
    "<br/>&#9; &#8594; app_name\n",
219
    "- Specify the sub-model name, to locate the related feature configuration, based on the \"Table_Reference_Name\" column in the configuration file\n",
220
    "<br/>&#9; &#8594; submodel_name\n",
221
    "- Specify the sub-model's the file name of the input (excluding the CSV extension)\n",
222
    "<br/>&#9; &#8594; submodel_input_name\n",
223
    "<br/>\n",
224
    "<br/>\n",
225
    "\n",
226
    "<font style=\"font-weight:bold;color:red\">External Configration Files: </font>\n",
227
    "- The MySQL database configuration setting &amp; other configration metadata\n",
228
    "<br/>&#9; &#8594; <i>Inputs/CONFIGURATIONS_1.ini</i>\n",
229
    "- The input features' confugration file (Note: only the CSV export of the XLSX will be used by this Notebook)\n",
230
    "<br/>&#9; &#8594; <i>Inputs/config_features_path.xlsx</i>\n",
231
    "<br/>&#9; &#8594; <i>Inputs/config_features_path.csv</i>"
232
   ]
233
  },
234
  {
235
   "cell_type": "code",
236
   "execution_count": null,
237
   "metadata": {
238
    "collapsed": false,
239
    "deletable": true,
240
    "editable": true
241
   },
242
   "outputs": [],
243
   "source": [
244
    "config_path = os.path.abspath(\"ConfigInputs/CONFIGURATIONS.ini\")\n",
245
    "io_path = os.path.abspath(\"../../tmp/TCARER/Basic_prototype\")\n",
246
    "app_name = \"T-CARER\"\n",
247
    "submodel_name = \"hesIp\"\n",
248
    "submodel_input_name = \"tcarer_model_features_ip\"\n",
249
    "\n",
250
    "print(\"\\n The full path of the configuration file: \\n\\t\", config_path,\n",
251
    "      \"\\n The full path of the output folder: \\n\\t\", io_path,\n",
252
    "      \"\\n The application name (the suffix of the outputs file name): \\n\\t\", app_name,\n",
253
    "      \"\\n The sub-model name, to locate the related feature configuration: \\n\\t\", submodel_name,\n",
254
    "      \"\\n The the sub-model's the file name of the input: \\n\\t\", submodel_input_name)"
255
   ]
256
  },
257
  {
258
   "cell_type": "markdown",
259
   "metadata": {
260
    "deletable": true,
261
    "editable": true
262
   },
263
   "source": [
264
    "<br/><br/>"
265
   ]
266
  },
267
  {
268
   "cell_type": "markdown",
269
   "metadata": {
270
    "deletable": true,
271
    "editable": true
272
   },
273
   "source": [
274
    "Initialise logs"
275
   ]
276
  },
277
  {
278
   "cell_type": "code",
279
   "execution_count": null,
280
   "metadata": {
281
    "collapsed": false,
282
    "deletable": true,
283
    "editable": true
284
   },
285
   "outputs": [],
286
   "source": [
287
    "if not os.path.exists(io_path):\n",
288
    "    os.makedirs(io_path, exist_ok=True)\n",
289
    "\n",
290
    "logger = Logger(path=io_path, app_name=app_name, ext=\"log\")\n",
291
    "logger = logging.getLogger(app_name)"
292
   ]
293
  },
294
  {
295
   "cell_type": "markdown",
296
   "metadata": {
297
    "deletable": true,
298
    "editable": true
299
   },
300
   "source": [
301
    "Initialise constants and some of classes"
302
   ]
303
  },
304
  {
305
   "cell_type": "code",
306
   "execution_count": null,
307
   "metadata": {
308
    "collapsed": false,
309
    "deletable": true,
310
    "editable": true,
311
    "scrolled": true
312
   },
313
   "outputs": [],
314
   "source": [
315
    "# Initialise constants        \n",
316
    "CONSTANTS.set(io_path, app_name)"
317
   ]
318
  },
319
  {
320
   "cell_type": "code",
321
   "execution_count": null,
322
   "metadata": {
323
    "collapsed": true,
324
    "deletable": true,
325
    "editable": true
326
   },
327
   "outputs": [],
328
   "source": [
329
    "# Initialise other classes\n",
330
    "readers_writers = ReadersWriters()\n",
331
    "preprocess = PreProcess(io_path)\n",
332
    "feature_selection = FeatureSelection()\n",
333
    "plts = Plots()"
334
   ]
335
  },
336
  {
337
   "cell_type": "code",
338
   "execution_count": null,
339
   "metadata": {
340
    "collapsed": true,
341
    "deletable": true,
342
    "editable": true
343
   },
344
   "outputs": [],
345
   "source": [
346
    "# Set print settings\n",
347
    "pd.set_option('display.width', 1600, 'display.max_colwidth', 800)"
348
   ]
349
  },
350
  {
351
   "cell_type": "markdown",
352
   "metadata": {
353
    "deletable": true,
354
    "editable": true
355
   },
356
   "source": [
357
    "### 1.2.  Initialise Features Metadata"
358
   ]
359
  },
360
  {
361
   "cell_type": "markdown",
362
   "metadata": {
363
    "deletable": true,
364
    "editable": true
365
   },
366
   "source": [
367
    "Read the input features' confugration file &amp; store the features metadata"
368
   ]
369
  },
370
  {
371
   "cell_type": "code",
372
   "execution_count": null,
373
   "metadata": {
374
    "collapsed": false,
375
    "deletable": true,
376
    "editable": true,
377
    "scrolled": true
378
   },
379
   "outputs": [],
380
   "source": [
381
    "# variables settings\n",
382
    "features_metadata = dict()\n",
383
    "\n",
384
    "features_metadata_all = readers_writers.load_csv(path=CONSTANTS.io_path, title=CONSTANTS.config_features_path, dataframing=True)\n",
385
    "features_metadata = features_metadata_all.loc[(features_metadata_all[\"Selected\"] == 1) & \n",
386
    "                                              (features_metadata_all[\"Table_Reference_Name\"] == submodel_name)]\n",
387
    "features_metadata.reset_index()\n",
388
    "    \n",
389
    "# print\n",
390
    "display(features_metadata)"
391
   ]
392
  },
393
  {
394
   "cell_type": "markdown",
395
   "metadata": {
396
    "deletable": true,
397
    "editable": true
398
   },
399
   "source": [
400
    "Set input features' metadata dictionaries"
401
   ]
402
  },
403
  {
404
   "cell_type": "code",
405
   "execution_count": null,
406
   "metadata": {
407
    "collapsed": false,
408
    "deletable": true,
409
    "editable": true
410
   },
411
   "outputs": [],
412
   "source": [
413
    "# Dictionary of features types, dtypes, & max-states\n",
414
    "features_types = dict()\n",
415
    "features_dtypes = dict()\n",
416
    "features_states_values = dict()\n",
417
    "features_names_group = dict()\n",
418
    "\n",
419
    "for _, row in features_metadata.iterrows():\n",
420
    "    if not pd.isnull(row[\"Variable_Max_States\"]):\n",
421
    "        states_values = str(row[\"Variable_Max_States\"]).split(',') \n",
422
    "        states_values = list(map(int, states_values))\n",
423
    "    else: \n",
424
    "        states_values = None\n",
425
    "        \n",
426
    "    if not pd.isnull(row[\"Variable_Aggregation\"]):\n",
427
    "        postfixes = row[\"Variable_Aggregation\"].replace(' ', '').split(',')\n",
428
    "        f_types = row[\"Variable_Type\"].replace(' ', '').split(',')\n",
429
    "        f_dtypes = row[\"Variable_dType\"].replace(' ', '').split(',')\n",
430
    "        for p in range(len(postfixes)):\n",
431
    "            features_types[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = f_types[p]\n",
432
    "            features_dtypes[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = pd.Series(dtype=f_dtypes[p])\n",
433
    "            features_states_values[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = states_values\n",
434
    "            features_names_group[row[\"Variable_Name\"] + \"_\" + postfixes[p]] = row[\"Variable_Name\"] + \"_\" + postfixes[p]\n",
435
    "    else:\n",
436
    "        features_types[row[\"Variable_Name\"]] = row[\"Variable_Type\"]\n",
437
    "        features_dtypes[row[\"Variable_Name\"]] = row[\"Variable_dType\"]\n",
438
    "        features_states_values[row[\"Variable_Name\"]] = states_values\n",
439
    "        features_names_group[row[\"Variable_Name\"]] = row[\"Variable_Name\"]\n",
440
    "        if states_values is not None:\n",
441
    "            for postfix in states_values:\n",
442
    "                features_names_group[row[\"Variable_Name\"] + \"_\" + str(postfix)] = row[\"Variable_Name\"]\n",
443
    "            \n",
444
    "features_dtypes = pd.DataFrame(features_dtypes).dtypes"
445
   ]
446
  },
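As a rough illustration of the loop above, the following sketch expands two made-up configuration rows into per-feature names. The rows, their values, and the `expand_row` helper are assumptions for illustration only; just the column names (`Variable_Name`, `Variable_Type`, `Variable_dType`, `Variable_Aggregation`) come from the Notebook.

```python
# Hypothetical configuration rows (illustrative values only)
rows = [
    {"Variable_Name": "gender", "Variable_Type": "CATEGORICAL",
     "Variable_dType": "i4", "Variable_Aggregation": None},
    {"Variable_Name": "los", "Variable_Type": "CONTINUOUS, CONTINUOUS",
     "Variable_dType": "f4, f4", "Variable_Aggregation": "avg, max"},
]


def expand_row(row):
    """Return {feature_name: (type, dtype)} for one configuration row."""
    if row["Variable_Aggregation"] is None:
        # No aggregation: the feature keeps the variable's own name
        return {row["Variable_Name"]: (row["Variable_Type"], row["Variable_dType"])}
    # One feature per aggregation postfix, e.g. los_avg, los_max
    postfixes = row["Variable_Aggregation"].replace(" ", "").split(",")
    f_types = row["Variable_Type"].replace(" ", "").split(",")
    f_dtypes = row["Variable_dType"].replace(" ", "").split(",")
    return {row["Variable_Name"] + "_" + p: (t, d)
            for p, t, d in zip(postfixes, f_types, f_dtypes)}


expanded = {}
for row in rows:
    expanded.update(expand_row(row))

print(sorted(expanded))
```

Each aggregated variable therefore contributes one entry per postfix, with the type and dtype taken from the matching position in the comma-separated lists.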
447
  {
448
   "cell_type": "code",
449
   "execution_count": null,
450
   "metadata": {
451
    "collapsed": false,
452
    "deletable": true,
453
    "editable": true
454
   },
455
   "outputs": [],
456
   "source": [
457
    "# Dictionary of features groups\n",
458
    "features_types_group = OrderedDict()\n",
459
    "\n",
460
    "f_types = set([f_type for f_type in features_types.values()])\n",
461
    "features_types_group = OrderedDict(zip(list(f_types), [set() for _ in range(len(f_types))]))\n",
462
    "for f_name, f_type in features_types.items():\n",
463
    "    features_types_group[f_type].add(f_name)\n",
464
    "    \n",
465
    "print(\"Available features types: \" + ','.join(f_types))"
466
   ]
467
  },
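The grouping of feature names by type can also be sketched with a `defaultdict`, which avoids pre-building the empty sets. The feature names and types below are hypothetical, and this is an alternative formulation rather than the code used by the Notebook.

```python
from collections import defaultdict

# Hypothetical feature-to-type mapping (illustrative values only)
features_types = {
    "gender": "CATEGORICAL",
    "los_avg": "CONTINUOUS",
    "los_max": "CONTINUOUS",
    "label30": "TARGET",
}

# Collect the feature names under each type
features_types_group = defaultdict(set)
for f_name, f_type in features_types.items():
    features_types_group[f_type].add(f_name)

print(sorted(features_types_group))
```

Because the type names are drawn from a set in the original cell, the order of the groups is arbitrary in either version.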
468
  {
469
   "cell_type": "markdown",
470
   "metadata": {
471
    "deletable": true,
472
    "editable": true
473
   },
474
   "source": [
475
    "<br/><br/>"
476
   ]
477
  },
478
  {
479
   "cell_type": "markdown",
480
   "metadata": {
481
    "deletable": true,
482
    "editable": true
483
   },
484
   "source": [
485
    "## <font style=\"font-weight:bold;color:red\">2. Generate Features</font>"
486
   ]
487
  },
488
  {
489
   "cell_type": "markdown",
490
   "metadata": {
491
    "deletable": true,
492
    "editable": true
493
   },
494
   "source": [
495
    "<font style=\"font-weight:bold;color:red\">Notes:</font>\n",
496
    "- It generates the final spell-wise &amp; temporal features from the MySQL table(s), &amp; converts it into CSV(s);\n",
497
    "- It generates the CSV(s) based on the configuration file of the features (Note: only the CSV export of the XLSX will be used by this Notebook)\n",
498
    "<br/>&#9; &#8594; <i>Inputs/config_features_path.xlsx</i>\n",
499
    "<br/>&#9; &#8594; <i>Inputs/config_features_path.csv</i>"
500
   ]
501
  },
502
  {
503
   "cell_type": "code",
504
   "execution_count": null,
505
   "metadata": {
506
    "collapsed": false,
507
    "deletable": true,
508
    "editable": true
509
   },
510
   "outputs": [],
511
   "source": [
512
    "skip = True\n",
513
    "\n",
514
    "# settings\n",
515
    "csv_schema = [\"my_db_schema\"]\n",
516
    "csv_input_tables = [\"tcarer_features\"]\n",
517
    "csv_history_tables = [\"hesIp\"]\n",
518
    "csv_column_index = \"localID\"\n",
519
    "csv_output_table = \"tcarer_model_features_ip\"\n",
520
    "csv_query_batch_size =  100000"
521
   ]
522
  },
523
  {
524
   "cell_type": "code",
525
   "execution_count": null,
526
   "metadata": {
527
    "collapsed": false,
528
    "deletable": true,
529
    "editable": true
530
   },
531
   "outputs": [],
532
   "source": [
533
    "if skip is False:\n",
534
    "    # generate the csv file\n",
535
    "    variables = Variables(submodel_name,\n",
536
    "                          CONSTANTS.io_path,\n",
537
    "                          CONSTANTS.io_path,\n",
538
    "                          CONSTANTS.config_features_path,\n",
539
    "                          csv_output_table)\n",
540
    "    variables.set(csv_schema, csv_input_tables, csv_history_tables, csv_column_index, csv_query_batch_size)"
541
   ]
542
  },
543
  {
544
   "cell_type": "markdown",
545
   "metadata": {
546
    "deletable": true,
547
    "editable": true
548
   },
549
   "source": [
550
    "<br/><br/>"
551
   ]
552
  },
553
  {
554
   "cell_type": "markdown",
555
   "metadata": {
556
    "deletable": true,
557
    "editable": true
558
   },
559
   "source": [
560
    "## 3. Read Data"
561
   ]
562
  },
563
  {
564
   "cell_type": "markdown",
565
   "metadata": {
566
    "deletable": true,
567
    "editable": true
568
   },
569
   "source": [
570
    "Read the input features from the CSV input file"
571
   ]
572
  },
573
  {
574
   "cell_type": "code",
575
   "execution_count": null,
576
   "metadata": {
577
    "collapsed": false,
578
    "deletable": true,
579
    "editable": true
580
   },
581
   "outputs": [],
582
   "source": [
583
    "features_input = readers_writers.load_csv(path=CONSTANTS.io_path, title=submodel_input_name, dataframing=True)\n",
584
    "features_input.astype(dtype=features_dtypes)\n",
585
    "\n",
586
    "print(\"Number of columns: \", len(features_input.columns), \"; Total records: \", len(features_input.index))"
587
   ]
588
  },
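Note that `DataFrame.astype` returns a converted copy and leaves the original frame untouched, so its result has to be assigned for the conversion to take effect. A minimal sketch on toy data (the column names and dtypes are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for the CSV input (illustrative values only)
df = pd.DataFrame({"age": ["41", "67"], "los": ["2.5", "10.0"]})
dtypes = {"age": "int64", "los": "float64"}

df.astype(dtype=dtypes)  # result is discarded; df keeps its text columns
unchanged = {col: str(dtype) for col, dtype in df.dtypes.items()}

df = df.astype(dtype=dtypes)  # assign back to apply the conversion
converted = {col: str(dtype) for col, dtype in df.dtypes.items()}

print(unchanged, converted)
```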
589
  {
590
   "cell_type": "markdown",
591
   "metadata": {
592
    "deletable": true,
593
    "editable": true
594
   },
595
   "source": [
596
    "Verify features visually"
597
   ]
598
  },
599
  {
600
   "cell_type": "code",
601
   "execution_count": null,
602
   "metadata": {
603
    "collapsed": false,
604
    "deletable": true,
605
    "editable": true
606
   },
607
   "outputs": [],
608
   "source": [
609
    "display(features_input.head())"
610
   ]
611
  },
612
  {
613
   "cell_type": "markdown",
614
   "metadata": {
615
    "deletable": true,
616
    "editable": true
617
   },
618
   "source": [
619
    "<br/><br/>"
620
   ]
621
  },
622
  {
623
   "cell_type": "markdown",
624
   "metadata": {
625
    "collapsed": true,
626
    "deletable": true,
627
    "editable": true
628
   },
629
   "source": [
630
    "## 4. Filter Features"
631
   ]
632
  },
633
  {
634
   "cell_type": "markdown",
635
   "metadata": {
636
    "deletable": true,
637
    "editable": true
638
   },
639
   "source": [
640
    "### 4.1. Descriptive Statsistics"
641
   ]
642
  },
643
  {
644
   "cell_type": "markdown",
645
   "metadata": {
646
    "deletable": true,
647
    "editable": true
648
   },
649
   "source": [
650
    "Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
651
   ]
652
  },
653
  {
654
   "cell_type": "code",
655
   "execution_count": null,
656
   "metadata": {
657
    "collapsed": false,
658
    "deletable": true,
659
    "editable": true
660
   },
661
   "outputs": [],
662
   "source": [
663
    "file_name = \"Step_04_Data_ColumnNames\"\n",
664
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, data=list(features_input.columns.values), append=False)\n",
665
    "file_name = \"Step_04_Stats_Categorical\"\n",
666
    "o_stats = preprocess.stats_discrete_df(df=features_input, includes=features_types_group[\"CATEGORICAL\"],\n",
667
    "                                       file_name=file_name)\n",
668
    "file_name = \"Step_04_Stats_Continuous\"\n",
669
    "o_stats = preprocess.stats_continuous_df(df=features_input, includes=features_types_group[\"CONTINUOUS\"], \n",
670
    "                                         file_name=file_name)\n",
671
    "file_name = \"Step_04_Stats_Target\"\n",
672
    "o_stats = preprocess.stats_discrete_df(df=features_input, includes=features_types_group[\"TARGET\"], \n",
673
    "                                       file_name=file_name)"
674
   ]
675
  },
676
  {
677
   "cell_type": "markdown",
678
   "metadata": {
679
    "deletable": true,
680
    "editable": true
681
   },
682
   "source": [
683
    "### 4.2. Selected Population"
684
   ]
685
  },
686
  {
687
   "cell_type": "markdown",
688
   "metadata": {
689
    "deletable": true,
690
    "editable": true
691
   },
692
   "source": [
693
    "#### 4.2.1. Remove Excluded Population, Remove Unused Features"
694
   ]
695
  },
696
  {
697
   "cell_type": "markdown",
698
   "metadata": {
699
    "deletable": true,
700
    "editable": true
701
   },
702
   "source": [
703
    "<i>Nothing to do!<i/> \n",
704
    "<br/>\n",
705
    "<font style=\"font-weight:bold;color:red\">Notes: </font> \n",
706
    "- Ideally the features must be configured before generating the CSV feature file, as it is very inefficient to derive new features at this stage\n",
707
    "- This step is not necessary, if all the features are generated in prior to the generatiion of the CSV feature file"
708
   ]
709
  },
710
  {
711
   "cell_type": "code",
712
   "execution_count": null,
713
   "metadata": {
714
    "collapsed": true,
715
    "deletable": true,
716
    "editable": true
717
   },
718
   "outputs": [],
719
   "source": [
720
    "# Exclusion of unused features\n",
721
    "# excluded = [name for name in features_input.columns if name not in features_names_group.keys()]\n",
722
    "# features_input = features_input.drop(excluded, axis=1)\n",
723
    "\n",
724
    "# print(\"Number of columns: \", len(features_input.columns), \"; Total records: \", len(features_input.index))"
725
   ]
726
  },
727
  {
728
   "cell_type": "markdown",
729
   "metadata": {
730
    "deletable": true,
731
    "editable": true
732
   },
733
   "source": [
734
    "<br/><br/>"
735
   ]
736
  },
737
  {
738
   "cell_type": "markdown",
739
   "metadata": {
740
    "deletable": true,
741
    "editable": true
742
   },
743
   "source": [
744
    "## 5. Set Samples &amp; Target Features"
745
   ]
746
  },
747
  {
748
   "cell_type": "markdown",
749
   "metadata": {
750
    "collapsed": true,
751
    "deletable": true,
752
    "editable": true
753
   },
754
   "source": [
755
    "### 5.1. Set Features"
756
   ]
757
  },
758
  {
759
   "cell_type": "markdown",
760
   "metadata": {
761
    "deletable": true,
762
    "editable": true
763
   },
764
   "source": [
765
    "#### 5.1.1. Train & Test Samples"
766
   ]
767
  },
768
  {
769
   "cell_type": "markdown",
770
   "metadata": {
771
    "deletable": true,
772
    "editable": true
773
   },
774
   "source": [
775
    "Set the samples"
776
   ]
777
  },
778
  {
779
   "cell_type": "code",
780
   "execution_count": null,
781
   "metadata": {
782
    "collapsed": false,
783
    "deletable": true,
784
    "editable": true
785
   },
786
   "outputs": [],
787
   "source": [
788
    "frac_train = 0.50\n",
789
    "replace = False\n",
790
    "random_state = 100\n",
791
    "\n",
792
    "nrows = len(features_input.index)\n",
793
    "features = {\"train\": dict(), \"test\": dict()}\n",
794
    "features[\"train\"] = features_input.sample(frac=frac_train, replace=False, random_state=100)\n",
795
    "features[\"test\"] = features_input.drop(features[\"train\"].index)\n",
796
    "\n",
797
    "features[\"train\"] = features[\"train\"].reset_index(drop=True)\n",
798
    "features[\"test\"] = features[\"test\"].reset_index(drop=True)"
799
   ]
800
  },
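The split above can be checked on a toy frame: sampling without replacement and then dropping the sampled index labels yields two disjoint samples that together cover the input. The data below are made up for illustration, and the drop-by-index step relies on the input index being unique.

```python
import pandas as pd

# Toy data standing in for features_input (illustrative values only)
data = pd.DataFrame({"patientID": range(10), "x": range(10, 20)})

train = data.sample(frac=0.5, replace=False, random_state=100)
test = data.drop(train.index)  # every row not drawn for training

# The two samples should not share any record
overlap = set(train["patientID"]) & set(test["patientID"])
print(len(train), len(test), len(overlap))
```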
801
  {
802
   "cell_type": "markdown",
803
   "metadata": {
804
    "deletable": true,
805
    "editable": true
806
   },
807
   "source": [
808
    "Verify features visually"
809
   ]
810
  },
811
  {
812
   "cell_type": "code",
813
   "execution_count": null,
814
   "metadata": {
815
    "collapsed": false,
816
    "deletable": true,
817
    "editable": true
818
   },
819
   "outputs": [],
820
   "source": [
821
    "display(features_input.head())"
822
   ]
823
  },
824
  {
825
   "cell_type": "markdown",
826
   "metadata": {
827
    "deletable": true,
828
    "editable": true
829
   },
830
   "source": [
831
    "<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
832
   ]
833
  },
834
  {
835
   "cell_type": "code",
836
   "execution_count": null,
837
   "metadata": {
838
    "collapsed": false,
839
    "deletable": true,
840
    "editable": true
841
   },
842
   "outputs": [],
843
   "source": [
844
    "features_input = None\n",
845
    "gc.collect()"
846
   ]
847
  },
848
  {
849
   "cell_type": "markdown",
850
   "metadata": {
851
    "deletable": true,
852
    "editable": true
853
   },
854
   "source": [
855
    "#### 5.1.2. Independent & Target variable¶"
856
   ]
857
  },
858
  {
859
   "cell_type": "markdown",
860
   "metadata": {
861
    "deletable": true,
862
    "editable": true
863
   },
864
   "source": [
865
    "Set independent, target &amp; ID features"
866
   ]
867
  },
868
  {
869
   "cell_type": "code",
870
   "execution_count": null,
871
   "metadata": {
872
    "collapsed": true,
873
    "deletable": true,
874
    "editable": true
875
   },
876
   "outputs": [],
877
   "source": [
878
    "target_labels = list(features_types_group[\"TARGET\"])\n",
879
    "target_id = [\"patientID\"]"
880
   ]
881
  },
882
  {
883
   "cell_type": "code",
884
   "execution_count": null,
885
   "metadata": {
886
    "collapsed": true,
887
    "deletable": true,
888
    "editable": true
889
   },
890
   "outputs": [],
891
   "source": [
892
    "features[\"train_indep\"] = dict()\n",
893
    "features[\"train_target\"] = dict()\n",
894
    "features[\"train_id\"] = dict()\n",
895
    "features[\"test_indep\"] = dict()\n",
896
    "features[\"test_target\"] = dict()\n",
897
    "features[\"test_id\"] = dict()\n",
898
    "\n",
899
    "# Independent and target features\n",
900
    "def set_features_indep_target(df):\n",
901
    "    df_targets = pd.DataFrame(dict(zip(target_labels, [[]] * len(target_labels))))\n",
902
    "    for i in range(len(target_labels)):\n",
903
    "        df_targets[target_labels[i]] = df[target_labels[i]]\n",
904
    "        \n",
905
    "    df_indep = df.drop(target_labels + target_id, axis=1)\n",
906
    "    df_id = pd.DataFrame({target_id[0]: df[target_id[0]]})\n",
907
    "    \n",
908
    "    return df_indep, df_targets, df_id"
909
   ]
910
  },
911
  {
912
   "cell_type": "code",
913
   "execution_count": null,
914
   "metadata": {
915
    "collapsed": false,
916
    "deletable": true,
917
    "editable": true
918
   },
919
   "outputs": [],
920
   "source": [
921
    "# train & test sets\n",
922
    "features[\"train_indep\"], features[\"train_target\"], features[\"train_id\"] = set_features_indep_target(features[\"train\"])\n",
923
    "features[\"test_indep\"], features[\"test_target\"], features[\"test_id\"] = set_features_indep_target(features[\"test\"])\n",
924
    "\n",
925
    "# print    \n",
926
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
927
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
928
   ]
929
  },
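For reference, the same three-way split can be written with plain column selection. This is a sketch on toy data, not the Notebook's function: the column names `label30`, `age` and `los` and the helper `split_indep_target` are hypothetical.

```python
import pandas as pd

# Toy sample; the column names are hypothetical
sample = pd.DataFrame({
    "patientID": [1, 2, 3],
    "label30": [0, 1, 0],
    "age": [41, 67, 59],
    "los": [2.5, 10.0, 1.0],
})
target_labels = ["label30"]
target_id = ["patientID"]


def split_indep_target(df, target_labels, target_id):
    """Split a sample into independent, target and ID frames."""
    df_targets = df[target_labels].copy()
    df_id = df[target_id].copy()
    # Everything that is neither a target nor an ID is an independent feature
    df_indep = df.drop(columns=target_labels + target_id)
    return df_indep, df_targets, df_id


indep, targets, ids = split_indep_target(sample, target_labels, target_id)
print(list(indep.columns), list(targets.columns), list(ids.columns))
```

All three frames keep the row index of the input sample, so they can be concatenated side by side again for inspection.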
930
  {
931
   "cell_type": "markdown",
932
   "metadata": {
933
    "deletable": true,
934
    "editable": true
935
   },
936
   "source": [
937
    "Verify features visually"
938
   ]
939
  },
940
  {
941
   "cell_type": "code",
942
   "execution_count": null,
943
   "metadata": {
944
    "collapsed": false,
945
    "deletable": true,
946
    "editable": true,
947
    "scrolled": true
948
   },
949
   "outputs": [],
950
   "source": [
951
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
952
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
953
   ]
954
  },
955
  {
956
   "cell_type": "markdown",
957
   "metadata": {
958
    "deletable": true,
959
    "editable": true
960
   },
961
   "source": [
962
    "<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
963
   ]
964
  },
965
  {
966
   "cell_type": "code",
967
   "execution_count": null,
968
   "metadata": {
969
    "collapsed": false,
970
    "deletable": true,
971
    "editable": true
972
   },
973
   "outputs": [],
974
   "source": [
975
    "del features[\"train\"]\n",
976
    "del features[\"test\"]\n",
977
    "gc.collect()"
978
   ]
979
  },
980
  {
981
   "cell_type": "markdown",
982
   "metadata": {
983
    "deletable": true,
984
    "editable": true
985
   },
986
   "source": [
987
    "### 5.5. Save Samples"
988
   ]
989
  },
990
  {
991
   "cell_type": "markdown",
992
   "metadata": {
993
    "deletable": true,
994
    "editable": true
995
   },
996
   "source": [
997
    "Serialise &amp; save the samples before any feature transformation. \n",
998
    "<br/>This snapshot of the samples may be used for the population profiling"
999
   ]
1000
  },
1001
  {
1002
   "cell_type": "code",
1003
   "execution_count": null,
1004
   "metadata": {
1005
    "collapsed": false,
1006
    "deletable": true,
1007
    "editable": true,
1008
    "scrolled": true
1009
   },
1010
   "outputs": [],
1011
   "source": [
1012
    "file_name = \"Step_05_Features\"\n",
1013
    "readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=features)\n",
1014
    "\n",
1015
    "# print\n",
1016
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns), \n",
1017
    "      \"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1018
   ]
1019
  },
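The implementation of `save_serialised_compressed` is not shown in this Notebook. One common way to take such a snapshot, sketched here under that caveat, is to pickle the object into a gzip-compressed file; the helper names, the `.pkl.gz` extension and the toy data are assumptions.

```python
import gzip
import os
import pickle
import tempfile


def save_compressed(path, title, objects):
    """Pickle `objects` into a gzip-compressed file and return its path."""
    file_path = os.path.join(path, title + ".pkl.gz")
    with gzip.open(file_path, "wb") as f:
        pickle.dump(objects, f, protocol=pickle.HIGHEST_PROTOCOL)
    return file_path


def load_compressed(file_path):
    """Load an object previously written by save_compressed."""
    with gzip.open(file_path, "rb") as f:
        return pickle.load(f)


# Toy snapshot standing in for the features dictionary
snapshot = {"train_id": [1, 2], "test_id": [3]}
with tempfile.TemporaryDirectory() as tmp:
    saved = save_compressed(tmp, "Step_05_Features", snapshot)
    restored = load_compressed(saved)

print(restored == snapshot)
```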
1020
  {
1021
   "cell_type": "markdown",
1022
   "metadata": {
1023
    "deletable": true,
1024
    "editable": true
1025
   },
1026
   "source": [
1027
    "### 5.2. Remove - Near Zero Variance\n",
1028
    "In order to reduce sparseness and invalid features, highly stationary ones were withdrawn. The features that had constant counts less than or equal a threshold were \f",
1029
    "ltered out, to exclude highly constants and near-zero variances.\n",
1030
    "\n",
1031
    "The near zero variance rules are presented in below:\n",
1032
    "- Frequency ratio: The frequency of the most prevalent value over the second most frequent value to be greater than a threshold;\n",
1033
    "- Percent of unique values: The number of unique values divided by the total number of samples to be greater than the threshold\n",
1034
    "\n",
1035
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the function\n",
1036
    "- The cutoff for the percentage of distinct values out of the number of total samples (upper limit). e.g. 10 * 100 / 100\n",
1037
    "<br/>&#9; &#8594; thresh_unique_cut\n",
1038
    "- The cutoff for the ratio of the most common value to the second most common value (lower limit). eg. 95/5\n",
1039
    "<br/>&#9; &#8594; thresh_freq_cut"
1040
   ]
1041
  },
1042
  {
1043
   "cell_type": "code",
1044
   "execution_count": null,
1045
   "metadata": {
1046
    "collapsed": false,
1047
    "deletable": true,
1048
    "editable": true
1049
   },
1050
   "outputs": [],
1051
   "source": [
1052
    "thresh_unique_cut = 100\n",
1053
    "thresh_freq_cut = 1000\n",
1054
    "\n",
1055
    "excludes = []\n",
1056
    "file_name = \"Step_05_Preprocess_NZV_config\"\n",
1057
    "features[\"train_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"train_indep\"], \n",
1058
    "                                                             excludes=excludes, \n",
1059
    "                                                             file_name=file_name, \n",
1060
    "                                                             thresh_unique_cut=thresh_unique_cut, \n",
1061
    "                                                             thresh_freq_cut=thresh_freq_cut,\n",
1062
    "                                                             to_search=True)\n",
1063
    "\n",
1064
    "file_name = \"Step_05_Preprocess_NZV\"\n",
1065
    "readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
1066
    "\n",
1067
    "file_name = \"Step_05_Preprocess_NZV_config\"\n",
1068
    "features[\"test_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"test_indep\"], \n",
1069
    "                                                            excludes=excludes, \n",
1070
    "                                                            file_name=file_name, \n",
1071
    "                                                            thresh_unique_cut=thresh_unique_cut, \n",
1072
    "                                                            thresh_freq_cut=thresh_freq_cut,\n",
1073
    "                                                            to_search=False)\n",
1074
    "\n",
1075
    "# print\n",
1076
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
1077
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1078
   ]
1079
  },
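The two near-zero-variance rules can be sketched as below. This is only an illustration, not the code behind `preprocess.near_zero_var_df`: the helper name `near_zero_var_columns` is hypothetical, and the way the two cutoffs are combined (frequency ratio above its cutoff and percentage of unique values below its cutoff) follows the common convention for this filter and is an assumption here.

```python
import pandas as pd


def near_zero_var_columns(df, thresh_unique_cut=10.0, thresh_freq_cut=95.0 / 5.0):
    """Return the names of columns flagged as near-zero variance (hypothetical helper)."""
    flagged = []
    for name in df.columns:
        # value_counts sorts the states from most to least frequent
        counts = df[name].value_counts(dropna=False)
        if len(counts) <= 1:
            # a constant column is always flagged
            flagged.append(name)
            continue
        # ratio of the most common value to the second most common value
        freq_ratio = counts.iloc[0] / counts.iloc[1]
        # percentage of distinct values out of the number of samples
        percent_unique = 100.0 * len(counts) / len(df)
        # assumed rule: both conditions must hold for a column to be flagged
        if freq_ratio > thresh_freq_cut and percent_unique < thresh_unique_cut:
            flagged.append(name)
    return flagged


demo = pd.DataFrame({
    "constant": [1] * 100,
    "rare_event": [0] * 99 + [1],
    "balanced": [0, 1] * 50,
})
flagged = near_zero_var_columns(demo)
print(flagged)
```

With the permissive settings used in the cell above (`thresh_unique_cut = 100`, `thresh_freq_cut = 1000`), a rule of this shape would remove only constant or extremely unbalanced features.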
1080
  {
1081
   "cell_type": "markdown",
1082
   "metadata": {
1083
    "deletable": true,
1084
    "editable": true
1085
   },
1086
   "source": [
1087
    "### 5.3. Remove Highly Linearly Correlated\n",
1088
    "\n",
1089
    "In this step, features that were highly linearly correlated were excluded. \n",
1090
    "\n",
1091
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the function\n",
1092
    "- A numeric value for the pair-wise absolute correlation cutoff. e.g. 0.95\n",
1093
    "<br/>&#9; &#8594; thresh_corr_cut"
1094
   ]
1095
  },
1096
  {
1097
   "cell_type": "code",
1098
   "execution_count": null,
1099
   "metadata": {
1100
    "collapsed": false,
1101
    "deletable": true,
1102
    "editable": true
1103
   },
1104
   "outputs": [],
1105
   "source": [
1106
    "thresh_corr_cut = 0.95\n",
1107
    "\n",
1108
    "excludes = list(features_types_group[\"CATEGORICAL\"])\n",
1109
    "file_name = \"Step_05_Preprocess_Corr_config\"\n",
1110
    "features[\"train_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"train_indep\"], \n",
1111
    "                                                                       excludes=excludes, \n",
1112
    "                                                                       file_name=file_name, \n",
1113
    "                                                                       thresh_corr_cut=thresh_corr_cut,\n",
1114
    "                                                                       to_search=True)\n",
1115
    "\n",
1116
    "file_name = \"Step_05_Preprocess_Corr\"\n",
1117
    "readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
1118
    "\n",
1119
    "file_name = \"Step_05_Preprocess_Corr_config\"\n",
1120
    "features[\"test_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"test_indep\"], \n",
1121
    "                                                                      excludes=excludes, \n",
1122
    "                                                                      file_name=file_name, \n",
1123
    "                                                                      thresh_corr_cut=thresh_corr_cut,\n",
1124
    "                                                                      to_search=False)\n",
1125
    "\n",
1126
    "# print\n",
1127
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
1128
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1129
   ]
1130
  },
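A pair-wise correlation filter of this kind can be sketched as follows. The helper `high_correlation_columns` is hypothetical and uses a simple greedy rule (drop the later column of each pair above the cutoff); the rule actually applied by `preprocess.high_linear_correlation_df` may differ.

```python
import numpy as np
import pandas as pd


def high_correlation_columns(df, thresh_corr_cut=0.95):
    """Return columns to drop so that no remaining pair exceeds the cutoff (hypothetical helper)."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # greedy rule: drop the later column of every pair above the cutoff
    return [name for name in upper.columns if (upper[name] > thresh_corr_cut).any()]


rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "x_scaled": 2.0 * x + 1.0,       # perfectly correlated with x
    "noise": rng.normal(size=200),   # drawn independently of x
})
to_drop = high_correlation_columns(demo)
print(to_drop)
```

The greedy rule is easy to reason about but can drop more columns than strictly necessary. Whichever rule is used, the list of dropped columns should be derived from the training sample and then reused for the test sample, as the `to_search=True` / `to_search=False` calls above suggest.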
1131
  {
1132
   "cell_type": "markdown",
1133
   "metadata": {
1134
    "deletable": true,
1135
    "editable": true
1136
   },
1137
   "source": [
1138
    "### 5.4. Descriptive Statistics"
1139
   ]
1140
  },
1141
  {
1142
   "cell_type": "markdown",
1143
   "metadata": {
1144
    "deletable": true,
1145
    "editable": true
1146
   },
1147
   "source": [
1148
    "Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
1149
   ]
1150
  },
1151
  {
1152
   "cell_type": "code",
1153
   "execution_count": null,
1154
   "metadata": {
1155
    "collapsed": true,
1156
    "deletable": true,
1157
    "editable": true
1158
   },
1159
   "outputs": [],
1160
   "source": [
1161
    "# columns\n",
1162
    "file_name = \"Step_05_Data_ColumnNames_Train\"\n",
1163
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
1164
    "                         data=list(features[\"train_indep\"].columns.values), append=False)\n",
1165
    "\n",
1166
    "# Sample - Train\n",
1167
    "file_name = \"Step_05_Stats_Categorical_Train\"\n",
1168
    "o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
1169
    "                                       file_name=file_name)\n",
1170
    "file_name = \"Step_05_Stats_Continuous_Train\"\n",
1171
    "o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1172
    "                                         file_name=file_name)\n",
1173
    "\n",
1174
    "# Sample - Test\n",
1175
    "file_name = \"Step_05_Stats_Categorical_Test\"\n",
1176
    "o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
1177
    "                                       file_name=file_name)\n",
1178
    "file_name = \"Step_05_Stats_Continuous_Test\"\n",
1179
    "o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1180
    "                                         file_name=file_name)"
1181
   ]
1182
  },
1183
  {
1184
   "cell_type": "markdown",
1185
   "metadata": {
1186
    "deletable": true,
1187
    "editable": true
1188
   },
1189
   "source": [
1190
    "<br/><br/>"
1191
   ]
1192
  },
1193
  {
1194
   "cell_type": "markdown",
1195
   "metadata": {
1196
    "deletable": true,
1197
    "editable": true
1198
   },
1199
   "source": [
1200
    "## 6. Recategorise &amp; Transform"
1201
   ]
1202
  },
1203
  {
1204
   "cell_type": "markdown",
1205
   "metadata": {
1206
    "deletable": true,
1207
    "editable": true
1208
   },
1209
   "source": [
1210
    "Verify features visually"
1211
   ]
1212
  },
1213
  {
1214
   "cell_type": "code",
1215
   "execution_count": null,
1216
   "metadata": {
1217
    "collapsed": false,
1218
    "deletable": true,
1219
    "editable": true
1220
   },
1221
   "outputs": [],
1222
   "source": [
1223
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
1224
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
1225
   ]
1226
  },
1227
  {
1228
   "cell_type": "markdown",
1229
   "metadata": {
1230
    "deletable": true,
1231
    "editable": true
1232
   },
1233
   "source": [
1234
    "### 6.1. Recategorise"
1235
   ]
1236
  },
1237
  {
1238
   "cell_type": "markdown",
1239
   "metadata": {
1240
    "deletable": true,
1241
    "editable": true
1242
   },
1243
   "source": [
1244
    "Define the factorisation function to generate dummy features for the categorical features."
1245
   ]
1246
  },
1247
  {
1248
   "cell_type": "code",
1249
   "execution_count": null,
1250
   "metadata": {
1251
    "collapsed": true,
1252
    "deletable": true,
1253
    "editable": true
1254
   },
1255
   "outputs": [],
1256
   "source": [
1257
    "def factorise_settings(max_categories_frac, min_categories_num, exclude_zero):\n",
1258
    "    categories_dic = dict()\n",
1259
    "    labels_dic = dict()\n",
1260
    "    dtypes_dic = dict()\n",
1261
    "    dummies = []\n",
1262
    "    \n",
1263
    "    for f_name in features_types_group[\"CATEGORICAL\"]:\n",
1264
    "        if f_name in features[\"train_indep\"]:\n",
1265
    "            # find top & valid states\n",
1266
    "            summaries = stats.itemfreq(features[\"train_indep\"][f_name])\n",
1267
    "            summaries = pd.DataFrame({\"value\": summaries[:, 0], \"freq\": summaries[:, 1]})\n",
1268
    "            summaries[\"value\"] = list(map(int, summaries[\"value\"]))\n",
1269
    "            summaries = summaries.sort_values(\"freq\", ascending=False)\n",
1270
    "            summaries = list(summaries[\"value\"])\n",
1271
    "\n",
1272
    "            # exclude zero state\n",
1273
    "            if exclude_zero is True and len(summaries) > 1:\n",
1274
    "                summaries = [s for s in summaries if s != 0]\n",
1275
    "                \n",
1276
    "            # if included in the states\n",
1277
    "            summaries = [v for v in summaries if v in set(features_states_values[f_name])]\n",
1278
    "\n",
1279
    "            # limit number of states\n",
1280
    "            max_cnt = max(int(len(summaries) * max_categories_frac), min_categories_num)\n",
1281
    "\n",
1282
    "            # set states\n",
1283
    "            categories_dic[f_name] = summaries[0:max_cnt]\n",
1284
    "            labels_dic[f_name] = [f_name + \"_\" + str(c) for c in categories_dic[f_name]]\n",
1285
    "            dtypes_dic = {**dtypes_dic,\n",
1286
    "                          **dict(zip(labels_dic[f_name], [pd.Series(dtype='i') for _ in range(len(categories_dic[f_name]))]))}\n",
1287
    "            dummies += labels_dic[f_name] \n",
1288
    "                \n",
1289
    "    dtypes_dic = pd.DataFrame(dtypes_dic).dtypes\n",
1290
    "\n",
1291
    "    # print        \n",
1292
    "    print(\"Total Categorical Variables : \", len(categories_dic.keys()), \n",
1293
    "          \"; Total Number of Dummy Variables: \", sum([len(categories_dic[f_name]) for f_name in categories_dic.keys()]))\n",
1294
    "    return categories_dic, labels_dic, dtypes_dic, features_types"
1295
   ]
1296
  },
1297
  {
1298
   "cell_type": "markdown",
1299
   "metadata": {
1300
    "deletable": true,
1301
    "editable": true
1302
   },
1303
   "source": [
1304
    "Select categories: by order of freq., max_categories_frac, & max_categories_num\n",
1305
    "\n",
1306
    "<br/><font style=\"font-weight:bold;color:red\">Configure:</font> The input arguments are:\n",
1307
    "- Specify the maximum number of categories a feature can have\n",
1308
    "<br/>&#9; &#8594; max_categories_frac\n",
1309
    "- Specify the minimum number of categories a feature can have\n",
1310
    "<br/>&#9; &#8594; min_categories_num\n",
1311
    "- Specify to exclude the state '0' (zero). State zero in our features represents 'any other state', including NULL\n",
1312
    "<br/>&#9; &#8594; exclude_zero = False"
1313
   ]
1314
  },
1315
  {
1316
   "cell_type": "code",
1317
   "execution_count": null,
1318
   "metadata": {
1319
    "collapsed": false,
1320
    "deletable": true,
1321
    "editable": true
1322
   },
1323
   "outputs": [],
1324
   "source": [
1325
    "max_categories_frac = 0.90\n",
1326
    "min_categories_num = 1\n",
1327
    "exclude_zero = False # if possible remove state zero\n",
1328
    "\n",
1329
    "categories_dic, labels_dic, dtypes_dic, features_types_group[\"DUMMIES\"] = \\\n",
1330
    "    factorise_settings(max_categories_frac, min_categories_num, exclude_zero)"
1331
   ]
1332
  },
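The category selection and the resulting dummy features can be illustrated on a toy column. This sketch mirrors the selection logic of `factorise_settings` (order by frequency, optionally drop state zero, cap the number of states), but the helpers `select_categories` and `add_dummies` are hypothetical and are not the project's `preprocess.factoring_feature_wise`.

```python
import pandas as pd


def select_categories(series, max_categories_frac, min_categories_num, exclude_zero):
    """Pick the most frequent states of a categorical feature (hypothetical helper)."""
    # states ordered from most to least frequent
    states = list(series.value_counts().index)
    # optionally drop the catch-all state zero, unless it is the only state
    if exclude_zero and len(states) > 1:
        states = [s for s in states if s != 0]
    # keep a fraction of the states, but never fewer than the minimum
    max_cnt = max(int(len(states) * max_categories_frac), min_categories_num)
    return states[:max_cnt]


def add_dummies(df, f_name, categories):
    """Replace one categorical column with one 0/1 column per selected state."""
    out = df.drop(columns=[f_name])
    for c in categories:
        out[f_name + "_" + str(c)] = (df[f_name] == c).astype("int8")
    return out


demo = pd.DataFrame({"ward": [0, 0, 0, 0, 3, 3, 3, 7, 7, 9]})
kept = select_categories(demo["ward"], max_categories_frac=0.5,
                         min_categories_num=1, exclude_zero=False)
encoded = add_dummies(demo, "ward", kept)
print(kept)
print(encoded.head())
```

Rows whose state was not selected end up with zeros in every dummy column, so the unselected states are effectively merged into a single reference level.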
1333
  {
1334
   "cell_type": "markdown",
1335
   "metadata": {
1336
    "deletable": true,
1337
    "editable": true
1338
   },
1339
   "source": [
1340
    "Manually add dummy variables to the dataframe &amp; remove the original Categorical variables"
1341
   ]
1342
  },
1343
  {
1344
   "cell_type": "code",
1345
   "execution_count": null,
1346
   "metadata": {
1347
    "collapsed": false,
1348
    "deletable": true,
1349
    "editable": true
1350
   },
1351
   "outputs": [],
1352
   "source": [
1353
    "features[\"train_indep_temp\"] = preprocess.factoring_feature_wise(features[\"train_indep\"], categories_dic, labels_dic, dtypes_dic, threaded=False)\n",
1354
    "features[\"test_indep_temp\"] = preprocess.factoring_feature_wise(features[\"test_indep\"], categories_dic, labels_dic, dtypes_dic, threaded=False)\n",
1355
    "\n",
1356
    "# print\n",
1357
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
1358
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1359
   ]
1360
  },
1361
  {
1362
   "cell_type": "markdown",
1363
   "metadata": {
1364
    "deletable": true,
1365
    "editable": true
1366
   },
1367
   "source": [
1368
    "Verify features visually"
1369
   ]
1370
  },
1371
  {
1372
   "cell_type": "code",
1373
   "execution_count": null,
1374
   "metadata": {
1375
    "collapsed": false,
1376
    "deletable": true,
1377
    "editable": true
1378
   },
1379
   "outputs": [],
1380
   "source": [
1381
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep_temp\"].head()], axis=1))\n",
1382
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep_temp\"].head()], axis=1))"
1383
   ]
1384
  },
1385
  {
1386
   "cell_type": "markdown",
1387
   "metadata": {
1388
    "deletable": true,
1389
    "editable": true
1390
   },
1391
   "source": [
1392
    "Set"
1393
   ]
1394
  },
1395
  {
1396
   "cell_type": "code",
1397
   "execution_count": null,
1398
   "metadata": {
1399
    "collapsed": true,
1400
    "deletable": true,
1401
    "editable": true
1402
   },
1403
   "outputs": [],
1404
   "source": [
1405
    "features[\"train_indep\"] = features[\"train_indep_temp\"].copy(True)\n",
1406
    "features[\"test_indep\"] = features[\"test_indep_temp\"].copy(True)"
1407
   ]
1408
  },
1409
  {
1410
   "cell_type": "markdown",
1411
   "metadata": {
1412
    "deletable": true,
1413
    "editable": true
1414
   },
1415
   "source": [
1416
    "<font style=\"font-weight:bold;color:red\">Clean-Up</font>"
1417
   ]
1418
  },
1419
  {
1420
   "cell_type": "code",
1421
   "execution_count": null,
1422
   "metadata": {
1423
    "collapsed": false,
1424
    "deletable": true,
1425
    "editable": true
1426
   },
1427
   "outputs": [],
1428
   "source": [
1429
    "del features[\"train_indep_temp\"]\n",
1430
    "del features[\"test_indep_temp\"]\n",
1431
    "gc.collect()"
1432
   ]
1433
  },
1434
  {
1435
   "cell_type": "markdown",
1436
   "metadata": {
1437
    "deletable": true,
1438
    "editable": true
1439
   },
1440
   "source": [
1441
    "### 6.2. Remove - Near Zero Variance"
1442
   ]
1443
  },
1444
  {
1445
   "cell_type": "markdown",
1446
   "metadata": {
1447
    "deletable": true,
1448
    "editable": true
1449
   },
1450
   "source": [
1451
    "Optional: Remove more features with near zero variance, after the factorisation step.\n",
1452
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the function"
1453
   ]
1454
  },
1455
  {
1456
   "cell_type": "code",
1457
   "execution_count": null,
1458
   "metadata": {
1459
    "collapsed": false,
1460
    "deletable": true,
1461
    "editable": true
1462
   },
1463
   "outputs": [],
1464
   "source": [
1465
    "# the cutoff for the percentage of distinct values out of the number of total samples (upper limit). e.g. 10 * 100 / 100\n",
1466
    "thresh_unique_cut = 100\n",
1467
    "# the cutoff for the ratio of the most common value to the second most common value (lower limit). eg. 95/5\n",
1468
    "thresh_freq_cut = 1000\n",
1469
    "\n",
1470
    "excludes = []\n",
1471
    "file_name = \"Step_06_Preprocess_NZV_config\"\n",
1472
    "features[\"train_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"train_indep\"], \n",
1473
    "                                                             excludes=excludes, \n",
1474
    "                                                             file_name=file_name, \n",
1475
    "                                                             thresh_unique_cut=thresh_unique_cut, \n",
1476
    "                                                             thresh_freq_cut=thresh_freq_cut,\n",
1477
    "                                                             to_search=True)\n",
1478
    "\n",
1479
    "file_name = \"Step_06_Preprocess_NZV\"\n",
1480
    "readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
1481
    "\n",
1482
    "file_name = \"Step_06_Preprocess_NZV_config\"\n",
1483
    "features[\"test_indep\"], o_summaries = preprocess.near_zero_var_df(df=features[\"test_indep\"], \n",
1484
    "                                                            excludes=excludes, \n",
1485
    "                                                            file_name=file_name, \n",
1486
    "                                                            thresh_unique_cut=thresh_unique_cut, \n",
1487
    "                                                            thresh_freq_cut=thresh_freq_cut,\n",
1488
    "                                                            to_search=False)\n",
1489
    "\n",
1490
    "# print\n",
1491
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
1492
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1493
   ]
1494
  },
1495
  {
1496
   "cell_type": "markdown",
1497
   "metadata": {
1498
    "deletable": true,
1499
    "editable": true
1500
   },
1501
   "source": [
1502
    "### 6.3. Remove Highly Linearly Correlated"
1503
   ]
1504
  },
1505
  {
1506
   "cell_type": "markdown",
1507
   "metadata": {
1508
    "deletable": true,
1509
    "editable": true
1510
   },
1511
   "source": [
1512
    "Optional: Remove more features with highly linearly correlated, after the factorisation step.\n",
1513
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the function"
1514
   ]
1515
  },
1516
  {
1517
   "cell_type": "code",
1518
   "execution_count": null,
1519
   "metadata": {
1520
    "collapsed": false,
1521
    "deletable": true,
1522
    "editable": true
1523
   },
1524
   "outputs": [],
1525
   "source": [
1526
    "# A numeric value for the pair-wise absolute correlation cutoff. e.g. 0.95\n",
1527
    "thresh_corr_cut = 0.95\n",
1528
    "\n",
1529
    "excludes = []\n",
1530
    "file_name = \"Step_06_Preprocess_Corr_config\"\n",
1531
    "features[\"train_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"train_indep\"], \n",
1532
    "                                                                       excludes=excludes, \n",
1533
    "                                                                       file_name=file_name, \n",
1534
    "                                                                       thresh_corr_cut=thresh_corr_cut,\n",
1535
    "                                                                       to_search=True)\n",
1536
    "\n",
1537
    "file_name = \"Step_06_Preprocess_Corr\"\n",
1538
    "readers_writers.save_text(path=CONSTANTS.io_path, title=file_name, data=o_summaries, append=False, ext=\"log\")\n",
1539
    "\n",
1540
    "file_name = \"Step_06_Preprocess_Corr_config\"\n",
1541
    "features[\"test_indep\"], o_summaries = preprocess.high_linear_correlation_df(df=features[\"test_indep\"], \n",
1542
    "                                                                      excludes=excludes, \n",
1543
    "                                                                      file_name=file_name, \n",
1544
    "                                                                      thresh_corr_cut=thresh_corr_cut,\n",
1545
    "                                                                      to_search=False)\n",
1546
    "\n",
1547
    "# print\n",
1548
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
1549
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
1550
   ]
1551
  },
1552
  {
1553
   "cell_type": "markdown",
1554
   "metadata": {
1555
    "deletable": true,
1556
    "editable": true
1557
   },
1558
   "source": [
1559
    "### 6.4. Descriptive Statsistics"
1560
   ]
1561
  },
1562
  {
1563
   "cell_type": "markdown",
1564
   "metadata": {
1565
    "deletable": true,
1566
    "editable": true
1567
   },
1568
   "source": [
1569
    "Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
1570
   ]
1571
  },
1572
  {
1573
   "cell_type": "code",
1574
   "execution_count": null,
1575
   "metadata": {
1576
    "collapsed": true,
1577
    "deletable": true,
1578
    "editable": true
1579
   },
1580
   "outputs": [],
1581
   "source": [
1582
    "# columns\n",
1583
    "file_name = \"Step_06_4_Data_ColumnNames_Train\"\n",
1584
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
1585
    "                         data=list(features[\"train_indep\"].columns.values), append=False)\n",
1586
    "\n",
1587
    "# Sample - Train\n",
1588
    "file_name = \"Step_06_4_Stats_Categorical_Train\"\n",
1589
    "o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
1590
    "                                       file_name=file_name)\n",
1591
    "file_name = \"Step_06_4_Stats_Continuous_Train\"\n",
1592
    "o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1593
    "                                         file_name=file_name)\n",
1594
    "\n",
1595
    "# Sample - Test\n",
1596
    "file_name = \"Step_06_4_Stats_Categorical_Test\"\n",
1597
    "o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
1598
    "                                       file_name=file_name)\n",
1599
    "file_name = \"Step_06_4_Stats_Continuous_Test\"\n",
1600
    "o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1601
    "                                         file_name=file_name)"
1602
   ]
1603
  },
1604
  {
1605
   "cell_type": "markdown",
1606
   "metadata": {
1607
    "collapsed": true,
1608
    "deletable": true,
1609
    "editable": true
1610
   },
1611
   "source": [
1612
    "### 6.5. Transformations"
1613
   ]
1614
  },
1615
  {
1616
   "cell_type": "markdown",
1617
   "metadata": {
1618
    "deletable": true,
1619
    "editable": true
1620
   },
1621
   "source": [
1622
    "Verify features visually"
1623
   ]
1624
  },
1625
  {
1626
   "cell_type": "code",
1627
   "execution_count": null,
1628
   "metadata": {
1629
    "collapsed": false,
1630
    "deletable": true,
1631
    "editable": true
1632
   },
1633
   "outputs": [],
1634
   "source": [
1635
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
1636
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
1637
   ]
1638
  },
1639
  {
1640
   "cell_type": "markdown",
1641
   "metadata": {
1642
    "collapsed": true,
1643
    "deletable": true,
1644
    "editable": true
1645
   },
1646
   "source": [
1647
    "<font style=\"font-weight:bold;color:blue\">Tranformation:</font> scale\n",
1648
    "<font style=\"font-weight:bold;color:brown\">Note:</font>: It is highly resource intensive"
1649
   ]
1650
  },
1651
  {
1652
   "cell_type": "code",
1653
   "execution_count": null,
1654
   "metadata": {
1655
    "collapsed": false,
1656
    "deletable": true,
1657
    "editable": true
1658
   },
1659
   "outputs": [],
1660
   "source": [
1661
    "transform_type = \"scale\"\n",
1662
    "kwargs = {\"with_mean\": True}\n",
1663
    "method_args = dict()\n",
1664
    "excludes = list(features_types_group[\"CATEGORICAL\"]) + list(features_types_group[\"DUMMIES\"])\n",
1665
    "\n",
1666
    "features[\"train_indep\"], method_args = preprocess.transform_df(df=features[\"train_indep\"], excludes=excludes, \n",
1667
    "                                                               transform_type=transform_type, threaded=False, \n",
1668
    "                                                               method_args=method_args, **kwargs)\n",
1669
    "features[\"test_indep\"], _ = preprocess.transform_df(df=features[\"test_indep\"], excludes=excludes, \n",
1670
    "                                                    transform_type=transform_type, threaded=False, \n",
1671
    "                                                    method_args=method_args, **kwargs)\n",
1672
    "\n",
1673
    "# print(\"Metod arguments:\", method_args)"
1674
   ]
1675
  },
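The scaling cell passes the `method_args` returned for the training sample into the call for the test sample. A minimal sketch of that fit-on-train, apply-to-test pattern is given below; that `preprocess.transform_df` stores the fitted mean and standard deviation in `method_args` is an assumption, and `fit_scale` and `apply_scale` are hypothetical helpers.

```python
import pandas as pd


def fit_scale(train, columns):
    """Learn per-column mean and standard deviation from the training sample."""
    return {"mean": train[columns].mean(), "std": train[columns].std(ddof=0)}


def apply_scale(df, columns, params, with_mean=True):
    """Scale the selected columns with previously fitted parameters."""
    out = df.copy()
    centred = out[columns] - params["mean"] if with_mean else out[columns]
    out[columns] = centred / params["std"]
    return out


train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "gender_1": [0, 1, 0]})
test = pd.DataFrame({"age": [40.0, 80.0], "gender_1": [1, 1]})

# fit on the training sample only, then reuse the parameters for the test sample
params = fit_scale(train, ["age"])
train_scaled = apply_scale(train, ["age"], params)
test_scaled = apply_scale(test, ["age"], params)
print(train_scaled)
print(test_scaled)
```

Reusing the training parameters keeps information from the test sample out of the pre-processing. The dummy column is left untouched, in line with the `excludes` list above. A column with zero standard deviation would need separate handling, which this sketch omits.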
1676
  {
1677
   "cell_type": "markdown",
1678
   "metadata": {
1679
    "deletable": true,
1680
    "editable": true
1681
   },
1682
   "source": [
1683
    "<font style=\"font-weight:bold;color:blue\">Tranformation:</font> Yeo-Johnson\n",
1684
    "<font style=\"font-weight:bold;color:brown\">Note:</font>: It is highly resource intensive"
1685
   ]
1686
  },
1687
  {
1688
   "cell_type": "code",
1689
   "execution_count": null,
1690
   "metadata": {
1691
    "collapsed": false,
1692
    "deletable": true,
1693
    "editable": true
1694
   },
1695
   "outputs": [],
1696
   "source": [
1697
    "transform_type = \"yeo_johnson\"\n",
1698
    "kwargs = {\"lmbda\": -0.5, \"derivative\": 0, \"epsilon\": np.finfo(np.float).eps, \"inverse\": False}\n",
1699
    "method_args = dict()\n",
1700
    "excludes = list(features_types_group[\"CATEGORICAL\"]) + list(features_types_group[\"DUMMIES\"])\n",
1701
    "\n",
1702
    "features[\"train_indep\"], method_args = preprocess.transform_df(df=features[\"train_indep\"], excludes=excludes, \n",
1703
    "                                                               transform_type=transform_type, threaded=False, \n",
1704
    "                                                               method_args=method_args, **kwargs)\n",
1705
    "features[\"test_indep\"], _ = preprocess.transform_df(df=features[\"test_indep\"], excludes=excludes, \n",
1706
    "                                                    transform_type=transform_type, threaded=False, \n",
1707
    "                                                    method_args=method_args, **kwargs)\n",
1708
    "\n",
1709
    "# print(\"Metod arguments:\", method_args)"
1710
   ]
1711
  },
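For reference, the Yeo-Johnson transformation with a fixed `lmbda` can be written in a few lines of NumPy. This is a sketch of the standard textbook formula only; the project's own implementation (with its `derivative`, `epsilon` and `inverse` options) is not shown in this notebook, and the function name `yeo_johnson` is hypothetical.

```python
import numpy as np


def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transformation for a fixed lambda (textbook form)."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    eps = np.finfo(float).eps

    # non-negative values: Box-Cox applied to (y + 1)
    if abs(lmbda) > eps:
        out[pos] = ((y[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(y[pos])

    # negative values: mirrored Box-Cox with power (2 - lambda)
    if abs(lmbda - 2.0) > eps:
        out[~pos] = -(((-y[~pos] + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out


values = np.array([-3.0, 0.0, 3.0])
transformed = yeo_johnson(values, lmbda=-0.5)
print(transformed)
```

Unlike Box-Cox, the transformation is defined for zero and negative inputs, and `lmbda = 1` leaves the data unchanged. With `lmbda = -0.5`, as configured above, large positive values are compressed strongly while negative values are stretched.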
1712
  {
1713
   "cell_type": "markdown",
1714
   "metadata": {
1715
    "deletable": true,
1716
    "editable": true
1717
   },
1718
   "source": [
1719
    "Visual verification"
1720
   ]
1721
  },
1722
  {
1723
   "cell_type": "code",
1724
   "execution_count": null,
1725
   "metadata": {
1726
    "collapsed": false,
1727
    "deletable": true,
1728
    "editable": true
1729
   },
1730
   "outputs": [],
1731
   "source": [
1732
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
1733
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
1734
   ]
1735
  },
1736
  {
1737
   "cell_type": "markdown",
1738
   "metadata": {
1739
    "deletable": true,
1740
    "editable": true
1741
   },
1742
   "source": [
1743
    "### 6.6. Summary Statistics"
1744
   ]
1745
  },
1746
  {
1747
   "cell_type": "markdown",
1748
   "metadata": {
1749
    "deletable": true,
1750
    "editable": true
1751
   },
1752
   "source": [
1753
    "Produce a descriptive stat report of 'Categorical', 'Continuous', & 'TARGET' features"
1754
   ]
1755
  },
1756
  {
1757
   "cell_type": "code",
1758
   "execution_count": null,
1759
   "metadata": {
1760
    "collapsed": true,
1761
    "deletable": true,
1762
    "editable": true
1763
   },
1764
   "outputs": [],
1765
   "source": [
1766
    "# Statsistics report for 'Categorical', 'Continuous', & 'TARGET' variables\n",
1767
    "# columns\n",
1768
    "file_name = \"Step_06_6_Data_ColumnNames_Train\"\n",
1769
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
1770
    "                         data=list(features[\"train_indep\"].columns.values), append=False)\n",
1771
    "\n",
1772
    "# Sample - Train\n",
1773
    "file_name = \"Step_06_6_Stats_Categorical_Train\"\n",
1774
    "o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"], includes=features_types_group[\"CATEGORICAL\"], \n",
1775
    "                                       file_name=file_name)\n",
1776
    "file_name = \"Step_06_6_Stats_Continuous_Train\"\n",
1777
    "o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1778
    "                                         file_name=file_name)\n",
1779
    "\n",
1780
    "# Sample - Test\n",
1781
    "file_name = \"Step_06_6_Stats_Categorical_Test\"\n",
1782
    "o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"], includes=features_types_group[\"CATEGORICAL\"],\n",
1783
    "                                       file_name=file_name)\n",
1784
    "file_name = \"Step_06_6_Stats_Continuous_Test\"\n",
1785
    "o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"], includes=features_types_group[\"CONTINUOUS\"], \n",
1786
    "                                         file_name=file_name)"
1787
   ]
1788
  },
1789
  {
1790
   "cell_type": "markdown",
1791
   "metadata": {
1792
    "deletable": true,
1793
    "editable": true
1794
   },
1795
   "source": [
1796
    "<br/><br/>"
1797
   ]
1798
  },
1799
  {
1800
   "cell_type": "markdown",
1801
   "metadata": {
1802
    "deletable": true,
1803
    "editable": true
1804
   },
1805
   "source": [
1806
    "## 7. Rank &amp; Select Features"
1807
   ]
1808
  },
1809
  {
1810
   "cell_type": "markdown",
1811
   "metadata": {
1812
    "deletable": true,
1813
    "editable": true
1814
   },
1815
   "source": [
1816
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the general settings"
1817
   ]
1818
  },
1819
  {
1820
   "cell_type": "code",
1821
   "execution_count": null,
1822
   "metadata": {
1823
    "collapsed": true,
1824
    "deletable": true,
1825
    "editable": true
1826
   },
1827
   "outputs": [],
1828
   "source": [
1829
    "# select the target variable\n",
1830
    "target_feature = \"label365\" # \"label30\", \"label365\"\n",
1831
    "\n",
1832
    "# number of trials\n",
1833
    "num_trials = 1\n",
1834
    "\n",
1835
    "model_rank = dict()\n",
1836
    "o_summaries_df = dict()"
1837
   ]
1838
  },
1839
  {
1840
   "cell_type": "markdown",
1841
   "metadata": {
1842
    "deletable": true,
1843
    "editable": true
1844
   },
1845
   "source": [
1846
    "### 7.1. Define"
1847
   ]
1848
  },
1849
  {
1850
   "cell_type": "markdown",
1851
   "metadata": {
1852
    "deletable": true,
1853
    "editable": true
1854
   },
1855
   "source": [
1856
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Random forest classifier (Breiman)\n",
1857
    "<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
1858
   ]
1859
  },
1860
  {
1861
   "cell_type": "code",
1862
   "execution_count": null,
1863
   "metadata": {
1864
    "collapsed": true,
1865
    "deletable": true,
1866
    "editable": true
1867
   },
1868
   "outputs": [],
1869
   "source": [
1870
    "def rank_random_forest_brieman(features_indep_arg, features_target_arg, num_trials):\n",
1871
    "    num_settings = 3\n",
1872
    "    o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
1873
    "    model_rank = [None] * (num_trials * num_settings)\n",
1874
    "\n",
1875
    "    # trials \n",
1876
    "    for i in range(num_trials):   \n",
1877
    "        print(\"Trial: \" + str(i))\n",
1878
    "        # setting-1\n",
1879
    "        s_i = i\n",
1880
    "        model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
1881
    "            features_indep_arg.values, features_target_arg.values,\n",
1882
    "            **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
1883
    "            \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
1884
    "            \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": False, \"class_weight\": None})\n",
1885
    "\n",
1886
    "        # setting-2\n",
1887
    "        s_i = num_trials + i\n",
1888
    "        model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
1889
    "            features_indep_arg.values, features_target_arg.values,\n",
1890
    "            **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 50, \"min_samples_leaf\": 25,\n",
1891
    "            \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
1892
    "            \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": False, \"class_weight\": None})\n",
1893
    "\n",
1894
    "        # setting-3\n",
1895
    "        s_i = (num_trials * 2) + i\n",
1896
    "        model_rank[s_i] = feature_selection.rank_random_forest_breiman(\n",
1897
    "            features_indep_arg.values, features_target_arg.values,\n",
1898
    "            **{\"n_estimators\": 10, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 40, \"min_samples_leaf\": 20,\n",
1899
    "            \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto', \"max_leaf_nodes\": None, \"bootstrap\": True,\n",
1900
    "            \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None, \"verbose\": 0, \"warm_start\": True, \"class_weight\": None})\n",
1901
    "\n",
1902
    "    for i in range((num_trials * num_settings)):\n",
1903
    "        o_summaries_df[i]['Importance'] = list(model_rank[i].feature_importances_)\n",
1904
    "        o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
1905
    "        o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
1906
    "        o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
1907
    "    return model_rank, o_summaries_df"
1908
   ]
1909
  },
1910
  {
1911
   "cell_type": "markdown",
1912
   "metadata": {
1913
    "deletable": true,
1914
    "editable": true
1915
   },
1916
   "source": [
1917
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Gradient Boosted Regression Trees (GBRT) \n",
1918
    "<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
1919
   ]
1920
  },
1921
  {
1922
   "cell_type": "code",
1923
   "execution_count": null,
1924
   "metadata": {
1925
    "collapsed": true,
1926
    "deletable": true,
1927
    "editable": true
1928
   },
1929
   "outputs": [],
1930
   "source": [
1931
    "def rank_gbrt(features_indep_arg, features_target_arg, num_trials):\n",
1932
    "    num_settings = 3\n",
1933
    "    o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
1934
    "    model_rank = [None] * (num_trials * num_settings)\n",
1935
    "\n",
1936
    "    # trials \n",
1937
    "    for i in range(num_trials):   \n",
1938
    "        print(\"Trial: \" + str(i))\n",
1939
    "        # setting-1\n",
1940
    "        s_i = i\n",
1941
    "        model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
1942
    "            features_indep_arg.values, features_target_arg.values, \n",
1943
    "            **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
1944
    "            \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 10, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
1945
    "            \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
1946
    "        \n",
1947
    "        # setting-2\n",
1948
    "        s_i = num_trials + i\n",
1949
    "        model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
1950
    "            features_indep_arg.values, features_target_arg.values,\n",
1951
    "            **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
1952
    "            \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 5, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
1953
    "            \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
1954
    "\n",
1955
    "        # setting-3\n",
1956
    "        s_i = (num_trials * 2) + i\n",
1957
    "        model_rank[s_i] = feature_selection.rank_tree_gbrt(\n",
1958
    "            features_indep_arg.values, features_target_arg.values,\n",
1959
    "            **{\"loss\": 'ls', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 2, \"min_samples_leaf\": 1,\n",
1960
    "            \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 3, \"init\": None, \"random_state\": None, \"max_features\": None, \"alpha\": 0.9,\n",
1961
    "            \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": True})\n",
1962
    "\n",
1963
    "    for i in range((num_trials * num_settings)):\n",
1964
    "        o_summaries_df[i]['Importance'] = list(model_rank[i].feature_importances_)\n",
1965
    "        o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
1966
    "        o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
1967
    "        o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
1968
    "    return model_rank, o_summaries_df"
1969
   ]
1970
  },
1971
  {
1972
   "cell_type": "markdown",
1973
   "metadata": {
1974
    "deletable": true,
1975
    "editable": true
1976
   },
1977
   "source": [
1978
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Randomized Logistic Regression\n",
1979
    "<br/>Define a set of classifiers with different settings, to be used in feature ranking trials."
1980
   ]
1981
  },
1982
  {
1983
   "cell_type": "code",
1984
   "execution_count": null,
1985
   "metadata": {
1986
    "collapsed": true,
1987
    "deletable": true,
1988
    "editable": true
1989
   },
1990
   "outputs": [],
1991
   "source": [
1992
    "def rank_randLogit(features_indep_arg, features_target_arg, num_trials):\n",
1993
    "    num_settings = 3\n",
1994
    "    o_summaries_df = [pd.DataFrame({'Name': list(features_indep_arg.columns.values)}) for _ in range(num_trials * num_settings)]\n",
1995
    "    model_rank = [None] * (num_trials * num_settings)\n",
1996
    "\n",
1997
    "    # trials \n",
1998
    "    for i in range(num_trials):   \n",
1999
    "        print(\"Trial: \" + str(i))\n",
2000
    "        # setting-1\n",
2001
    "        s_i = i\n",
2002
    "        model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
2003
    "            features_indep_arg.values, features_target_arg.values,\n",
2004
    "            **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.75, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
2005
    "            \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
2006
    "\n",
2007
    "        # setting-2\n",
2008
    "        s_i = num_trials + i\n",
2009
    "        model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
2010
    "            features_indep_arg.values, features_target_arg.values,\n",
2011
    "            **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.50, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
2012
    "            \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
2013
    "\n",
2014
    "        # setting-3\n",
2015
    "        s_i = (num_trials * 2) + i\n",
2016
    "        model_rank[s_i] = feature_selection.rank_random_logistic_regression(\n",
2017
    "            features_indep_arg.values, features_target_arg.values,\n",
2018
    "            **{\"C\": 1, \"scaling\": 0.5, \"sample_fraction\": 0.90, \"n_resampling\": 200, \"selection_threshold\": 0.25, \"tol\": 0.001,\n",
2019
    "            \"fit_intercept\": True, \"verbose\": False, \"normalize\": True, \"random_state\": None, \"n_jobs\": 1, \"pre_dispatch\": '3*n_jobs'})\n",
2020
    "                \n",
2021
    "    for i in range((num_trials * num_settings)):\n",
2022
    "        o_summaries_df[i]['Importance'] = list(model_rank[i].scores_)\n",
2023
    "        o_summaries_df[i] = o_summaries_df[i].sort_values(['Importance'], ascending = [0])\n",
2024
    "        o_summaries_df[i] = o_summaries_df[i].reset_index(drop = True)\n",
2025
    "        o_summaries_df[i]['Order'] = range(1, len(o_summaries_df[i]['Importance']) + 1)\n",
2026
    "    return model_rank, o_summaries_df"
2027
   ]
2028
  },
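The ranking functions in this section store their fitted models in a flat list indexed by `s_i`. As a minimal sketch of that layout (the helper `flat_index` is hypothetical and not part of the notebook), setting `k` (0-based) of trial `i` lands at position `k * num_trials + i`:

```python
def flat_index(setting, trial, num_trials):
    """Position of (setting, trial) in the flat model list; hypothetical helper."""
    return setting * num_trials + trial


num_trials, num_settings = 2, 3

# Enumerate the slots in the order the ranking functions fill them:
# for each trial, settings 1 to 3 in turn.
layout = {
    flat_index(k, i, num_trials): (k + 1, i)
    for i in range(num_trials)
    for k in range(num_settings)
}

for slot in sorted(layout):
    setting_no, trial_no = layout[slot]
    print("slot", slot, "-> setting", setting_no, "trial", trial_no)
```

With this layout, all trials of one setting sit next to each other, so a slice such as `model_rank[k * num_trials:(k + 1) * num_trials]` would return every trial of setting `k + 1`.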
2029
  {
2030
   "cell_type": "markdown",
2031
   "metadata": {
2032
    "deletable": true,
2033
    "editable": true
2034
   },
2035
   "source": [
2036
    "### 7.2. Run"
2037
   ]
2038
  },
2039
  {
2040
   "cell_type": "markdown",
2041
   "metadata": {
2042
    "deletable": true,
2043
    "editable": true
2044
   },
2045
   "source": [
2046
    "Run one or more feature ranking methods and trials"
2047
   ]
2048
  },
2049
  {
2050
   "cell_type": "markdown",
2051
   "metadata": {
2052
    "deletable": true,
2053
    "editable": true
2054
   },
2055
   "source": [
2056
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Random forest classifier (Breiman)\n",
2057
    "<br/><font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive."
2058
   ]
2059
  },
2060
  {
2061
   "cell_type": "code",
2062
   "execution_count": null,
2063
   "metadata": {
2064
    "collapsed": false,
2065
    "deletable": true,
2066
    "editable": true
2067
   },
2068
   "outputs": [],
2069
   "source": [
2070
    "rank_model = \"rfc\"\n",
2071
    "model_rank[rank_model] = dict() \n",
2072
    "o_summaries_df[rank_model] = dict() \n",
2073
    "model_rank[rank_model], o_summaries_df[rank_model] = rank_random_forest_brieman(\n",
2074
    "    features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
2075
   ]
2076
  },
2077
  {
2078
   "cell_type": "markdown",
2079
   "metadata": {
2080
    "deletable": true,
2081
    "editable": true
2082
   },
2083
   "source": [
2084
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Gradient Boosted Regression Trees (GBRT)\n",
2085
    "<br/><font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive."
2086
   ]
2087
  },
2088
  {
2089
   "cell_type": "code",
2090
   "execution_count": null,
2091
   "metadata": {
2092
    "collapsed": false,
2093
    "deletable": true,
2094
    "editable": true
2095
   },
2096
   "outputs": [],
2097
   "source": [
2098
    "rank_model = \"gbrt\"\n",
2099
    "model_rank[rank_model] = dict() \n",
2100
    "o_summaries_df[rank_model] = dict() \n",
2101
    "model_rank[rank_model], o_summaries_df[rank_model] = rank_gbrt(\n",
2102
    "    features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
2103
   ]
2104
  },
2105
  {
2106
   "cell_type": "markdown",
2107
   "metadata": {
2108
    "deletable": true,
2109
    "editable": true
2110
   },
2111
   "source": [
2112
    "<font style=\"font-weight:bold;color:blue\">Ranking Method:</font> Randomized Logistic Regression\n",
2113
    "<br/><font style=\"font-weight:bold;color:brown\">Note:</font> It is moderately resource intensive."
2114
   ]
2115
  },
2116
  {
2117
   "cell_type": "code",
2118
   "execution_count": null,
2119
   "metadata": {
2120
    "collapsed": false,
2121
    "deletable": true,
2122
    "editable": true
2123
   },
2124
   "outputs": [],
2125
   "source": [
2126
    "rank_model = \"randLogit\"\n",
2127
    "model_rank[rank_model] = dict() \n",
2128
    "o_summaries_df[rank_model] = dict() \n",
2129
    "model_rank[rank_model], o_summaries_df[rank_model] = rank_randLogit(\n",
2130
    "    features[\"train_indep\"], features[\"train_target\"][target_feature], num_trials)"
2131
   ]
2132
  },
2133
  {
2134
   "cell_type": "markdown",
2135
   "metadata": {
2136
    "deletable": true,
2137
    "editable": true
2138
   },
2139
   "source": [
2140
    "### 7.3. Summaries"
2141
   ]
2142
  },
2143
  {
2144
   "cell_type": "code",
2145
   "execution_count": null,
2146
   "metadata": {
2147
    "collapsed": true,
2148
    "deletable": true,
2149
    "editable": true
2150
   },
2151
   "outputs": [],
2152
   "source": [
2153
    "# combine scores\n",
2154
    "def rank_summarise (features_arg, o_summaries_df_arg):\n",
2155
    "    summaries_temp = {'Order_avg': [], 'Order_max': [],  'Order_min': [], 'Importance_avg': []}\n",
2156
    "    summary_order = []\n",
2157
    "    summary_importance = []\n",
2158
    "    \n",
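2159
    "    for f_name in list(features_arg.columns.values):\n",
    "        # reset per feature, so statistics are not accumulated across features\n",
    "        summary_order = []\n",
    "        summary_importance = []\n",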
2160
    "        for i in range(len(o_summaries_df_arg)):\n",
2161
    "            summary_order.append(o_summaries_df_arg[i][o_summaries_df_arg[i]['Name'] == f_name]['Order'].values)\n",
2162
    "            summary_importance.append(o_summaries_df_arg[i][o_summaries_df_arg[i]['Name'] == f_name]['Importance'].values)\n",
2163
    "\n",
2164
    "        summaries_temp['Order_avg'].append(statistics.mean(np.concatenate(summary_order)))\n",
2165
    "        summaries_temp['Order_max'].append(max(np.concatenate(summary_order)))\n",
2166
    "        summaries_temp['Order_min'].append(min(np.concatenate(summary_order)))\n",
2167
    "        summaries_temp['Importance_avg'].append(statistics.mean(np.concatenate(summary_importance)))\n",
2168
    "\n",
2169
    "    summaries_df = pd.DataFrame({'Name': list(features_arg.columns.values)})\n",
2170
    "    summaries_df['Order_avg'] = summaries_temp['Order_avg']\n",
2171
    "    summaries_df['Order_max'] = summaries_temp['Order_max']\n",
2172
    "    summaries_df['Order_min'] = summaries_temp['Order_min']\n",
2173
    "    summaries_df['Importance_avg'] = summaries_temp['Importance_avg']\n",
2174
    "    summaries_df = summaries_df.sort_values(['Order_avg'], ascending = [1])\n",
2175
    "    return summaries_df"
2176
   ]
2177
  },
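As a rough illustration of what the summary step computes, the sketch below aggregates per-trial rankings for each feature using only the standard library. The feature names, scores and the `summarise` helper are made up for the example; the notebook itself works on pandas DataFrames.

```python
import statistics

# Hypothetical per-trial rankings: feature name -> (order, importance)
trials = [
    {"age": (1, 0.50), "los": (2, 0.30), "imd": (3, 0.20)},
    {"age": (2, 0.35), "los": (1, 0.45), "imd": (3, 0.20)},
]


def summarise(trials):
    """Return one summary row per feature, best average rank first."""
    rows = []
    for name in trials[0]:
        # Collect this feature's values only, so that the statistics of
        # one feature are not mixed with those of another.
        orders = [t[name][0] for t in trials]
        importances = [t[name][1] for t in trials]
        rows.append({
            "Name": name,
            "Order_avg": statistics.mean(orders),
            "Order_max": max(orders),
            "Order_min": min(orders),
            "Importance_avg": statistics.mean(importances),
        })
    # A lower average order means a better rank
    return sorted(rows, key=lambda r: r["Order_avg"])


for row in summarise(trials):
    print(row)
```

Keeping the minimum and maximum order next to the average makes it easier to spot features whose rank is unstable across trials and settings.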
2178
  {
2179
   "cell_type": "code",
2180
   "execution_count": null,
2181
   "metadata": {
2182
    "collapsed": true,
2183
    "deletable": true,
2184
    "editable": true
2185
   },
2186
   "outputs": [],
2187
   "source": [
2188
    "# combine scores\n",
2189
    "summaries_df = dict()\n",
2190
    "\n",
2191
    "for rank_model in o_summaries_df.keys():\n",
2192
    "    summaries_df[rank_model] = dict()\n",
2193
    "    summaries_df[rank_model] = rank_summarise(features[\"train_indep\"], o_summaries_df[rank_model])"
2194
   ]
2195
  },
2196
  {
2197
   "cell_type": "markdown",
2198
   "metadata": {
2199
    "deletable": true,
2200
    "editable": true
2201
   },
2202
   "source": [
2203
    "Save"
2204
   ]
2205
  },
2206
  {
2207
   "cell_type": "code",
2208
   "execution_count": null,
2209
   "metadata": {
2210
    "collapsed": false,
2211
    "deletable": true,
2212
    "editable": true
2213
   },
2214
   "outputs": [],
2215
   "source": [
2216
    "for rank_model in model_rank.keys():\n",
2217
    "    file_name = \"Step_07_Model_Train_model_rank_\" + rank_model\n",
2218
    "    readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=model_rank[rank_model])\n",
2219
    "    \n",
2220
    "    file_name = \"Step_07_Model_Train_model_rank_summaries_\" + rank_model\n",
2221
    "    readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=o_summaries_df[rank_model])"
2222
   ]
2223
  },
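The implementation of `readers_writers.save_serialised_compressed` is not shown in this notebook. Judging by the `.bz2` extension used for the saved files, one plausible approach is pickling followed by bz2 compression. The sketch below is an assumption along those lines; `save_compressed` and `load_compressed` are hypothetical stand-ins, not the project's functions.

```python
import bz2
import os
import pickle
import tempfile


def save_compressed(path, title, obj):
    """Pickle obj and write it bz2-compressed to <path>/<title>.bz2 (hypothetical)."""
    full_path = os.path.join(path, title + ".bz2")
    with bz2.open(full_path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    return full_path


def load_compressed(path, title):
    """Read back an object written by save_compressed (hypothetical)."""
    with bz2.open(os.path.join(path, title + ".bz2"), "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory with a small stand-in object
with tempfile.TemporaryDirectory() as tmp:
    summaries = {"rfc": [{"Name": "age", "Importance": 0.5, "Order": 1}]}
    saved = save_compressed(tmp, "Step_07_demo", summaries)
    restored = load_compressed(tmp, "Step_07_demo")
    print(os.path.basename(saved), restored == summaries)
```

Pickled files should only be loaded from trusted locations, since unpickling can execute arbitrary code.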
2224
  {
2225
   "cell_type": "markdown",
2226
   "metadata": {
2227
    "deletable": true,
2228
    "editable": true
2229
   },
2230
   "source": [
2231
    "### 7.4. Select Top Features"
2232
   ]
2233
  },
2234
  {
2235
   "cell_type": "markdown",
2236
   "metadata": {
2237
    "deletable": true,
2238
    "editable": true
2239
   },
2240
   "source": [
2241
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the selection method"
2242
   ]
2243
  },
2244
  {
2245
   "cell_type": "code",
2246
   "execution_count": null,
2247
   "metadata": {
2248
    "collapsed": true,
2249
    "deletable": true,
2250
    "editable": true
2251
   },
2252
   "outputs": [],
2253
   "source": [
2254
    "rank_model = \"rfc\"\n",
2255
    "file_name = \"Step_07_Top_Features_\" + rank_model\n",
2256
    "rank_top_features_max = 400\n",
2257
    "rank_top_features_score_min = 0.1 * (10 ** -20)  # '**' is exponentiation; '^' would be bitwise XOR\n",
2258
    "\n",
2259
    "# sort features\n",
2260
    "features_names_selected = summaries_df[rank_model]['Name'][summaries_df[rank_model]['Importance_avg'] >= rank_top_features_score_min]\n",
2261
    "features_names_selected = (features_names_selected[0:rank_top_features_max]).tolist()"
2262
   ]
2263
  },
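In Python, `^` is bitwise XOR rather than exponentiation, so a threshold written with `^` does not yield a small positive number. The sketch below contrasts the two operators and then applies a minimum-score filter followed by a cap on the number of features. The feature names and scores are made up for the example.

```python
# Bitwise XOR versus exponentiation
xor_value = 10 ^ -20     # integer XOR of 10 and -20
power_value = 10 ** -20  # ten to the power of minus twenty (a float)
print("10 ^ -20  =", xor_value)
print("10 ** -20 =", power_value)

# Hypothetical ranked features: (name, average importance), best rank first
ranked = [("age", 0.41), ("imd", 0.32), ("los", 0.27), ("gender", 0.0)]

score_min = 0.1 * 10 ** -20  # tiny positive threshold
top_max = 2                  # keep at most this many features

# Drop features below the threshold, then keep the first top_max of the rest
selected = [name for name, score in ranked if score >= score_min][:top_max]
print(selected)
```

Because the threshold is strictly positive, a feature with an importance of exactly zero is excluded even when the cap has not been reached.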
2264
  {
2265
   "cell_type": "markdown",
2266
   "metadata": {
2267
    "deletable": true,
2268
    "editable": true
2269
   },
2270
   "source": [
2271
    "Save"
2272
   ]
2273
  },
2274
  {
2275
   "cell_type": "code",
2276
   "execution_count": null,
2277
   "metadata": {
2278
    "collapsed": false,
2279
    "deletable": true,
2280
    "editable": true
2281
   },
2282
   "outputs": [],
2283
   "source": [
2284
    "# save to CSV\n",
2285
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, data=features_names_selected, append=False, header=False)\n",
2286
    "\n",
2287
    "# print     \n",
2288
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
2289
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")\n",
2290
    "print(\"List of sorted features, which can be modified:\\n  \" + os.path.join(CONSTANTS.io_path, file_name + \".csv\"))"
2291
   ]
2292
  },
2293
  {
2294
   "cell_type": "markdown",
2295
   "metadata": {
2296
    "deletable": true,
2297
    "editable": true
2298
   },
2299
   "source": [
2300
    "<font style=\"font-weight:bold;color:red\">Configure:</font> adjust the selected features manually, if necessary."
2301
   ]
2302
  },
2303
  {
2304
   "cell_type": "code",
2305
   "execution_count": null,
2306
   "metadata": {
2307
    "collapsed": false,
2308
    "deletable": true,
2309
    "editable": true
2310
   },
2311
   "outputs": [],
2312
   "source": [
2313
    "file_name = \"Step_07_Top_Features_rfc_adhoc\" \n",
2314
    "\n",
2315
    "features_names_selected = readers_writers.load_csv(path=CONSTANTS.io_path, title=file_name, dataframing=False)[0]\n",
2316
    "features_names_selected = [f.replace(\"\\n\", \"\") for f in features_names_selected]\n",
2317
    "display(pd.DataFrame(features_names_selected))"
2318
   ]
2319
  },
2320
  {
2321
   "cell_type": "markdown",
2322
   "metadata": {
2323
    "deletable": true,
2324
    "editable": true
2325
   },
2326
   "source": [
2327
    "Verify the top features visually"
2328
   ]
2329
  },
2330
  {
2331
   "cell_type": "code",
2332
   "execution_count": null,
2333
   "metadata": {
2334
    "collapsed": false,
2335
    "deletable": true,
2336
    "editable": true
2337
   },
2338
   "outputs": [],
2339
   "source": [
2340
    "# print     \n",
2341
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns), \n",
2342
    "    \";\\nNumber of top columns: \", len(features[\"train_indep\"][features_names_selected].columns)) \n",
2343
    "print(\"features: {train: \", len(features[\"train_indep\"][features_names_selected]), \", test: \", len(features[\"test_indep\"][features_names_selected]), \"}\")"
2344
   ]
2345
  },
2346
  {
2347
   "cell_type": "markdown",
2348
   "metadata": {
2349
    "deletable": true,
2350
    "editable": true
2351
   },
2352
   "source": [
2353
    "### 7.5. Summary Statistics"
2354
   ]
2355
  },
2356
  {
2357
   "cell_type": "markdown",
2358
   "metadata": {
2359
    "deletable": true,
2360
    "editable": true
2361
   },
2362
   "source": [
2363
    "Produce a descriptive statistics report of the 'Categorical', 'Continuous', & 'TARGET' features"
2364
   ]
2365
  },
2366
  {
2367
   "cell_type": "code",
2368
   "execution_count": null,
2369
   "metadata": {
2370
    "collapsed": true,
2371
    "deletable": true,
2372
    "editable": true
2373
   },
2374
   "outputs": [],
2375
   "source": [
2376
    "# columns\n",
2377
    "file_name = \"Step_07_Data_ColumnNames_Train\"\n",
2378
    "readers_writers.save_csv(path=CONSTANTS.io_path, title=file_name, \n",
2379
    "                         data=list(features[\"train_indep\"][features_names_selected].columns.values), append=False)\n",
2380
    "\n",
2381
    "# Sample - Train\n",
2382
    "file_name = \"Step_07_Stats_Categorical_Train\"\n",
2383
    "o_stats = preprocess.stats_discrete_df(df=features[\"train_indep\"][features_names_selected], includes=features_types_group[\"CATEGORICAL\"], \n",
2384
    "                                       file_name=file_name)\n",
2385
    "file_name = \"Step_07_Stats_Continuous_Train\"\n",
2386
    "o_stats = preprocess.stats_continuous_df(df=features[\"train_indep\"][features_names_selected], includes=features_types_group[\"CONTINUOUS\"], \n",
2387
    "                                         file_name=file_name)\n",
2388
    "\n",
2389
    "# Sample - Test\n",
2390
    "file_name = \"Step_07_Stats_Categorical_Test\"\n",
2391
    "o_stats = preprocess.stats_discrete_df(df=features[\"test_indep\"][features_names_selected], includes=features_types_group[\"CATEGORICAL\"],\n",
2392
    "                                       file_name=file_name)\n",
2393
    "file_name = \"Step_07_Stats_Continuous_Test\"\n",
2394
    "o_stats = preprocess.stats_continuous_df(df=features[\"test_indep\"][features_names_selected], includes=features_types_group[\"CONTINUOUS\"], \n",
2395
    "                                         file_name=file_name)"
2396
   ]
2397
  },
2398
  {
2399
   "cell_type": "markdown",
2400
   "metadata": {
2401
    "deletable": true,
2402
    "editable": true
2403
   },
2404
   "source": [
2405
    "### 7.6. Save Features"
2406
   ]
2407
  },
2408
  {
2409
   "cell_type": "code",
2410
   "execution_count": null,
2411
   "metadata": {
2412
    "collapsed": false,
2413
    "deletable": true,
2414
    "editable": true
2415
   },
2416
   "outputs": [],
2417
   "source": [
2418
    "file_name = \"Step_07_Features\"\n",
2419
    "readers_writers.save_serialised_compressed(path=CONSTANTS.io_path, title=file_name, objects=features)\n",
2420
    "\n",
2421
    "# print     \n",
2422
    "print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, file_name + \".bz2\")).st_size)\n",
2423
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
2424
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
2425
   ]
2426
  },
2427
  {
2428
   "cell_type": "markdown",
2429
   "metadata": {
2430
    "deletable": true,
2431
    "editable": true
2432
   },
2433
   "source": [
2434
    "<br/><br/>"
2435
   ]
2436
  },
2437
  {
2438
   "cell_type": "markdown",
2439
   "metadata": {
2440
    "deletable": true,
2441
    "editable": true
2442
   },
2443
   "source": [
2444
    "<br/><br/>"
2445
   ]
2446
  },
2447
  {
2448
   "cell_type": "markdown",
2449
   "metadata": {
2450
    "deletable": true,
2451
    "editable": true
2452
   },
2453
   "source": [
2454
    "## 8. Model"
2455
   ]
2456
  },
2457
  {
2458
   "cell_type": "markdown",
2459
   "metadata": {
2460
    "collapsed": true,
2461
    "deletable": true,
2462
    "editable": true
2463
   },
2464
   "source": [
2465
    "<font style=\"font-weight:bold;color:orange\">Load Saved Samples and Feature Rankings:</font> \n",
2466
    "<br/> This step is optional. It loads the serialised & compressed outputs of Step-7."
2467
   ]
2468
  },
2469
  {
2470
   "cell_type": "code",
2471
   "execution_count": null,
2472
   "metadata": {
2473
    "collapsed": false,
2474
    "deletable": true,
2475
    "editable": true
2476
   },
2477
   "outputs": [],
2478
   "source": [
2479
    "# open features\n",
2480
    "file_name = \"Step_07_Features\"\n",
2481
    "features = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)\n",
2482
    "\n",
2483
    "# print     \n",
2484
    "print(\"File size: \", os.stat(os.path.join(CONSTANTS.io_path, file_name + \".bz2\")).st_size)\n",
2485
    "print(\"Number of columns: \", len(features[\"train_indep\"].columns)) \n",
2486
    "print(\"features: {train: \", len(features[\"train_indep\"]), \", test: \", len(features[\"test_indep\"]), \"}\")"
2487
   ]
2488
  },
2489
  {
2490
   "cell_type": "code",
2491
   "execution_count": null,
2492
   "metadata": {
2493
    "collapsed": true,
2494
    "deletable": true,
2495
    "editable": true
2496
   },
2497
   "outputs": [],
2498
   "source": [
2499
    "# open scoring model files\n",
2500
    "rank_models = [\"rfc\", \"gbrt\", \"randLogit\"]\n",
2501
    "model_rank = dict()\n",
2502
    "o_summaries_df = dict()\n",
2503
    "\n",
2504
    "for rank_model in rank_models:\n",
2505
    "    file_name = \"Step_07_Model_Train_model_rank_\" + rank_model\n",
2506
    "    if not readers_writers.exists_serialised(path=CONSTANTS.io_path, title=file_name, ext=\"bz2\"):\n",
2507
    "        continue\n",
2508
    "\n",
2509
    "    file_name = \"Step_07_Model_Train_model_rank_\" + rank_model\n",
2510
    "    model_rank[rank_model] = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)\n",
2511
    "\n",
2512
    "    file_name = \"Step_07_Model_Train_model_rank_summaries_\" + rank_model\n",
2513
    "    o_summaries_df[rank_model] = readers_writers.load_serialised_compressed(path=CONSTANTS.io_path, title=file_name)"
2514
   ]
2515
  },
2516
  {
2517
   "cell_type": "markdown",
2518
   "metadata": {
2519
    "deletable": true,
2520
    "editable": true
2521
   },
2522
   "source": [
2523
    "Verify features visually"
2524
   ]
2525
  },
2526
  {
2527
   "cell_type": "code",
2528
   "execution_count": null,
2529
   "metadata": {
2530
    "collapsed": false,
2531
    "deletable": true,
2532
    "editable": true,
2533
    "scrolled": true
2534
   },
2535
   "outputs": [],
2536
   "source": [
2537
    "display(pd.concat([features[\"train_id\"].head(), features[\"train_target\"].head(), features[\"train_indep\"].head()], axis=1))\n",
2538
    "display(pd.concat([features[\"test_id\"].head(), features[\"test_target\"].head(), features[\"test_indep\"].head()], axis=1))"
2539
   ]
2540
  },
2541
  {
2542
   "cell_type": "markdown",
2543
   "metadata": {
2544
    "deletable": true,
2545
    "editable": true
2546
   },
2547
   "source": [
2548
    "<br/><br/>"
2549
   ]
2550
  },
2551
  {
2552
   "cell_type": "markdown",
2553
   "metadata": {
2554
    "deletable": true,
2555
    "editable": true
2556
   },
2557
   "source": [
2558
    "### 8.1.  Initialise"
2559
   ]
2560
  },
2561
  {
2562
   "cell_type": "markdown",
2563
   "metadata": {
2564
    "deletable": true,
2565
    "editable": true
2566
   },
2567
   "source": [
2568
    "#### 8.1.1. Algorithms"
2569
   ]
2570
  },
2571
  {
2572
   "cell_type": "markdown",
2573
   "metadata": {
2574
    "deletable": true,
2575
    "editable": true
2576
   },
2577
   "source": [
2578
    "<font style=\"font-weight:bold;color:red\">Configure:</font> the training algorithm"
2579
   ]
2580
  },
2581
  {
2582
   "cell_type": "markdown",
2583
   "metadata": {
2584
    "deletable": true,
2585
    "editable": true
2586
   },
2587
   "source": [
2588
    "<font style=\"font-weight:bold;color:brown\">Algorithm 1</font>: Random Forest"
2589
   ]
2590
  },
2591
  {
2592
   "cell_type": "code",
2593
   "execution_count": null,
2594
   "metadata": {
2595
    "collapsed": true,
2596
    "deletable": true,
2597
    "editable": true,
2598
    "scrolled": true
2599
   },
2600
   "outputs": [],
2601
   "source": [
2602
    "method_name = \"rfc\"\n",
2603
    "kwargs = {\"n_estimators\": 20, \"criterion\": 'gini', \"max_depth\": None, \"min_samples_split\": 100,\n",
2604
    "    \"min_samples_leaf\": 50, \"min_weight_fraction_leaf\": 0.0, \"max_features\": 'auto',\n",
2605
    "    \"max_leaf_nodes\": None, \"bootstrap\": True, \"oob_score\": False, \"n_jobs\": -1, \"random_state\": None,\n",
2606
    "    \"verbose\": 0, \"warm_start\": False, \"class_weight\": \"balanced_subsample\"}"
2607
   ]
2608
  },
2609
  {
2610
   "cell_type": "markdown",
2611
   "metadata": {
2612
    "deletable": true,
2613
    "editable": true
2614
   },
2615
   "source": [
2616
    "<font style=\"font-weight:bold;color:brown\">Algorithm 2</font>: Logistic Regression"
2617
   ]
2618
  },
2619
  {
2620
   "cell_type": "code",
2621
   "execution_count": null,
2622
   "metadata": {
2623
    "collapsed": true,
2624
    "deletable": true,
2625
    "editable": true,
2626
    "scrolled": true
2627
   },
2628
   "outputs": [],
2629
   "source": [
2630
    "method_name = \"lr\"\n",
2631
    "kwargs = {\"penalty\": 'l1', \"dual\": False, \"tol\": 0.0001, \"C\": 1, \"fit_intercept\": True, \"intercept_scaling\": 1,\n",
2632
    "          \"class_weight\": None, \"random_state\": None, \"solver\": 'liblinear', \"max_iter\": 100, \"multi_class\": 'ovr',\n",
2633
    "          \"verbose\": 0, \"warm_start\": False, \"n_jobs\": -1}"
2634
   ]
2635
  },
2636
  {
2637
   "cell_type": "markdown",
2638
   "metadata": {
2639
    "deletable": true,
2640
    "editable": true
2641
   },
2642
   "source": [
2643
    "<font style=\"font-weight:bold;color:brown\">Algorithm 3</font>: Logistic Regression with Cross-Validation"
2644
   ]
2645
  },
2646
  {
2647
   "cell_type": "code",
2648
   "execution_count": null,
2649
   "metadata": {
2650
    "collapsed": true,
2651
    "deletable": true,
2652
    "editable": true
2653
   },
2654
   "outputs": [],
2655
   "source": [
2656
    "method_name = \"lr_cv\"\n",
2657
    "kwargs = {\"Cs\": 10, \"fit_intercept\": True, \"cv\": None, \"dual\": False, \"penalty\": 'l2', \"scoring\": None, \n",
2658
    "          \"solver\": 'lbfgs', \"tol\": 0.0001, \"max_iter\": 10, \"class_weight\": None, \"n_jobs\": -1, \"verbose\": 0, \n",
2659
    "          \"refit\": True, \"intercept_scaling\": 1.0, \"multi_class\": \"ovr\", \"random_state\": None}"
2660
   ]
2661
  },
2662
  {
2663
   "cell_type": "markdown",
2664
   "metadata": {
2665
    "deletable": true,
2666
    "editable": true
2667
   },
2668
   "source": [
2669
    "<font style=\"font-weight:bold;color:brown\">Algorithm 4</font>: Neural Network"
2670
   ]
2671
  },
2672
  {
2673
   "cell_type": "code",
2674
   "execution_count": null,
2675
   "metadata": {
2676
    "collapsed": true,
2677
    "deletable": true,
2678
    "editable": true
2679
   },
2680
   "outputs": [],
2681
   "source": [
2682
    "method_name = \"nn\"\n",
2683
    "kwargs = {\"solver\": 'lbfgs', \"alpha\": 1e-5, \"hidden_layer_sizes\": (5, 2), \"random_state\": 1}"
2684
   ]
2685
  },
2686
  {
2687
   "cell_type": "markdown",
2688
   "metadata": {
2689
    "deletable": true,
2690
    "editable": true
2691
   },
2692
   "source": [
2693
    "<font style=\"font-weight:bold;color:brown\">Algorithm 5</font>: k-Nearest Neighbours"
2694
   ]
2695
  },
2696
  {
2697
   "cell_type": "code",
2698
   "execution_count": null,
2699
   "metadata": {
2700
    "collapsed": true,
2701
    "deletable": true,
2702
    "editable": true
2703
   },
2704
   "outputs": [],
2705
   "source": [
2706
    "method_name = \"knc\"\n",
2707
    "kwargs = {\"n_neighbors\": 5, \"weights\": 'distance', \"algorithm\": 'auto', \"leaf_size\": 30,\n",
2708
    "          \"p\": 2, \"metric\": 'minkowski', \"metric_params\": None, \"n_jobs\": -1}"
2709
   ]
2710
  },
2711
  {
2712
   "cell_type": "markdown",
2713
   "metadata": {
2714
    "deletable": true,
2715
    "editable": true
2716
   },
2717
   "source": [
2718
    "<font style=\"font-weight:bold;color:brown\">Algorithm 6</font>: Decision Tree"
2719
   ]
2720
  },
2721
  {
2722
   "cell_type": "code",
2723
   "execution_count": null,
2724
   "metadata": {
2725
    "collapsed": true,
2726
    "deletable": true,
2727
    "editable": true
2728
   },
2729
   "outputs": [],
2730
   "source": [
2731
    "method_name = \"dtc\"\n",
2732
    "kwargs = {\"criterion\": 'gini', \"splitter\": 'best', \"max_depth\": None, \"min_samples_split\": 30,\n",
2733
    "        \"min_samples_leaf\": 30, \"min_weight_fraction_leaf\": 0.0, \"max_features\": None,\n",
2734
    "        \"random_state\": None, \"max_leaf_nodes\": None, \"class_weight\": None, \"presort\": False}"
2735
   ]
2736
  },
2737
  {
2738
   "cell_type": "markdown",
2739
   "metadata": {
2740
    "deletable": true,
2741
    "editable": true
2742
   },
2743
   "source": [
2744
    "<font style=\"font-weight:bold;color:brown\">Algorithm 7</font>: Gradient Boosting Classifier"
2745
   ]
2746
  },
2747
  {
2748
   "cell_type": "code",
2749
   "execution_count": null,
2750
   "metadata": {
2751
    "collapsed": true,
2752
    "deletable": true,
2753
    "editable": true
2754
   },
2755
   "outputs": [],
2756
   "source": [
2757
    "method_name = \"gbc\"\n",
2758
    "kwargs = {\"loss\": 'deviance', \"learning_rate\": 0.1, \"n_estimators\": 100, \"subsample\": 1.0, \"min_samples_split\": 30,\n",
2759
    "        \"min_samples_leaf\": 30, \"min_weight_fraction_leaf\": 0.0, \"max_depth\": 3, \"init\": None, \"random_state\": None,\n",
2760
    "        \"max_features\": None, \"verbose\": 0, \"max_leaf_nodes\": None, \"warm_start\": False, \"presort\": 'auto'}"
2761
   ]
2762
  },
2763
  {
2764
   "cell_type": "markdown",
2765
   "metadata": {
2766
    "deletable": true,
2767
    "editable": true
2768
   },
2769
   "source": [
2770
    "<font style=\"font-weight:bold;color:brown\">Algorithm 8</font>: Naive Bayes<br/>\n",
2771
    "Note: features must be non-negative"
2772
   ]
2773
  },
2774
  {
2775
   "cell_type": "code",
2776
   "execution_count": null,
2777
   "metadata": {
2778
    "collapsed": false,
2779
    "deletable": true,
2780
    "editable": true,
2781
    "scrolled": true
2782
   },
2783
   "outputs": [],
2784
   "source": [
2785
    "method_name = \"nb\"\n",
2786
    "training_method = TrainingMethod(method_name)\n",
2787
    "kwargs = {\"alpha\": 1.0, \"fit_prior\": True, \"class_prior\": None}"
2788
   ]
2789
  },
2790
  {
2791
   "cell_type": "markdown",
2792
   "metadata": {
2793
    "deletable": true,
2794
    "editable": true
2795
   },
2796
   "source": [
2797
    "<br/><br/>"
2798
   ]
2799
  },
2800
  {
2801
   "cell_type": "markdown",
2802
   "metadata": {
2803
    "deletable": true,
2804
    "editable": true
2805
   },
2806
   "source": [
2807
    "#### 8.1.2. Other Settings"
2808
   ]
2809
  },
2810
  {
2811
   "cell_type": "markdown",
2812
   "metadata": {
2813
    "deletable": true,
2814
    "editable": true
2815
   },
2816
   "source": [
2817
    "<font style=\"font-weight:bold;color:red\">Configure:</font> other modelling settings"
2818
   ]
2819
  },
2820
  {
2821
   "cell_type": "code",
2822
   "execution_count": null,
2823
   "metadata": {
2824
    "collapsed": false,
2825
    "deletable": true,
2826
    "editable": true
2827
   },
2828
   "outputs": [],
2829
   "source": [
2830
    "# select the target variable\n",
2831
    "target_feature = \"label365\" # \"label30\" , \"label365\" \n",
2832
    "\n",
2833
    "# file name\n",
2834
    "file_name = \"Step_09_Model_\" + method_name + \"_\" + target_feature\n",
2835
    "\n",
2836
    "# initialise\n",
2837
    "training_method = TrainingMethod(method_name)"
2838
   ]
2839
  },
2840
  {
2841
   "cell_type": "markdown",
2842
   "metadata": {
2843
    "deletable": true,
2844
    "editable": true
2845
   },
2846
   "source": [
2847
    "#### 8.1.3. Features"
2848
   ]
2849
  },
2850
  {
2851
   "cell_type": "code",
2852
   "execution_count": null,
2853
   "metadata": {
2854
    "collapsed": true,
2855
    "deletable": true,
2856
    "editable": true
2857
   },
2858
   "outputs": [],
2859
   "source": [
2860
    "sample_train = features[\"train_indep\"][features_names_selected] # features[\"train_indep\"][features_names_selected], features[\"train_indep\"]\n",
2861
    "sample_train_target = features[\"train_target\"][target_feature] # features[\"train_target\"][target_feature]\n",
2862
    "sample_test = features[\"test_indep\"][features_names_selected] # features[\"test_indep\"][features_names_selected], features[\"test_indep\"]\n",
2863
    "sample_test_target = features[\"test_target\"][target_feature] # features[\"test_target\"][target_feature]"
2864
   ]
2865
  },
2866
  {
2867
   "cell_type": "markdown",
2868
   "metadata": {
2869
    "deletable": true,
2870
    "editable": true
2871
   },
2872
   "source": [
2873
    "### 8.3. Fit"
2874
   ]
2875
  },
2876
  {
2877
   "cell_type": "markdown",
2878
   "metadata": {
2879
    "deletable": true,
2880
    "editable": true
2881
   },
2882
   "source": [
2883
    "Fit the model, using the train sample"
2884
   ]
2885
  },
2886
  {
2887
   "cell_type": "code",
2888
   "execution_count": null,
2889
   "metadata": {
2890
    "collapsed": false,
2891
    "deletable": true,
2892
    "editable": true,
2893
    "scrolled": false
2894
   },
2895
   "outputs": [],
2896
   "source": [
2897
    "o_summaries = dict()\n",
2898
    "# Fit\n",
2899
    "model = training_method.train(sample_train, sample_train_target, **kwargs)\n",
2900
    "training_method.save_model(path=CONSTANTS.io_path, title=file_name)"
2901
   ]
2902
  },
2903
  {
2904
   "cell_type": "code",
2905
   "execution_count": null,
2906
   "metadata": {
2907
    "collapsed": true,
2908
    "deletable": true,
2909
    "editable": true
2910
   },
2911
   "outputs": [],
2912
   "source": [
2913
    "# load model\n",
2914
    "# training_method.load(path=CONSTANTS.io_path, title=file_name)"
2915
   ]
2916
  },
2917
  {
2918
   "cell_type": "code",
2919
   "execution_count": null,
2920
   "metadata": {
2921
    "collapsed": true,
2922
    "deletable": true,
2923
    "editable": true
2924
   },
2925
   "outputs": [],
2926
   "source": [
2927
    "# short summary\n",
2928
    "o_summaries = training_method.train_summaries()"
2929
   ]
2930
  },
2931
  {
2932
   "cell_type": "markdown",
2933
   "metadata": {
2934
    "deletable": true,
2935
    "editable": true
2936
   },
2937
   "source": [
2938
    "Predict & report performance, using the train sample"
2939
   ]
2940
  },
2941
  {
2942
   "cell_type": "code",
2943
   "execution_count": null,
2944
   "metadata": {
2945
    "collapsed": false,
2946
    "deletable": true,
2947
    "editable": true,
2948
    "scrolled": true
2949
   },
2950
   "outputs": [],
2951
   "source": [
2952
    "o_summaries = dict()\n",
2953
    "# predict\n",
2954
    "model = training_method.predict(sample_train, \"train\")"
2955
   ]
2956
  },
2957
  {
2958
   "cell_type": "code",
2959
   "execution_count": null,
2960
   "metadata": {
2961
    "collapsed": false,
2962
    "deletable": true,
2963
    "editable": true
2964
   },
2965
   "outputs": [],
2966
   "source": [
2967
    "# short summary\n",
2968
    "o_summaries = training_method.predict_summaries(pd.Series(sample_train_target), \"train\")\n",
2969
    "\n",
2970
    "# Print the main performance statistics\n",
2971
    "for k in o_summaries.keys():\n",
2972
    "    print(k,  o_summaries[k])\n",
2973
    "\n",
2974
    "# Print a selection of statistics by risk band\n",
2975
    "o_summaries = training_method.predict_summaries_risk_bands(pd.Series(sample_train_target), \"train\", np.arange(0, 1.05, 0.05))\n",
2976
    "display(o_summaries)"
2977
   ]
2978
  },
2979
  {
2980
   "cell_type": "markdown",
2981
   "metadata": {
2982
    "deletable": true,
2983
    "editable": true
2984
   },
2985
   "source": [
2986
    "### 8.4. Predict"
2987
   ]
2988
  },
2989
  {
2990
   "cell_type": "markdown",
2991
   "metadata": {
2992
    "deletable": true,
2993
    "editable": true
2994
   },
2995
   "source": [
2996
    "Predict & report performance, using the test sample"
2997
   ]
2998
  },
2999
  {
3000
   "cell_type": "code",
3001
   "execution_count": null,
3002
   "metadata": {
3003
    "collapsed": false,
3004
    "deletable": true,
3005
    "editable": true,
3006
    "scrolled": false
3007
   },
3008
   "outputs": [],
3009
   "source": [
3010
    "o_summaries = dict()\n",
3011
    "# predict\n",
3012
    "model = training_method.predict(sample_test, \"test\")"
3013
   ]
3014
  },
3015
  {
3016
   "cell_type": "code",
3017
   "execution_count": null,
3018
   "metadata": {
3019
    "collapsed": false,
3020
    "deletable": true,
3021
    "editable": true,
3022
    "scrolled": false
3023
   },
3024
   "outputs": [],
3025
   "source": [
3026
    "# short summary\n",
3027
    "o_summaries = training_method.predict_summaries(pd.Series(sample_test_target), \"test\")\n",
3028
    "\n",
3029
    "# Print the main performance statistics\n",
3030
    "for k in o_summaries.keys():\n",
3031
    "    print(k,  o_summaries[k])\n",
3032
    "\n",
3033
    "# Print a selection of statistics by risk band\n",
3034
    "o_summaries = training_method.predict_summaries_risk_bands(pd.Series(sample_test_target), \"test\", np.arange(0, 1.05, 0.05))\n",
3035
    "display(o_summaries)"
3036
   ]
3037
  },
3038
  {
3039
   "cell_type": "markdown",
3040
   "metadata": {
3041
    "collapsed": true,
3042
    "deletable": true,
3043
    "editable": true
3044
   },
3045
   "source": [
3046
    "### 8.5. Cross-Validation"
3047
   ]
3048
  },
3049
  {
3050
   "cell_type": "markdown",
3051
   "metadata": {
3052
    "deletable": true,
3053
    "editable": true
3054
   },
3055
   "source": [
3056
    "Perform k-fold cross-validation"
3057
   ]
3058
  },
3059
  {
3060
   "cell_type": "code",
3061
   "execution_count": null,
3062
   "metadata": {
3063
    "collapsed": false,
3064
    "deletable": true,
3065
    "editable": true
3066
   },
3067
   "outputs": [],
3068
   "source": [
3069
    "o_summaries = dict()\n",
3070
    "score = training_method.cross_validate(sample_test, sample_test_target, scoring=\"neg_mean_squared_error\", cv=10)"
3071
   ]
3072
  },
3073
  {
3074
   "cell_type": "code",
3075
   "execution_count": null,
3076
   "metadata": {
3077
    "collapsed": false,
3078
    "deletable": true,
3079
    "editable": true
3080
   },
3081
   "outputs": [],
3082
   "source": [
3083
    "# short summary\n",
3084
    "o_summaries = training_method.cross_validate_summaries()\n",
3085
    "print(\"Scores: \", o_summaries)"
3086
   ]
3087
  },
3088
  {
3089
   "cell_type": "markdown",
3090
   "metadata": {
3091
    "deletable": true,
3092
    "editable": true
3093
   },
3094
   "source": [
3095
    "### 8.6. Save"
3096
   ]
3097
  },
3098
  {
3099
   "cell_type": "markdown",
3100
   "metadata": {
3101
    "deletable": true,
3102
    "editable": true
3103
   },
3104
   "source": [
3105
    "Save the trained model."
3106
   ]
3107
  },
3108
  {
3109
   "cell_type": "code",
3110
   "execution_count": null,
3111
   "metadata": {
3112
    "collapsed": false,
3113
    "deletable": true,
3114
    "editable": true,
3115
    "scrolled": false
3116
   },
3117
   "outputs": [],
3118
   "source": [
3119
    "training_method.save_model(path=CONSTANTS.io_path, title=file_name)"
3120
   ]
3121
  },
3122
  {
3123
   "cell_type": "markdown",
3124
   "metadata": {
3125
    "deletable": true,
3126
    "editable": true
3127
   },
3128
   "source": [
3129
    "<br/><br/>"
3130
   ]
3131
  },
3132
  {
3133
   "cell_type": "markdown",
3134
   "metadata": {
3135
    "deletable": true,
3136
    "editable": true
3137
   },
3138
   "source": [
3139
    "Fin!"
3140
   ]
3141
  }
3142
 ],
3143
 "metadata": {
3144
  "kernelspec": {
3145
   "display_name": "Python 3",
3146
   "language": "python",
3147
   "name": "python3"
3148
  },
3149
  "language_info": {
3150
   "codemirror_mode": {
3151
    "name": "ipython",
3152
    "version": 3
3153
   },
3154
   "file_extension": ".py",
3155
   "mimetype": "text/x-python",
3156
   "name": "python",
3157
   "nbconvert_exporter": "python",
3158
   "pygments_lexer": "ipython3",
3159
   "version": "3.5.3"
3160
  }
3161
 },
3162
 "nbformat": 4,
3163
 "nbformat_minor": 1
3164
}