{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Description of this notebook\n",
    "\n",
    "The functions defined below help pre-process the raw data of the epilepsy problem and make it suitable for CNN models defined in this project to take as input. In this notebook, we split the data in training/dev/test sets. We perform data-augmentation of 1-D data using sliding windows to create many more examples for neural networks to be trained on."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from os import listdir, mkdir\n",
    "from os.path import isfile, join\n",
    "import h5py as h5\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `export_to` variable to point to the directory where the dataset after processing will be stored. Please only change the iteration number of this variable (for smooth operation) and not the prefix. Example: <br>\n",
    "\n",
    "`export_to = 'random-iter-5/'`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_to = 'random-iter-1/'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If this code shows an error, the directory probably already exists\n",
    "# In that case, increment the above export_to variable's value\n",
    "# Example previous export_to = 'random-iter-4'. Set to 'random-iter-5'\n",
    "\n",
    "mkdir(export_to)\n",
    "print(\"Directory created\")"
   ]
  },
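  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you prefer not to bump the iteration number by hand, the sketch below (an optional alternative, not part of the original workflow) creates the directory only when it does not already exist. Note that re-using an existing directory may overwrite previously exported files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional alternative (sketch): create the directory only if it is missing.\n",
    "# Re-using an existing directory may overwrite previously exported files.\n",
    "from os import makedirs\n",
    "\n",
    "makedirs(export_to, exist_ok=True)\n",
    "print('Directory ready: ' + export_to)"
   ]
  },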
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading data\n",
    "\n",
    "The sets in the dataset are divided into five categories. Follow the link to check the description: [Bonn Dataset](http://epileptologie-bonn.de/cms/front_content.php?idcat=193&lang=3)\n",
    "\n",
    "\n",
    "SET A:\tZ directory with\tZ000.txt - Z100.txt<br>\n",
    "SET B: \tO directory with\tO000.txt - O100.txt<br>\n",
    "SET C:\tN directory with\tN000.txt - N100.txt<br>\n",
    "SET D:\tF directory\twith\tF000.txt - F100.txt<br>\n",
    "SET E:\tS directory with\tS000.txt - S100.txt<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mapping from folder names (ZONFS) to set names (ABCDE)\n",
    "\n",
    "mapping_set_to_dir = {\n",
    "    'A': (0,'Z'),\n",
    "    'B': (1,'O'),\n",
    "    'C': (2,'N'),\n",
    "    'D': (3,'F'),\n",
    "    'E': (4,'S')\n",
    "}\n",
    "\n",
    "file_lists = []\n",
    "for s,d in mapping_set_to_dir.items():\n",
    "    file_lists.insert(d[0], [f for f in listdir(d[1]) if isfile(join(d[1], f))])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "file_lists[4]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read every file from each data folder and store in the dictionary raw_dataset\n",
    "# Please make sure there are no junk files present in the folder\n",
    "\n",
    "raw_dataset = { }\n",
    "\n",
    "for s,d in mapping_set_to_dir.items():\n",
    "    \n",
    "    for f in file_lists[d[0]]:\n",
    "        curr_example = np.loadtxt(join(d[1], f))\n",
    "        \n",
    "\n",
    "        if (s in raw_dataset):\n",
    "            raw_dataset[s] = np.append(raw_dataset[s], [curr_example], axis=0)\n",
    "        else:\n",
    "            raw_dataset[s] = np.array([curr_example])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_dataset['E'].shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing data\n",
    "\n",
    "Each training example in raw_dataset contains an EEG signal of 4097 data points.<br>\n",
    "The values represent the values of voltage levels observed by the electrodes.<br>\n",
    "\n",
    "Let's plot a random example and visualize the data that is given to us."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = 'E'\n",
    "index = 46\n",
    "\n",
    "plot_title = ''\n",
    "if (s in ['A', 'B']):\n",
    "    plot_title = 'Sample signal from non-epileptic person'\n",
    "elif (s in ['C', 'D']):\n",
    "    plot_title = 'Sample signal from epileptic person (non-ictal / non-seizure)'\n",
    "elif (s == 'E'):\n",
    "    plot_title = 'Sample signal from epileptic person (ictal / seizure)'\n",
    "else:\n",
    "    print('s can be one of A,B,C,D,E')\n",
    "\n",
    "plt.figure(figsize=(8,4))\n",
    "plt.plot(raw_dataset[s][index], linewidth=0.7)\n",
    "plt.title(plot_title)\n",
    "plt.xlabel('Seconds')\n",
    "plt.ylabel('EEG Signal (\\u00B5V)')\n",
    "tick_points = np.arange(0,9)*500 # each signal contains 4097 data points\n",
    "plt.xticks(tick_points, np.round((tick_points/173.16)*10)/10) # mapping n-th data point to the time in seconds from the start of signal (173.16 Hz)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data augmentation\n",
    "\n",
    "Since neural networks are hungry for data to perform adequately, and the available data has only 500 examples of data (100 per set). A good solution to this problem would be to split the data into smaller signals. In this manner, we'll have lots of examples, albeit small.\n",
    "\n",
    "The below procedure uses sliding window with specified `window_size` and `stride` and returns a new dataset."
   ]
  },
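  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the window count: each raw signal has 4097 data points, so a window of 512 with a stride of 64 yields floor((4097 - 512) / 64) + 1 = 57 windows per signal, i.e. 5700 examples per set of 100 signals."
   ]
  },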
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_augmented_dataset(raw_dataset, window_size = 512, stride = 64, verbose=False):\n",
    "    \"\"\"\n",
    "    This function augments the dataset using sliding windows\n",
    "    \n",
    "    Parameters:\n",
    "    raw_dataset: Dictionary containing a batch of examples for one set in each key\n",
    "    window_size: Sliding window size\n",
    "    stride: Steps by which sliding window moves and samples\n",
    "    verbose: Display progress\n",
    "    \n",
    "    Returns:\n",
    "    augmented_dataset: Dictionary containing a batch of smaller augmented examples than raw_dataset\n",
    "    \n",
    "    \n",
    "    \"\"\"\n",
    "    \n",
    "    augmented_dataset = { }\n",
    "\n",
    "    for s,Xset_raw in raw_dataset.items():\n",
    "        \n",
    "        if (verbose):\n",
    "            print('Processing set ' + s)\n",
    "\n",
    "        total_points = Xset_raw.shape[1]\n",
    "\n",
    "        # no. of examples generated from single training example using sliding window\n",
    "        # = floor((total_points - window_size) / stride) + 1\n",
    "        iterations = ((total_points - window_size) // stride) + 1\n",
    "\n",
    "        for x_raw in Xset_raw:\n",
    "\n",
    "            for i in range(iterations):\n",
    "                window_slice_from = i*stride\n",
    "                window_slice_to = i*stride + window_size\n",
    "\n",
    "                if (s in augmented_dataset):\n",
    "                    augmented_dataset[s] = np.append(augmented_dataset[s], [x_raw[window_slice_from:window_slice_to]], axis=0)\n",
    "                else:\n",
    "                    augmented_dataset[s] = np.array([x_raw[window_slice_from:window_slice_to]])\n",
    "\n",
    "    \n",
    "    if (verbose):\n",
    "        print('Done.')\n",
    "    return augmented_dataset\n"
   ]
  },
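  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The repeated `np.append` calls above copy the whole array on every iteration, which is part of why augmentation takes a few minutes. Below is a sketch of an equivalent variant (an optional alternative, assuming the same inputs) that collects the windows in a Python list and stacks them once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: same output as create_augmented_dataset, but collects windows in a\n",
    "# list and stacks them once instead of calling np.append repeatedly.\n",
    "def create_augmented_dataset_fast(raw_dataset, window_size=512, stride=64):\n",
    "    augmented_dataset = { }\n",
    "    for s, Xset_raw in raw_dataset.items():\n",
    "        total_points = Xset_raw.shape[1]\n",
    "        iterations = ((total_points - window_size) // stride) + 1\n",
    "        windows = []\n",
    "        for x_raw in Xset_raw:\n",
    "            for i in range(iterations):\n",
    "                windows.append(x_raw[i*stride : i*stride + window_size])\n",
    "        augmented_dataset[s] = np.stack(windows)\n",
    "    return augmented_dataset"
   ]
  },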
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting of the data\n",
    "\n",
    "After randomly shuffling of the dataset, we take three sets, train (90%), dev (5%), and test (5%)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_dataset_train = { }\n",
    "raw_dataset_dev = { }\n",
    "raw_dataset_test = { }\n",
    "\n",
    "for s,Xset_raw in raw_dataset.items():\n",
    "    \n",
    "    # randomly shuffle the data, just to make sure\n",
    "    # that all train/dev/test sets come from same\n",
    "    # distribution, (possibly not from a same person)\n",
    "    np.random.shuffle(raw_dataset[s])\n",
    "    \n",
    "    ninety_percent = np.floor(0.9 * raw_dataset[s].shape[0]).astype(int)\n",
    "    five_percent = np.floor(0.05 * raw_dataset[s].shape[0]).astype(int)\n",
    "    \n",
    "    # train set 0 - 89 (90%)\n",
    "    raw_dataset_train[s] = raw_dataset[s][0:ninety_percent,:]\n",
    "    \n",
    "    # dev set 90 - 94 (5%)\n",
    "    raw_dataset_dev[s] = raw_dataset[s][ninety_percent:ninety_percent+five_percent,:]\n",
    "    \n",
    "    # test set 95 - 99 (5%)\n",
    "    raw_dataset_test[s] = raw_dataset[s][ninety_percent+five_percent:,:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(raw_dataset_train['D'].shape)\n",
    "print(raw_dataset_dev['A'].shape)\n",
    "print(raw_dataset_test['C'].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Scheme - 1\n",
    "\n",
    "Let's try splitting the data with the window size of 512 and stride of 64. As done in the Scheme-1 of the paper \"An automated system for epilepsy detection using EEG brain signals based on deep learning approach\" by Ihsan Ullah et. al. This method creates 28,500 examples from 500 examples which amounts to 5700 per set (5130 for training)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Augmenting training data\")\n",
    "aug_dataset512_train = create_augmented_dataset(raw_dataset_train, window_size=512, stride=64, verbose=True)\n",
    "\n",
    "print(\"Augmenting dev data\")\n",
    "aug_dataset512_dev = create_augmented_dataset(raw_dataset_dev, window_size=512, stride=256, verbose=True)\n",
    "\n",
    "print(\"Augmenting test data\")\n",
    "aug_dataset512_test = create_augmented_dataset(raw_dataset_test, window_size=512, stride=256, verbose=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Scheme - 2\n",
    "\n",
    "Also, try the window size of 1024 with the stride of 128. Creates 12,500 examples which amount to 2500 per set (11250 for training)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Augmenting training data')\n",
    "aug_dataset1024_train = create_augmented_dataset(raw_dataset_train, window_size=1024, stride=128, verbose=True)\n",
    "\n",
    "print ('Augmenting dev data')\n",
    "aug_dataset1024_dev = create_augmented_dataset(raw_dataset_dev, window_size=1024, stride=512, verbose=True)\n",
    "\n",
    "print('Augmenting test data')\n",
    "aug_dataset1024_test = create_augmented_dataset(raw_dataset_test, window_size=1024, stride=512, verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(aug_dataset512_train['E'].shape)\n",
    "print(aug_dataset512_dev['C'].shape)\n",
    "print(aug_dataset512_test['D'].shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(aug_dataset1024_train['E'].shape)\n",
    "print(aug_dataset1024_dev['C'].shape)\n",
    "print(aug_dataset1024_test['D'].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving the data\n",
    "\n",
    "Calling the `create_augmented_dataset` for the above parameters takes few minutes each time. To speed up learning these augmented datasets are saved to the disk so that they can be loaded next time feasibly. We use a popular library called `h5py` for this purpose.\n",
    "\n",
    "<blockquote><b>Note:</b> The test set is augmented twice (once down below in this notebook). One for measuring accuracy and another for measuring accuracy with voting (as described in the literature of this project).</blockquote>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'aug_dataset512.h5', 'w') as aug_file512:\n",
    "    \n",
    "    train512 = aug_file512.create_group('train')\n",
    "    dev512 = aug_file512.create_group('dev')\n",
    "    test512 = aug_file512.create_group('test')\n",
    "\n",
    "    for s in aug_dataset512_train.keys():\n",
    "        \n",
    "        train512.create_dataset(s, data=aug_dataset512_train[s])\n",
    "        dev512.create_dataset(s, data=aug_dataset512_dev[s])\n",
    "        test512.create_dataset(s, data=aug_dataset512_test[s])\n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'aug_dataset1024.h5', 'w') as aug_file1024:\n",
    "    \n",
    "    train1024 = aug_file1024.create_group('train')\n",
    "    dev1024 = aug_file1024.create_group('dev')\n",
    "    test1024 = aug_file1024.create_group('test')\n",
    "\n",
    "    for s in aug_dataset1024_train.keys():\n",
    "        \n",
    "        train1024.create_dataset(s, data=aug_dataset1024_train[s])\n",
    "        dev1024.create_dataset(s, data=aug_dataset1024_dev[s])\n",
    "        test1024.create_dataset(s, data=aug_dataset1024_test[s])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aug_file512 = h5.File(export_to + 'aug_dataset512.h5', 'r')\n",
    "print(aug_file512['train']['A'].shape)\n",
    "print(aug_file512['dev']['C'].shape)\n",
    "print(aug_file512['test']['D'].shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aug_file1024 = h5.File(export_to + 'aug_dataset1024.h5', 'r')\n",
    "print(aug_file1024['train']['A'].shape)\n",
    "print(aug_file1024['dev']['C'].shape)\n",
    "print(aug_file1024['test']['D'].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualizing augmented examples\n",
    "\n",
    "Change the class label `s` = {A, B, C, D, E} and `index` to plot different examples in the augmented dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = 'B'\n",
    "index = 400\n",
    "plt.plot(aug_file512['train'][s][index])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = 'E'\n",
    "index = 175\n",
    "plt.plot(aug_file1024['train'][s][index])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aug_file512.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "aug_file1024.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## One-hot labels\n",
    "\n",
    "Let's convert the dataset which is in dictionary format, to a single giant training set which contains all classes and a label set represented in one-hot format.\n",
    "\n",
    "Tensorflow provides built in support for converting the a vector into one hot matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For 3 class classification\n",
    "class_map = { 'A': 0, 'B': 0, 'C': 1, 'D': 1, 'E': 2 }\n",
    "\n",
    "def read_dataset_and_one_hot_labels(filepath, setname, verbose=False):\n",
    "    \"\"\"\n",
    "    Converts the dataset's labels into one-hot encoding\n",
    "    \n",
    "    Parameters:\n",
    "    filepath: augmented dataset without one-hot encoding\n",
    "    setname: either 'train', 'dev', or 'test'\n",
    "    verbose: Display progress\n",
    "    \n",
    "    Returns:\n",
    "    data: dataset with keys X_<setname>, Y_<setname>_classname, and Y_<setname> (one-hot labels). \n",
    "    \"\"\"\n",
    "    \n",
    "    data = { }\n",
    "\n",
    "    # Read dataset that is already stored in the file\n",
    "    with h5.File(filepath, 'r') as aug_file:\n",
    "\n",
    "        # classes are AB, CD, and E\n",
    "        no_of_classes = 3\n",
    "\n",
    "        for s in aug_file[setname].keys():\n",
    "            \n",
    "            if (verbose):\n",
    "                print(\"Processing class: \" + s)\n",
    "            \n",
    "            class_length = len(aug_file[setname][s])\n",
    "            \n",
    "            # go through each example in the class\n",
    "            for i in range(class_length):\n",
    "                \n",
    "                if (verbose and i % (class_length / 5) == 0):\n",
    "                    print((str)((int)(i / class_length * 100)) + '%')\n",
    "                \n",
    "                # create a big (combination of all classes) dataset\n",
    "                # and also set their labels in a separate array\n",
    "                if (('X_' + setname) in data):\n",
    "                    data['X_' + setname] = np.append(data['X_' + setname], [aug_file[setname][s][i]], axis=0)\n",
    "                    data['Y_' + setname + '_classname'] = np.append(data['Y_' + setname + '_classname'], [class_map[s]], axis=0)\n",
    "                else:\n",
    "                    data['X_' + setname] = np.array([aug_file[setname][s][i]])\n",
    "                    data['Y_' + setname + '_classname'] = np.array([class_map[s]])\n",
    "\n",
    "\n",
    "        \n",
    "        if (verbose):\n",
    "            print(\"Converting to one_hot\")\n",
    "            \n",
    "        # use tensorflow one_hot function to convert labels into one_hot values\n",
    "        tf.reset_default_graph()\n",
    "        \n",
    "        init = tf.global_variables_initializer()\n",
    "        \n",
    "        with tf.Session() as sess:\n",
    "            sess.run(init)\n",
    "            data['Y_' + setname] = sess.run(tf.one_hot(data['Y_' + setname + '_classname'], depth=no_of_classes, axis=-1))\n",
    "        \n",
    "        if (verbose):\n",
    "            print(\"Done.\")\n",
    "\n",
    "    return data"
   ]
  },
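  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an aside, the one-hot conversion itself does not require a TensorFlow session. A minimal NumPy sketch (an optional alternative, assuming `labels` is a 1-D array of integer class indices) would be:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: one-hot encoding with plain NumPy.\n",
    "# `labels` is assumed to be a 1-D array of integer class indices (0, 1, 2).\n",
    "def one_hot_numpy(labels, no_of_classes=3):\n",
    "    return np.eye(no_of_classes)[np.asarray(labels, dtype=int)]\n",
    "\n",
    "one_hot_numpy([0, 2, 1, 2])"
   ]
  },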
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Converting training set to one hot\")\n",
    "data512_train = read_dataset_and_one_hot_labels(export_to + 'aug_dataset512.h5', 'train', verbose=True)\n",
    "print(\"Converting dev set to one hot\")\n",
    "data512_dev = read_dataset_and_one_hot_labels(export_to + 'aug_dataset512.h5', 'dev', verbose=True)\n",
    "print(\"Converting test set to one hot\")\n",
    "data512_test = read_dataset_and_one_hot_labels(export_to + 'aug_dataset512.h5', 'test', verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'datafile512.h5', 'w') as datafile512:\n",
    "\n",
    "    datafile512.create_dataset('X_train', data=data512_train['X_train'])\n",
    "    datafile512.create_dataset('Y_train_classname', data=data512_train['Y_train_classname'])\n",
    "    datafile512.create_dataset('Y_train', data=data512_train['Y_train'])\n",
    "    \n",
    "    datafile512.create_dataset('X_dev', data=data512_dev['X_dev'])\n",
    "    datafile512.create_dataset('Y_dev_classname', data=data512_dev['Y_dev_classname'])\n",
    "    datafile512.create_dataset('Y_dev', data=data512_dev['Y_dev'])\n",
    "    \n",
    "    datafile512.create_dataset('X_test', data=data512_test['X_test'])\n",
    "    datafile512.create_dataset('Y_test_classname', data=data512_test['Y_test_classname'])\n",
    "    datafile512.create_dataset('Y_test', data=data512_test['Y_test'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Converting training set to one hot\")\n",
    "data1024_train = read_dataset_and_one_hot_labels(export_to + 'aug_dataset1024.h5', 'train', verbose=True)\n",
    "print(\"Converting dev set to one hot\")\n",
    "data1024_dev = read_dataset_and_one_hot_labels(export_to + 'aug_dataset1024.h5', 'dev', verbose=True)\n",
    "print(\"Converting test set to one hot\")\n",
    "data1024_test = read_dataset_and_one_hot_labels(export_to + 'aug_dataset1024.h5', 'test', verbose=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'datafile1024.h5', 'w') as datafile1024:\n",
    "\n",
    "    datafile1024.create_dataset('X_train', data=data1024_train['X_train'])\n",
    "    datafile1024.create_dataset('Y_train_classname', data=data1024_train['Y_train_classname'])\n",
    "    datafile1024.create_dataset('Y_train', data=data1024_train['Y_train'])\n",
    "    \n",
    "    datafile1024.create_dataset('X_dev', data=data1024_dev['X_dev'])\n",
    "    datafile1024.create_dataset('Y_dev_classname', data=data1024_dev['Y_dev_classname'])\n",
    "    datafile1024.create_dataset('Y_dev', data=data1024_dev['Y_dev'])\n",
    "    \n",
    "    datafile1024.create_dataset('X_test', data=data1024_test['X_test'])\n",
    "    datafile1024.create_dataset('Y_test_classname', data=data1024_test['Y_test_classname'])\n",
    "    datafile1024.create_dataset('Y_test', data=data1024_test['Y_test'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test set for voting accuracy\n",
    "\n",
    "We'll augment the test set second time. This time we'll store the dataset a bit differently from previous methods. We'll have multiple input data for each example with the same label. During test time, we'll feed all those inputs to the neural network, which will predict the output classes for each of the inputs. Then the class name in majority is chosen as the output."
   ]
  },
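  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a minimal sketch of the voting step is shown below. It assumes `slice_probs` holds a model's per-slice softmax outputs of shape (no. of slices, no. of classes) for one test example; the model itself is defined in the other notebooks of this project."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: majority vote over per-slice predictions for one test example.\n",
    "# `slice_probs` is assumed to have shape (no. of slices, no. of classes).\n",
    "def majority_vote(slice_probs):\n",
    "    slice_preds = np.argmax(slice_probs, axis=-1)                        # predicted class per slice\n",
    "    counts = np.bincount(slice_preds, minlength=slice_probs.shape[-1])   # votes per class\n",
    "    return np.argmax(counts)                                             # most frequent class wins\n",
    "\n",
    "majority_vote(np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.1, 0.7]]))"
   ]
  },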
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def augment_test_example(ex, window_size=512, stride=64):\n",
    "    \"\"\"\n",
    "    Augments one dataset example with sliding window approach\n",
    "    \n",
    "    Used with test dataset to create multiple (typically 3) slices from one example\n",
    "    The majority of the predictions on those 3 slices will be considered\n",
    "    \n",
    "    Parameters:\n",
    "    ex: one example from dataset\n",
    "    window_size: sliding window size\n",
    "    stride: sliding window stride\n",
    "    \n",
    "    Returns:\n",
    "    aug_ex: numpy array containing multiple smaller (augmented) examples\n",
    "    \"\"\"\n",
    "    \n",
    "    total_points = ex.shape[0]\n",
    "    aug_ex = None\n",
    "    \n",
    "    # no. of examples generated from single training example using sliding window\n",
    "    # = floor((total_points - window_size) / stride) + 1\n",
    "    iterations = ((total_points - window_size) // stride) + 1\n",
    "\n",
    "    for i in range(iterations):\n",
    "        \n",
    "        # window slice\n",
    "        window_slice_from = i*stride\n",
    "        window_slice_to = i*stride + window_size\n",
    "\n",
    "        try:\n",
    "            aug_ex = np.append(aug_ex, [ex[window_slice_from:window_slice_to]], axis=0)\n",
    "        except:\n",
    "            aug_ex = np.array([ex[window_slice_from:window_slice_to]])\n",
    "    \n",
    "    return aug_ex\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class_map = { 'A': 0, 'B': 0, 'C': 1, 'D': 1, 'E': 2 }\n",
    "\n",
    "def create_test_set_for_voting(testset, window_size=512, stride=64, divisions=2):\n",
    "    \"\"\"\n",
    "    Augments whole test set in a way where each example contains multiple smaller slices\n",
    "    \n",
    "    The augmented dataset created by this function will be used by accuracy for voting measure\n",
    "    Each example is divided into several divisions\n",
    "    Each division consists of multiple slices (smaller examples)\n",
    "    \n",
    "    Parameters:\n",
    "    testset: dictionary containing test examples for each class\n",
    "    window_size: sliding window size\n",
    "    stride: sliding window stride\n",
    "    divisions: number of different examples to create from one example\n",
    "    \n",
    "    Returns:\n",
    "    (new_test_set_X, new_test_set_Y, new_test_set_Y_onehot): numpy array of slices, labels, one-hot labels\n",
    "    \"\"\"\n",
    "    \n",
    "    new_test_set_X = None\n",
    "    new_test_set_Y = []\n",
    "    new_test_set_Y_onehot = None\n",
    "    \n",
    "    no_of_classes = 3\n",
    "    identity = np.eye(no_of_classes)\n",
    "    \n",
    "    # loop over classes\n",
    "    for c,class_data in testset.items():\n",
    "        \n",
    "        # loop over training examples\n",
    "        for ex in class_data:\n",
    "            \n",
    "            for d in range(divisions):\n",
    "                \n",
    "                division_size = ex.shape[0] // divisions\n",
    "            \n",
    "                aug_ex = augment_test_example(ex[d*division_size:(d+1)*division_size], window_size=window_size, stride=stride)\n",
    "                one_hot = identity[class_map[c]]\n",
    "\n",
    "                try:\n",
    "                    new_test_set_X = np.append(new_test_set_X, [aug_ex], axis=0)\n",
    "                    new_test_set_Y_onehot = np.append(new_test_set_Y_onehot, [one_hot], axis=0)\n",
    "                except:\n",
    "                    new_test_set_X = np.array([aug_ex])\n",
    "                    new_test_set_Y_onehot = np.array([one_hot])\n",
    "                new_test_set_Y = np.append(new_test_set_Y, class_map[c])\n",
    "        \n",
    "    return new_test_set_X, new_test_set_Y, new_test_set_Y_onehot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_X_v_512, test_Y_v_512, test_Y_oh_v_512 = create_test_set_for_voting(raw_dataset_test, window_size=512, stride=256, divisions=4)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_X_v_1024, test_Y_v_1024, test_Y_oh_v_1024 = create_test_set_for_voting(raw_dataset_test, window_size=1024, stride=512, divisions=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shape of the augmented test set for accuracy with voting is such that the dimensions represent:\n",
    "\n",
    "For X<br>\n",
    "1 - No. of test examples<br>\n",
    "2 - No. of slices for each example<br>\n",
    "3 - No. of features<br>\n",
    "<br>\n",
    "For Y_onehot<br>\n",
    "1 - No. of test examples<br>\n",
    "2 - No. of classes<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(test_X_v_512.shape)\n",
    "print(test_X_v_1024.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Save the augmented test set (for voting) in a new file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'testset_voting_512.h5', 'w') as testset_voting_512:\n",
    "    \n",
    "    testset_voting_512.create_dataset('X', data=test_X_v_512)\n",
    "    testset_voting_512.create_dataset('Y', data=test_Y_v_512)\n",
    "    testset_voting_512.create_dataset('Y_onehot', data=test_Y_oh_v_512)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with h5.File(export_to + 'testset_voting_1024.h5', 'w') as testset_voting_1024:\n",
    "    \n",
    "    testset_voting_1024.create_dataset('X', data=test_X_v_1024)\n",
    "    testset_voting_1024.create_dataset('Y', data=test_Y_v_1024)\n",
    "    testset_voting_1024.create_dataset('Y_onehot', data=test_Y_oh_v_1024)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}