src/examples/sklearn_example_with_kmers.ipynb


{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# K-mer based TF binding prediction using sklearn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial illustrates how sklearn can be used with Janggu datasets.\n",
    "\n",
    "In particular, we shall use the same toy sequences as in the other tutorial. But this time we fit a few sklearn models to descriminate transcription factor binding events based on k-mer frequencies."
   ]
  },
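  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the idea of a k-mer frequency representation, here is a minimal sketch in plain Python (added for illustration, not part of janggu) that counts the 2-mers of a short toy DNA string. Janggu builds a roughly analogous representation for every region of interest via its higher-order sequence encoding."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# count all overlapping k-mers of length k in a DNA string\n",
    "def kmer_counts(sequence, k):\n",
    "    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))\n",
    "\n",
    "# toy example: 2-mer counts of a short sequence\n",
    "kmer_counts('ACGTACGTAA', 2)"
   ]
  },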
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/wkopp/miniconda3/envs/jdev/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.\n",
      "  from ._conv import register_converters as _register_converters\n",
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "from pkg_resources import resource_filename\n",
    "\n",
    "from janggu.data import Bioseq\n",
    "from janggu.data import Cover\n",
    "from janggu.data import ReduceDim\n",
    "from janggu.data import SqueezeDim\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "from IPython.display import Image\n",
    "\n",
    "np.random.seed(1234)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ['JANGGU_OUTPUT'] = '/home/wkopp/janggu_examples'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You might want to play around with the orders, which corresponds to the k-mer length. For small choices e.g. k=1 or 2\n",
    "the performance should end up to be lower than for longer kmers e.g. k=5.\n",
    "However, be aware that too large numbers of k will amount to large memory usage. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "order = 5"
   ]
  },
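  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The width of the k-mer count matrix grows as 4**order, since there are four possible letters at each of the k positions. The small sketch below (plain arithmetic, added for illustration) shows how quickly the number of possible k-mers, and hence the memory footprint, grows with k."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# number of distinct DNA k-mers (feature columns) for increasing k\n",
    "for k in range(1, 9):\n",
    "    print(k, 4 ** k)"
   ]
  },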
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the dataset\n",
    "# The pseudo genome represents just a concatenation of all sequences\n",
    "# in sample.fa and sample2.fa. Therefore, the results should be almost\n",
    "# identically to the models obtained from classify_fasta.py.\n",
    "REFGENOME = resource_filename('janggu', 'resources/pseudo_genome.fa')\n",
    "# ROI contains regions spanning positive and negative examples\n",
    "ROI_TRAIN_FILE = resource_filename('janggu', 'resources/roi_train.bed')\n",
    "ROI_TEST_FILE = resource_filename('janggu', 'resources/roi_test.bed')\n",
    "# PEAK_FILE only contains positive examples\n",
    "PEAK_FILE = resource_filename('janggu', 'resources/scores.bed')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we obtain the k-mer representation for each sequences as well as the labels.\n",
    "\n",
    "Note: the way sklearn utilizes the data requires to wrap the datasets via SqueezeDim, which is not necessary with keras."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "loading from lazy loader\n",
      "reload /home/wkopp/janggu_examples/datasets/dna/aded3c22c15e2ececdbed2fdf4e19b145ab947eda546abd668b8ae2e76f787e5.npz\n",
      "loading from bed lazy loader\n",
      "reload /home/wkopp/janggu_examples/datasets/peaks/fd9826cf7fb9cc044a6c1354a14688c1be0f0bd9c593fdba2e9af3284a2be099.npz\n",
      "loading from lazy loader\n",
      "loading from bed lazy loader\n"
     ]
    }
   ],
   "source": [
    "# Training input and labels are purely defined genomic coordinates\n",
    "DNA = SqueezeDim(ReduceDim(Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,\n",
    "                                   roi=ROI_TRAIN_FILE,\n",
    "                                   binsize=200,\n",
    "                                   order=order,\n",
    "                                   cache=True)))\n",
    "\n",
    "LABELS = SqueezeDim(ReduceDim(Cover.create_from_bed('peaks', roi=ROI_TRAIN_FILE,\n",
    "                               bedfiles=PEAK_FILE,\n",
    "                               binsize=200,\n",
    "                               resolution=200,\n",
    "                               cache=True,\n",
    "                               storage='sparse')))\n",
    "\n",
    "\n",
    "DNA_TEST = SqueezeDim(ReduceDim(Bioseq.create_from_refgenome('dna', refgenome=REFGENOME,\n",
    "                                        roi=ROI_TEST_FILE,\n",
    "                                        binsize=200,\n",
    "                                        order=order)))\n",
    "\n",
    "LABELS_TEST = SqueezeDim(ReduceDim(Cover.create_from_bed('peaks',\n",
    "                                    bedfiles=PEAK_FILE,\n",
    "                                    roi=ROI_TEST_FILE,\n",
    "                                    binsize=200,\n",
    "                                    resolution=200,\n",
    "                                    storage='sparse')))\n"
   ]
  },
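  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a side note, the following numpy sketch (hypothetical shapes, not janggu code) mimics what ReduceDim and SqueezeDim conceptually do here: the positional dimension of the higher-order one-hot encoding is aggregated into per-region k-mer counts (assuming a sum aggregation, consistent with the k-mer count interpretation below), and the remaining singleton axis is dropped so that sklearn receives a plain 2D samples-by-features matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# hypothetical tensor: (samples, positions, 1, 4**order) with order=2\n",
    "toy = np.random.randint(0, 2, size=(3, 10, 1, 16))\n",
    "\n",
    "reduced = toy.sum(axis=1)           # aggregate over positions -> (3, 1, 16)\n",
    "squeezed = reduced.squeeze(axis=1)  # drop the singleton axis  -> (3, 16)\n",
    "print(toy.shape, reduced.shape, squeezed.shape)"
   ]
  },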
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "k-mer count matrix used for training is shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(7797, 1024)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DNA.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "we compare the performances for logistic regression, support vector machines and random forest."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "logreg = LogisticRegression()\n",
    "logreg.fit(DNA, LABELS)\n",
    "logregpred = logreg.predict_proba(DNA_TEST)[:,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "svc = SVC(probability=True)\n",
    "svc.fit(DNA, LABELS)\n",
    "svcpred = svc.predict_proba(DNA_TEST)[:,1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf = RandomForestClassifier()\n",
    "rf.fit(DNA, LABELS)\n",
    "rfpred = rf.predict_proba(DNA_TEST)[:,1]"
   ]
  },
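  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional visual comparison (added here as a sketch; it assumes matplotlib is installed), the cell below plots ROC curves for the three classifiers using sklearn.metrics.roc_curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from sklearn.metrics import roc_curve\n",
    "\n",
    "plt.figure(figsize=(5, 5))\n",
    "for name, pred in [('LogReg', logregpred), ('SVM', svcpred), ('RF', rfpred)]:\n",
    "    fpr, tpr, _ = roc_curve(LABELS_TEST[:], pred)\n",
    "    plt.plot(fpr, tpr, label=name)\n",
    "plt.plot([0, 1], [0, 1], linestyle='--', color='grey', label='random')\n",
    "plt.xlabel('False positive rate')\n",
    "plt.ylabel('True positive rate')\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },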
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Support vector machines outperforms logistic regression and random forest in this example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AUC_LogReg:  0.9055\n",
      "AUC_SVM:  0.9357000000000001\n",
      "AUC_RF:  0.8668\n"
     ]
    }
   ],
   "source": [
    "print(\"AUC_LogReg: \", roc_auc_score(LABELS_TEST[:], logregpred))\n",
    "print(\"AUC_SVM: \", roc_auc_score(LABELS_TEST[:], svcpred))\n",
    "print(\"AUC_RF: \", roc_auc_score(LABELS_TEST[:], rfpred))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}