{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "CYBDV 0.9.0.ipynb",
      "provenance": [],
      "collapsed_sections": [
        "zB5_KtSjEXNM",
        "2SYKnRZkf0Mw",
        "yJg0WYreWUdn",
        "bwX6WgFtxrzK",
        "xHP_ixfovmxV",
        "uRWlMbfByLZW",
        "b_h6smGdyaeh",
        "jliXZz0ye0aP",
        "irKl-FbAfFiF",
        "0Iz89uuRgsuC"
      ]
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hCMyRBybp1Jc",
        "colab_type": "text"
      },
      "source": [
        "\n",
        "# Can you beat DeepVariant?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "VyiSHJ_TN4lp",
        "colab": {}
      },
      "source": [
        "# @markdown Copyright 2020 Google LLC. \\\n",
        "# @markdown SPDX-License-Identifier: Apache-2.0\n",
        "# @markdown (license hidden in Colab)\n",
        "\n",
        "# Copyright 2020 Google LLC\n",
        "\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "\n",
        "#     https://www.apache.org/licenses/LICENSE-2.0\n",
        "\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VXCUNpFYwPkT",
        "colab_type": "text"
      },
      "source": [
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://colab.research.google.com/github/google/deepvariant/blob/r1.6.1/docs/cybdv_notebook.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
        "  </td>\n",
        "  <td>\n",
        "    <a target=\"_blank\" href=\"https://github.com/google/deepvariant/blob/r1.6.1/docs/cybdv_notebook.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View source on GitHub</a>\n",
        "  </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-ZF4GR4A_L1c",
        "colab_type": "text"
      },
      "source": [
        "DeepVariant turns genomic data into a pileup image, and then uses a convolutional neural network (CNN) to classify these images. This notebook lets you try your hand at classifying these pileups, just like DeepVariant does.\n",
        "\n",
        "\n",
        "DeepVariant gets over 99.8% of variants correct, so we chose the variants that were hardest to make this more interesting. The \"difficult\" variants were chosen because either DeepVariant was not very confident in its choice (lower than 90% likelihood of its top choice) or it made the wrong choice. We also selected 80 of the easy variants to start off with.\n",
        "\n",
        "\n",
        "DeepVariant has not been trained on these variants. They are from the HG002 test set, labeled with \"truth\" from GIAB."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_iDLGr_7DO-e",
        "colab_type": "text"
      },
      "source": [
        "To play, go to `Runtime -> Run all`, then wait a bit until the game starts running (under \"Play the game: EASY\")."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cc05FvB9kNEf",
        "colab_type": "text"
      },
      "source": [
        "The Preparation section below contains setup code that is hidden for clarity. Expand it if you are curious about the details, or just skip down to \"Play the game\".\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zB5_KtSjEXNM",
        "colab_type": "text"
      },
      "source": [
        "# Preparation"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2SYKnRZkf0Mw",
        "colab_type": "text"
      },
      "source": [
        "## Installs and imports"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "IEBMi2lUqYk8",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "%tensorflow_version 2.x"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0NZiS5Hzw0VX",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import tensorflow\n",
        "print(tensorflow.__version__)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6BKlhfOsDBi_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "%%capture\n",
        "# Don't worry if you see a `Failed building wheel` error, nucleus will still be installed correctly.\n",
        "! pip install \"google-nucleus\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MJ0NcB4f_L1d",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "import os\n",
        "import time\n",
        "import tensorflow as tf\n",
        "from tensorflow.core.example import example_pb2\n",
        "from nucleus.protos import variants_pb2\n",
        "from nucleus.util import vis"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yJg0WYreWUdn",
        "colab_type": "text"
      },
      "source": [
        "## Download file with consolidated locus information\n",
        "This is a combination of the make_examples output (containing the pileup tensors), the call_variants output (containing the CNN's output likelihoods to give us DeepVariant's answer), and GIAB ground truth labels."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bwX6WgFtxrzK",
        "colab_type": "text"
      },
      "source": [
        "#### Download data"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "UpWmNmQO7MQs",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "EASY_INPUT='gs://deepvariant/cybdv/cybdv_0.9.0_easy.tfrecord.gz'\n",
        "DIFFICULT_INPUT='gs://deepvariant/cybdv/cybdv_0.9.0_difficult.tfrecord.gz'\n",
        "\n",
        "easy_input='easy.tfrecord.gz'\n",
        "difficult_input='difficult.tfrecord.gz'\n",
        "! time gsutil cp $EASY_INPUT $easy_input\n",
        "! time gsutil cp $DIFFICULT_INPUT $difficult_input"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "b35iBUbDLIYU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "! ls -ltrsh"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xHP_ixfovmxV"
      },
      "source": [
        "## Data wrangling and inspection"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "i8oHeR6u1SIZ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Create a Dataset object for each set of tfrecords\n",
        "raw_difficult_dataset = tf.data.TFRecordDataset(difficult_input, compression_type=\"GZIP\")\n",
        "raw_easy_dataset = tf.data.TFRecordDataset(easy_input, compression_type=\"GZIP\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "OGdtNAIV1VoW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# We can inspect an example to see what is inside:\n",
        "for e in raw_easy_dataset.take(1):\n",
        "  example = tf.train.Example()\n",
        "  example.ParseFromString(e.numpy())\n",
        "  print(example)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uRWlMbfByLZW",
        "colab_type": "text"
      },
      "source": [
        "### Top-level parsing of each locus"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R7uh7kQv1hoh",
        "colab_type": "text"
      },
      "source": [
        "We can often work with the examples from there, but TensorFlow has some nice methods for filtering if we first tell it about the structure of the example and how to parse the features:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "7Wu-1zqjvmxY",
        "colab": {}
      },
      "source": [
        "# Describe the features. These were defined in the prep script and can also be inspected above.\n",
        "consolidated_variant_description = {\n",
        "  'locus_id': tf.io.FixedLenFeature([], tf.string),\n",
        "  'multiallelic': tf.io.FixedLenFeature([], tf.int64),\n",
        "  'difficulty': tf.io.FixedLenFeature([], tf.int64),\n",
        "  'genotypes': tf.io.VarLenFeature(tf.string)\n",
        "}\n",
        "\n",
        "# Quick parsing at the top level\n",
        "def quick_parse_locus(locus_proto):\n",
        "  return tf.io.parse_single_example(locus_proto, consolidated_variant_description)\n",
        "\n",
        "difficult_dataset = raw_difficult_dataset.map(quick_parse_locus)\n",
        "easy_dataset = raw_easy_dataset.map(quick_parse_locus)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VEsKL0662CNm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "for e in easy_dataset.take(1):\n",
        "  print(e)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "b_h6smGdyaeh",
        "colab_type": "text"
      },
      "source": [
        "### Filtering to useful subsets of the datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "d6vGcesEl0V8",
        "colab_type": "text"
      },
      "source": [
        "We have only included 80 easy examples (out of over 3 million) along with all difficult examples."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "i-VXke1LwKeW",
        "colab_type": "code",
        "outputId": "acf6fd47-cfdf-4d19-e01e-e40ab6041d49",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "# Simple counter\n",
        "def count_calls(calls):\n",
        "  count = 0\n",
        "  for c in calls:\n",
        "    count += 1\n",
        "  return count\n",
        "\n",
        "print('Number of easy examples:', count_calls(easy_dataset))\n",
        "print('Number of difficult examples:', count_calls(difficult_dataset))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Number of easy examples: 80\n",
            "Number of difficult examples: 5696\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-mjsKHPSyBR7",
        "colab_type": "text"
      },
      "source": [
        "The parsing above is only at the top level, and we will parse those binary 'variant' and 'genotypes' fields next, but first we can start to do some filtering using the simple features, namely 'multiallelic' and 'difficulty':"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "ETtWfIBzvmxd",
        "colab": {}
      },
      "source": [
        "# A few examples of easy variants\n",
        "easy_biallelics = easy_dataset.filter(lambda x: tf.equal(x['multiallelic'], False))\n",
        "easy_multiallelics = easy_dataset.filter(lambda x: tf.equal(x['multiallelic'], True))\n",
        "\n",
        "# Where DeepVariant had less than 90% likelihood for its choice OR it chose wrong\n",
        "difficult_biallelics = difficult_dataset.filter(lambda x: tf.equal(x['multiallelic'], False))\n",
        "difficult_multiallelics = difficult_dataset.filter(lambda x: tf.equal(x['multiallelic'], True))\n",
        "\n",
        "# In the prep script, we set difficult=100 when DeepVariant's top choice did not match the truth (i.e. DeepVariant got it wrong)\n",
        "dv_missed = difficult_dataset.filter(lambda x: tf.equal(x['difficulty'], 100))\n",
        "\n",
        "# Optional counting of examples (commented out as these take several seconds to run)\n",
        "# Uncomment these if you want to see the counts of different types of variants.\n",
        "# print('easy_biallelics count:', count_calls(easy_biallelics))\n",
        "# print('easy_multiallelics count:', count_calls(easy_multiallelics))\n",
        "# print('difficult_biallelics count', count_calls(difficult_biallelics))\n",
        "# print('difficult_multiallelics count:', count_calls(difficult_multiallelics))\n",
        "# print('dv_missed count:', count_calls(dv_missed))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jliXZz0ye0aP",
        "colab_type": "text"
      },
      "source": [
        "### Parsing genotypes within each locus"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0Oe646M_OJ6t",
        "colab_type": "text"
      },
      "source": [
        "Similar to what we did at the top level, unwrap each genotype, parse the nested protos in 'variant' and each genotype's 'example', and then turn all the remaining types into native Python data structures:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "usBd2-8R_doE",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def bytes_to_str(b):\n",
        "  if isinstance(b, type('')):\n",
        "    return b\n",
        "  elif isinstance(b, type(b'')):\n",
        "    return b.decode()\n",
        "  else:\n",
        "    raise ValueError('Incompatible type: {}. Expected bytes or str.'.format(type(b)))\n",
        "\n",
        "def fully_parse_locus(top_level_parsed):\n",
        "  # where top_level_parsed = tf.io.parse_single_example(locus_proto, consolidated_variant_description)\n",
        "\n",
        "  def _clean_locus(locus):\n",
        "    return {\n",
        "        'locus_id': bytes_to_str(locus['locus_id'].numpy()),\n",
        "        'multiallelic': bool(locus['multiallelic'].numpy()),\n",
        "        'genotypes': locus['genotypes'],\n",
        "        'difficulty': locus['difficulty']\n",
        "    }\n",
        "  clean_locus = _clean_locus(top_level_parsed)\n",
        "  genotype_description = {\n",
        "    'example': tf.io.FixedLenFeature([], tf.string),\n",
        "    'truth_label': tf.io.FixedLenFeature([], tf.int64),\n",
        "    'genotype_probabilities': tf.io.VarLenFeature(tf.float32),\n",
        "    'dv_correct': tf.io.FixedLenFeature([], tf.int64),\n",
        "    'dv_label': tf.io.FixedLenFeature([], tf.int64),\n",
        "    'alt': tf.io.FixedLenFeature([], tf.string)\n",
        "  }\n",
        "  def _parse_genotype(sub_example):\n",
        "    return tf.io.parse_single_example(sub_example, genotype_description)\n",
        "\n",
        "  def _clean_genotype(e):\n",
        "    genotype_example = tf.train.Example()\n",
        "    genotype_example.ParseFromString(e['example'].numpy())\n",
        "    return {\n",
        "        'genotype_probabilities': list(e['genotype_probabilities'].values.numpy()),\n",
        "        'dv_correct': bool(e['dv_correct'].numpy()),\n",
        "        'dv_label': e['dv_label'].numpy(),\n",
        "        'truth_label': e['truth_label'].numpy(),\n",
        "        'example': genotype_example,\n",
        "        'alt': bytes_to_str(e['alt'].numpy())\n",
        "    }\n",
        "\n",
        "\n",
        "  genotypes = clean_locus['genotypes'].values\n",
        "  parsed_genotypes = []\n",
        "  for s in genotypes:\n",
        "    genotype_dict = _parse_genotype(s)\n",
        "    clean_genotype_dict = _clean_genotype(genotype_dict)\n",
        "    parsed_genotypes.append(clean_genotype_dict)\n",
        "\n",
        "  clean_locus['genotypes'] = parsed_genotypes\n",
        "  return clean_locus"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cMivUT-TOHbN",
        "colab_type": "text"
      },
      "source": [
        "Let's look at one locus:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8Fwa_YueJl9i",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "for e in easy_biallelics.take(1):\n",
        "  example = fully_parse_locus(e)\n",
        "  print(example)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "irKl-FbAfFiF",
        "colab_type": "text"
      },
      "source": [
        "### Inspecting some loci"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZJ_nQYA1IMkq",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def pretty_print_locus(locus):\n",
        "  # initial information:\n",
        "  allelic_string = \"Multiallelic\" if locus['multiallelic'] else \"Biallelic\" \n",
        "  print(\"%s -- %s: %d example%s\" % (locus['locus_id'], allelic_string, len(locus['genotypes']), \"s\" if locus['multiallelic'] else \"\"))\n",
        "\n",
        "def show_all_genotypes(genotypes):\n",
        "  # Show pileup images for all genotypes\n",
        "  for g in genotypes:\n",
        "    vis.draw_deepvariant_pileup(example=g['example'])\n",
        "    # And here's how DeepVariant did:\n",
        "    print(\"Truth: %d, DeepVariant said: %d, Correct: %s\" % (g['truth_label'], g['dv_label'], g['dv_correct']))\n",
        "    print(\"DeepVariant genotype likelihoods:\", g['genotype_probabilities'])\n",
        "    print(\"\\n\\n\")\n",
        "\n",
        "def show_loci(loci, show_pileups=True):\n",
        "  for raw_locus in loci:\n",
        "    if show_pileups:\n",
        "      print(\"_____________________________________________________________\")\n",
        "    locus = fully_parse_locus(raw_locus)\n",
        "    pretty_print_locus(locus)\n",
        "    if show_pileups:\n",
        "      show_all_genotypes(locus['genotypes'])\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vG7xZlXnQqat",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "show_loci(easy_biallelics.take(1))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "GLba73KuthEN",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "show_loci(easy_multiallelics.take(1))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "j_Kw5fxtlVT2",
        "colab_type": "text"
      },
      "source": [
        "All multiallelic loci will have at least 3 examples (pileup tensors) because there is one for each set of one or two alternate alleles. Those with three alternate alleles produce 6 examples."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "outputId": "a33d573a-4628-4776-bed5-63eb4c33b5af",
        "id": "BFq9DN8cvmxm",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 187
        }
      },
      "source": [
        "show_loci(difficult_multiallelics.take(10), show_pileups=False)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "1:194112316_T -- Multiallelic: 3 examples\n",
            "2:130397840_GGT -- Multiallelic: 3 examples\n",
            "2:182813337_TTTTCTTTCTTTC -- Multiallelic: 6 examples\n",
            "3:30941293_AGTGTGTGTGTGTGT -- Multiallelic: 6 examples\n",
            "3:124125678_C -- Multiallelic: 6 examples\n",
            "3:188317538_G -- Multiallelic: 3 examples\n",
            "5:102883892_T -- Multiallelic: 3 examples\n",
            "5:132835059_T -- Multiallelic: 6 examples\n",
            "6:12399702_CAAAAA -- Multiallelic: 3 examples\n",
            "6:22191313_GAA -- Multiallelic: 3 examples\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0Iz89uuRgsuC",
        "colab_type": "text"
      },
      "source": [
        "## Set up the game-play code"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-N8gcXjlgr_j",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "WRITE_NORMAL = \"\\x1b[0m\"\n",
        "WRITE_GREEN_BACKGROUND = \"\\x1b[102m\"\n",
        "WRITE_RED_BACKGROUND = \"\\x1b[101m\"\n",
        "\n",
        "def play_game(calls, pro_mode=False, put_results_here=None):\n",
        "  \"\"\"\n",
        "  Args:\n",
        "    put_results_here: a list, allows saving results along the way even if the player stops the loop\n",
        "  \"\"\"\n",
        "  # for example, calls = dataset.take(5)\n",
        "  print(\"Can you beat DeepVariant?: type 0, 1, or 2 just like DeepVariant's CNN would for each example.\")\n",
        "  results = []\n",
        "  score = 0\n",
        "  total = 0\n",
        "  dv_score = 0\n",
        "  for c in calls:\n",
        "    locus = fully_parse_locus(c)\n",
        "    print(\"___________________________________________________________\")\n",
        "    print(locus['locus_id'])\n",
        "    allelic_string = \"Multiallelic\" if locus['multiallelic'] else \"Biallelic\" \n",
        "    print(\"%s: %d example%s\" % (allelic_string, len(locus['genotypes']), \"s\" if locus['multiallelic'] else \"\"))\n",
        "    quit_early = False\n",
        "    # For every genotype, show the user the images and ask for their guess\n",
        "    for e in locus['genotypes']:\n",
        "      # Draw pileups:\n",
        "      vis.draw_deepvariant_pileup(example=e['example'])\n",
        "      print('Genotype in question: ', e['alt'])\n",
        "      # Ask user until we get a 0, 1, or 2\n",
        "      while True:\n",
        "        try:\n",
        "          guess = input(\"Your answer (0, 1, or 2):\")\n",
        "          if guess in ['exit', 'quit', 'q']:\n",
        "            quit_early = True\n",
        "            break\n",
        "          guess = int(guess)\n",
        "          if guess in [0,1,2]:\n",
        "            break\n",
        "          else:\n",
        "            print(\"Please enter only 0, 1, or 2.\")\n",
        "        except:\n",
        "          print(\"Please enter only 0, 1, or 2. Or enter q to quit.\")\n",
        "          continue\n",
        "      if quit_early:\n",
        "        break\n",
        "      truth = e['truth_label']\n",
        "      if truth == guess:\n",
        "        print(WRITE_GREEN_BACKGROUND + \"You are correct!\", WRITE_NORMAL)\n",
        "        score += 1\n",
        "      else:\n",
        "        print(WRITE_RED_BACKGROUND + \"Nope! Sorry it's actually\", truth, WRITE_NORMAL)\n",
        "      total += 1\n",
        "      # And here's how DeepVariant did:\n",
        "      if e['dv_correct']:\n",
        "        dv_score += 1\n",
        "      dv_result_string = WRITE_GREEN_BACKGROUND + \"is correct\" + WRITE_NORMAL if e['dv_correct'] else WRITE_RED_BACKGROUND + \"failed\" + WRITE_NORMAL\n",
        "      print(\"DeepVariant %s with likelihoods:\" % (dv_result_string), e['genotype_probabilities'])\n",
        "      result = {'id': locus['locus_id'], 'truth': truth, 'guess': guess, 'dv_label': e['dv_label']}\n",
        "      if pro_mode:\n",
        "        result['user_supplied_classification'] = input(\"Classification?\")\n",
        "      if put_results_here != None:\n",
        "        put_results_here.append(result)\n",
        "      results.append(result)\n",
        "    if quit_early:\n",
        "      break\n",
        "  print(\"=============================================================\")\n",
        "  print(\"=============================================================\")\n",
        "  print(\"=============================================================\")\n",
        "  print(\"Your score is\", score, \"/\", total)\n",
        "  print(\"DeepVariant's score is\", dv_score)\n",
        "  return results"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n1wYLHFhEou7",
        "colab_type": "text"
      },
      "source": [
        "# Now let's play the game!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JQjaaLiNlq1I",
        "colab_type": "text"
      },
      "source": [
        "## Play the game: EASY"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Cu5EW8oTBBUw",
        "colab_type": "text"
      },
      "source": [
        "These are examples that DeepVariant was at least 99% sure about and got correct. Think of this as a tutorial. Once you can easily get them all right, try the difficult examples below.\n",
        "\n",
        "Your job is to pick 0, 1, or 2, depending on how many copies of the given alternate allele you see in the pileup."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "b40RIsydPPzg",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "easy_biallelic_results = play_game(easy_biallelics.shuffle(80).take(6))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "XuYY0qZNpIiB",
        "colab_type": "text"
      },
      "source": [
        "## Play the game: DIFFICULT"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1x6FidbVbOip",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "difficult_biallelic_results = play_game(difficult_biallelics.shuffle(1000).take(10))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "g-9ckwTUpYTv",
        "colab_type": "text"
      },
      "source": [
        "## Play the game: Multiallelics only"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vo2q9y0yck6F",
        "colab_type": "text"
      },
      "source": [
        "For multiallelics, it's possible to cheat because the pileups are related. Once you know the answer to one or two pileups, you can use that information to help infer the remaining calls. Note that DeepVariant doesn't get this information, so you have an unfair advantage here.\n",
        "\n",
        "Just be careful, the `base differs from ref` channel will light up for ALL alt alleles. Pay more attention to the reads highlighted (white) in the `read supports variant` channel, and choose 0, 1, or 2 depending on how many copies of THIS alternate allele are shown."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "z-4cRIBylvhW",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "easy_multiallelic_results = play_game(easy_multiallelics.shuffle(9).take(3))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BcSjB3X-DA_Z",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "difficult_multiallelic_results = play_game(difficult_multiallelics.shuffle(50).take(3))"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}