--- a
+++ b/Analysis.ipynb
@@ -0,0 +1,592 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "jhudeCHQxeQE"
+   },
+   "outputs": [],
+   "source": [
+    "\"\"\"\n",
+    "Initialize environment and import necessary libraries for the analysis\n",
+    "of diversity in head and neck cancer clinical trials.\n",
+    "\"\"\"\n",
+    "\n",
+    "import warnings\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "\n",
+    "pd.set_option('display.max_rows', 200)\n",
+    "pd.set_option('display.max_columns', 500)\n",
+    "\n",
+    "# Suppress FutureWarning messages only\n",
+    "warnings.simplefilter(action='ignore', category=FutureWarning)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "kQA_ILjjzJ3S"
+   },
+   "outputs": [],
+   "source": [
+    "# Import visualization libraries for creating interactive plots\n",
+    "import plotly\n",
+    "from plotly import graph_objects as go\n",
+    "import plotly.express as px\n",
+    "from plotly.subplots import make_subplots"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VrlIkrAc2jgZ"
+   },
+   "source": "# Exclusion criteria for the analysis\n\nThis analysis examines diversity in head and neck cancer clinical trials, focusing on studies conducted in the United States that reported race information.\n\nThe dataset contains studies conducted all over the world. To keep the cultural context consistent, we include only studies that were performed exclusively in the United States. Because the analysis depends on the race of the participants, we further restrict the set to studies where both \"Num. White participants\" and \"Num. Non-white participants\" were reported (columns `# White` and `# Non White` are not blank).\n\nThe number of studies remaining after each successive filter is:\n- Total number of studies: 278\n- Studies performed in the USA only: 187\n- USA-only studies that contain information on race: 116\n\nOur diversity metric (called the \"success metric\" in later sections) is defined as:\n- Diversity Score = (# non-white participants) / (# total participants) × 100\n- where total participants = # white participants + # non-white participants"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "HJUIRbTAxy0m",
+    "outputId": "72aeb624-21a4-4fd0-815f-c3a378d32115"
+   },
+   "outputs": [],
+   "source": [
+    "df = pd.read_csv(\"all_studies.csv\")\n",
+    "df.columns, df.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "WpTiHpwfMT3R",
+    "outputId": "e7f408d9-d0f4-4a48-d696-6a3e32ccd38a"
+   },
+   "outputs": [],
+   "source": [
+    "df['Area Offered'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 542
+    },
+    "id": "eIetbQkwyLDS",
+    "outputId": "12d32dfc-015f-42e5-d5d2-06e1a567ffe7"
+   },
+   "outputs": [],
+   "source": [
+    "# Count studies per area; rename_axis/reset_index(name=...) yields the same\n",
+    "# column names regardless of pandas version\n",
+    "df_studies_per_area = (\n",
+    "    df['Area Offered']\n",
+    "    .value_counts()\n",
+    "    .rename_axis(\"Countries\")\n",
+    "    .reset_index(name=\"Counts\")\n",
+    ")\n",
+    "\n",
+    "px.bar(\n",
+    "    df_studies_per_area,\n",
+    "    x=\"Countries\",\n",
+    "    y=\"Counts\"\n",
+    ").update_layout(\n",
+    "    xaxis=dict(title=\"Area where the study was offered\"),\n",
+    "    yaxis=dict(title=\"Number of studies\"),\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "MfrAdd9-1HZU",
+    "outputId": "fbb7a234-8a60-424a-8c24-f9931e8385c7"
+   },
+   "outputs": [],
+   "source": [
+    "# Filter for USA studies only\n",
+    "df_usa = df[df['Area Offered'] == \"United States\"]\n",
+    "\n",
+    "# Filter for presence of race information; copy so columns can be added later\n",
+    "df_final = df_usa[df_usa[\"# White\"].notna() & df_usa[\"# Non White\"].notna()].copy()\n",
+    "\n",
+    "print(f\"Num. studies in USA: {df_usa.shape[0]}\")\n",
+    "print(f\"Num. studies in USA AND contains race information: {df_final.shape[0]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "TYeHSci14imd"
+   },
+   "source": "# Success metric and its distribution\n\nThis analysis defines the \"success metric\" of a study as the percentage of non-white participants in that study. The metric quantifies diversity and lets us compare studies on a common scale.\n\n**Success Metric** = (# non-white participants) / (# total participants) × 100\n\nApplied to the filtered studies, this metric gives the following statistics:\n- Avg. success percentage: 14.80%\n- Median success percentage: 11.55%\n- 20th percentile success percentage (low success): 4.76%\n- 80th percentile success percentage (high success): 21.93%\n\nWe use the 20th and 80th percentiles as thresholds: studies at or below the 20th percentile form the \"Bottom20\" (low diversity) group and studies at or above the 80th percentile form the \"Top20\" (high diversity) group for comparative analysis."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "MNCzqa3K2iyx",
+    "outputId": "96942c20-55e1-476b-ee65-ba5ba17fcaa4"
+   },
+   "outputs": [],
+   "source": [
+    "df_final[\"success_metric\"] = df_final[\"# Non White\"] / (df_final[\"# White\"] + df_final[\"# Non White\"]) * 100.0\n",
+    "\n",
+    "print(f\"Avg. success percentage: {df_final['success_metric'].mean():.2f}%\")\n",
+    "print(f\"Median success percentage: {np.quantile(df_final['success_metric'], 0.5):.2f}%\")\n",
+    "print(f\"20th percentile success percentage (low success): {np.quantile(df_final['success_metric'], 0.2):.2f}%\")\n",
+    "print(f\"80th percentile success percentage (high success): {np.quantile(df_final['success_metric'], 0.8):.2f}%\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 542
+    },
+    "id": "6Ywo1KNw9NiC",
+    "outputId": "c44c97f5-da2b-4430-9bae-6aeab5fb653b"
+   },
+   "outputs": [],
+   "source": [
+    "# Empirical cumulative distribution function (CDF) of the success metric\n",
+    "hist, bins = np.histogram(df_final[\"success_metric\"], bins=100)\n",
+    "cdf = np.cumsum(hist) / hist.sum()\n",
+    "\n",
+    "# cdf[i] is the fraction of studies at or below the right edge of bin i,\n",
+    "# so plot it against the right bin edges\n",
+    "(\n",
+    "    px.line(x=bins[1:], y=cdf)\n",
+    "    .update_xaxes(title=\"Success metric: % non-white participants\")\n",
+    "    .update_yaxes(title=\"Fraction of studies\")\n",
+    "    .update_layout(width=800)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "KkkZ6oYNf_0R"
+   },
+   "outputs": [],
+   "source": [
+    "success_metric_20th_perc = np.quantile(df_final.success_metric, 0.2)\n",
+    "success_metric_80th_perc = np.quantile(df_final.success_metric, 0.8)\n",
+    "\n",
+    "df_top_20 = df_final[df_final.success_metric >= success_metric_80th_perc]\n",
+    "df_bottom_20 = df_final[df_final.success_metric <= success_metric_20th_perc]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "WnaZyzbRf_3k"
+   },
+   "outputs": [],
+   "source": [
+    "df_top_20"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "8Jj0t1vtf_6s"
+   },
+   "outputs": [],
+   "source": [
+    "df_bottom_20"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "qFpVFWNwf_9c"
+   },
+   "outputs": [],
+   "source": [
+    "df_top_20.to_csv(\"top_20_studies.csv\")\n",
+    "df_bottom_20.to_csv(\"bottom_20_studies.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "VzBqdwyJAsr-"
+   },
+   "source": "# Distribution by success categories\n\nBased on the success metric (% of non-white participants), we categorize the studies into three groups:\n- **Top20**: Top 20% of the studies by success metric (most diverse)\n- **Bottom20**: Bottom 20% of the studies by success metric (least diverse)\n- **Neither**: Studies in the middle 60%\n\nThis categorization allows us to compare factors that might contribute to diversity by examining the differences between highly diverse and less diverse studies. The key factors we analyze include:\n\n1. **Eligibility Criteria**: Restrictions on participant eligibility, including:\n   - Age restrictions beyond standard 18+ requirement\n   - Cancer stage or tumor size restrictions\n   - Cancer site restrictions\n   - Histological type restrictions (e.g., SCC only)\n   - Performance score requirements\n   - Comorbidity restrictions\n   - Treatment history restrictions\n   - Laboratory value requirements\n   - Pregnancy/contraception requirements\n   - Other restrictions (smoking status, ethnicity, etc.)\n\n2. **Study Characteristics**:\n   - Single vs. multi-institution studies\n   - Number of participants\n   - Geographic location\n   - Male/female ratio\n   - Trial type (Primary/Palliative/Recurrent/Metastatic)\n   - Modality (Drug/Radiation/Biological/Combination)"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "K7Bl9Et3A5iW"
+   },
+   "outputs": [],
+   "source": [
+    "top_20_success_metric_threshold = np.quantile(df_final[\"success_metric\"], 0.8)\n",
+    "bottom_20_success_metric_threshold = np.quantile(df_final[\"success_metric\"], 0.2)\n",
+    "\n",
+    "def get_category_label(x):\n",
+    "  if x >= top_20_success_metric_threshold:\n",
+    "    return \"Top20\"\n",
+    "  elif x <= bottom_20_success_metric_threshold:\n",
+    "    return \"Bottom20\"\n",
+    "  else:\n",
+    "    return \"Neither\"\n",
+    "\n",
+    "df_final[\"success_category\"] = df_final[\"success_metric\"].apply(get_category_label)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "5BOhcCA1HFb7"
+   },
+   "outputs": [],
+   "source": [
+    "categories = [\"Top20\", \"Bottom20\", \"Neither\"]\n",
+    "\n",
+    "def compare_field_by_category(df, field, height=900, width=1200):\n",
+    "  fig = make_subplots(rows=3, subplot_titles=categories, vertical_spacing=0.1, shared_xaxes=True)\n",
+    "  for i, category in enumerate(categories):\n",
+    "    # rename_axis/reset_index(name=...) gives stable column names across pandas versions\n",
+    "    df_category = df.loc[df[\"success_category\"] == category, field].value_counts().rename_axis(field).reset_index(name=\"Num. Studies\")\n",
+    "    fig.add_trace(\n",
+    "        go.Bar(\n",
+    "            x=df_category[field],\n",
+    "            y=df_category[\"Num. Studies\"],\n",
+    "            name=category\n",
+    "        ),\n",
+    "        row=i+1,\n",
+    "        col=1\n",
+    "    )\n",
+    "\n",
+    "  fig.update_layout(height=height, width=width).show()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 917
+    },
+    "id": "kTQgy6FERIBp",
+    "outputId": "90147aee-b04c-40d3-9db6-30d08f5402fd"
+   },
+   "outputs": [],
+   "source": [
+    "#field = \"Modalities\"\n",
+    "field = \"Trial Type \"\n",
+    "#field = \"Cancer Site\"\n",
+    "#field = \"Trial Phase\"\n",
+    "#field = \"Tumor Type\"\n",
+    "\n",
+    "compare_field_by_category(df_final, field)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 1000
+    },
+    "id": "CO6E-_jf9Nps",
+    "outputId": "862abbe5-c7e5-4bdd-ff4d-f57704b0eaf0"
+   },
+   "outputs": [],
+   "source": [
+    "df_final[[\n",
+    "    'Study Title - Link to Page here',\n",
+    "    'Study ID ',\n",
+    "    'Study Start Date',\n",
+    "    'APC Date',\n",
+    "    'Cancer Site',\n",
+    "    'Trial Type ',\n",
+    "    'Trial Phase',\n",
+    "    'Tumor Type',\n",
+    "    'Modalities',\n",
+    "    'Trial Status',\n",
+    "    'Total Included',\n",
+    "    'Median Age',\n",
+    "    'Mean Age',\n",
+    "    'Min Age',\n",
+    "    'Max Age',\n",
+    "    '# Female',\n",
+    "    '# Male',\n",
+    "    '# White',\n",
+    "    '#Hispanic (ethnicity)',\n",
+    "    '# Non White',\n",
+    "    '# Asian',\n",
+    "    '#American Indian',\n",
+    "    '#Native Hawaiian or Pacifi Islande',\n",
+    "    '#Black ',\n",
+    "    '#Not Reported/Other',\n",
+    "    'notes',\n",
+    "    'contact library?',\n",
+    "    'success_metric']]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "OzIfnX1y14OJ"
+   },
+   "outputs": [],
+   "source": [
+    "df_race_reported = df_usa[~(df_usa[\"# White\"].isna() | df_usa[\"# Non White\"].isna())]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "ZKbxMOGM14TP",
+    "outputId": "33d46aee-90c7-41d6-a316-9286211323db"
+   },
+   "outputs": [],
+   "source": [
+    "df_race_reported[df_race_reported[\"# Non White\"] == 0]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "dA_SYJ2o1EA8"
+   },
+   "source": [
+    "The dataset contains"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "1_ZA2ifOxz1P"
+   },
+   "outputs": [],
+   "source": [
+    "df[\"has_valid_participants\"] = ~(df[\"# White\"].isna() | df[\"# Non White\"].isna() | (df[\"# White\"] == 0))\n",
+    "df_filtered = df[df[\"has_valid_participants\"]]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "TZpj5OSKxz3-",
+    "outputId": "535cba76-a04b-43ec-92ce-2bc58bac0dde"
+   },
+   "outputs": [],
+   "source": [
+    "\n",
+    "df['Area Offered'].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 990
+    },
+    "id": "2Dym0VcXxz7R",
+    "outputId": "1fe9f648-99dc-43cd-bc33-28b9bbcaac3f"
+   },
+   "outputs": [],
+   "source": [
+    "df['Area Offered'].value_counts().reset_index()\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 390
+    },
+    "id": "oEh5Fuotx0A9",
+    "outputId": "ad6fc274-65d3-4f9e-f33c-f5a97ed8da42"
+   },
+   "outputs": [],
+   "source": [
+    "px.bar(\n",
+    "    df_studies_per_area,\n",
+    "    x=\"Countries\",\n",
+    "    y=\"Counts\"\n",
+    ").update_layout(\n",
+    "    xaxis=dict(title=\"Area the trial was offered\"),\n",
+    "    yaxis=dict(title=\"Num. trials\"),\n",
+    ")"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
\ No newline at end of file