|
a |
|
b/README.md |
|
|
1 |
<div class="sc-jegwdG lhLRCf"><div class="sc-UEtKG dGqiYy sc-flttKd cguEtd"><div class="sc-fqwslf gsqkEc"><div class="sc-cBQMlg kAHhUk"><h2 class="sc-dcKlJK sc-cVttbi gqEuPW ksnHgj">About Dataset</h2></div></div></div><div class="sc-davvxH eCVTlP"><div class="sc-jCNfQM dTyvWO"><div style="min-height: 80px;"><div class="sc-etVRix jqYJaa sc-gVIFzB gQKGyV"><p>This is a <strong>brand-new</strong> (!) dataset from an open-access paper <a rel="noreferrer nofollow" aria-label="published December 10, 2020 (opens in a new tab)" target="_blank" href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003489">published December 10, 2020</a>. The paper and the full dataset are open-access (<a rel="noreferrer nofollow" aria-label="CC-BY (opens in a new tab)" target="_blank" href="https://creativecommons.org/licenses/by/4.0/">CC-BY</a>), so please give attribution to the original authors in your work. </p> |
|
|
2 |
<h3>Background</h3> |
|
|
3 |
<p>Pancreatic cancer is an extremely deadly type of cancer. After diagnosis, the five-year survival rate is less than 10%. However, if pancreatic cancer is caught early, the odds of surviving are much better. Unfortunately, many cases of pancreatic cancer show no symptoms until the cancer has spread throughout the body. A diagnostic test that identifies pancreatic cancer early could be enormously helpful. </p> |
|
|
4 |
<h3>The paper</h3> |
|
|
5 |
<p>In a <a rel="noreferrer nofollow" aria-label="paper (opens in a new tab)" target="_blank" href="https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1003489">paper</a> by Silvana Debernardi and colleagues, published in December 2020 in the journal PLOS Medicine, a multinational team of researchers sought to develop an accurate diagnostic test for the most common type of pancreatic cancer, pancreatic ductal adenocarcinoma (PDAC). They gathered a series of biomarkers from the urine of three groups of patients: </p> |
|
|
6 |
<ul> |
|
|
7 |
<li>Healthy controls</li> |
|
|
8 |
<li>Patients with non-cancerous pancreatic conditions, like chronic pancreatitis</li> |
|
|
9 |
<li>Patients with pancreatic ductal adenocarcinoma </li> |
|
|
10 |
</ul> |
|
|
11 |
<p>When possible, these patients were age- and sex-matched. The goal was to develop an accurate way to identify patients with pancreatic cancer.</p> |
|
|
12 |
<h3>The data</h3> |
|
|
13 |
<p>The key features are four urinary biomarkers: creatinine, LYVE1, REG1B, and TFF1. </p> |
|
|
14 |
<ul> |
|
|
15 |
<li><strong>Creatinine</strong> is a metabolic waste product (not a protein) that is often used as an indicator of kidney function. </li> |
|
|
16 |
<li><strong>LYVE1</strong> is lymphatic vessel endothelial hyaluronan receptor 1, a protein that may play a role in tumor metastasis</li> |
|
|
17 |
<li><strong>REG1B</strong> is a protein that may be associated with pancreas regeneration</li> |
|
|
18 |
<li><strong>TFF1</strong> is trefoil factor 1, which may be related to regeneration and repair of the urinary tract</li> |
|
|
19 |
</ul> |
|
|
20 |
<p><strong>Age</strong> and <strong>sex</strong>, both included in the dataset, may also play a role in who gets pancreatic cancer. The dataset includes a few other biomarkers as well, but these were not measured in all patients (they were collected partly to measure how various blood biomarkers compared to urine biomarkers). </p> |
|
|
21 |
<p>I have not changed any of the data from the paper, other than renaming the columns for easy importing and use. The file <code>Debernardi et al 2020 data.csv</code> contains the raw data, while the file <code>Debernardi et al 2020 documentation.csv</code> contains detailed documentation of what each column represents (as well as the original column names from the paper).</p> |
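<p>As a rough illustration of loading the data file and checking which columns have gaps, here is a minimal sketch using only the Python standard library. The helper names <code>load_rows</code> and <code>missing_counts</code> are hypothetical, the column names in the sample (including <code>extra_marker</code>) are assumptions to be checked against the documentation file, and the sample values are made up rather than taken from the dataset.</p>

```python
import csv
import io


def load_rows(fileobj):
    """Read CSV rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(fileobj))


def missing_counts(rows):
    """Count empty or absent cells per column."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            if name is None:
                # Overflow fields from a malformed row; ignore them here.
                continue
            counts.setdefault(name, 0)
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[name] += 1
    return counts


# Made-up stand-in for the real file; column names are assumed, values are not real data.
sample = io.StringIO(
    "age,sex,diagnosis,creatinine,LYVE1,REG1B,TFF1,extra_marker\n"
    "60,F,1,1.0,0.5,20.0,100.0,\n"
    "55,M,3,0.8,4.2,150.0,600.0,12.5\n"
)
rows = load_rows(sample)
print(missing_counts(rows))
```

<p>With the real file, the same helpers could be called as <code>load_rows(open("Debernardi et al 2020 data.csv", newline=""))</code>. Columns with many missing cells are likely the extra biomarkers that were not measured in all patients.</p>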
|
|
22 |
<h3>Prediction task</h3> |
|
|
23 |
<p>The goal in this dataset is to predict <code>diagnosis</code>, and more specifically to distinguish 3 (pancreatic cancer) from 2 (non-cancerous pancreas condition) and 1 (healthy). The dataset includes the stage of pancreatic cancer and the specific diagnosis for non-cancerous patients, but remember that these won't be available to a predictive model. The goal, after all, is to predict the presence of disease <em>before</em> it's diagnosed, not after! </p> |
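<p>One way to read the task above is as a binary problem: cancer (3) versus everything else (1 and 2). The sketch below is a minimal illustration under that assumption, not the paper's method. The function names and the feature column names in <code>FEATURES</code> are hypothetical and should be checked against the documentation file. Selecting features from an explicit allow-list keeps post-diagnosis columns, such as cancer stage, out of the model inputs without needing to know their exact names.</p>

```python
# Assumed column names for the inputs described in this README; verify them
# against the documentation CSV before use.
FEATURES = ("age", "sex", "creatinine", "LYVE1", "REG1B", "TFF1")


def to_binary_target(diagnosis):
    """Map a diagnosis code to 1 (pancreatic cancer) or 0 (healthy or non-cancerous)."""
    code = int(diagnosis)
    if code not in (1, 2, 3):
        raise ValueError(f"unexpected diagnosis code: {diagnosis!r}")
    return 1 if code == 3 else 0


def make_example(row):
    """Split one row (a dict) into allow-listed features and a binary label."""
    features = {name: row[name] for name in FEATURES}
    return features, to_binary_target(row["diagnosis"])


# Made-up row; "stage" stands in for any post-diagnosis column (name assumed).
row = {
    "age": "55", "sex": "M", "diagnosis": "3", "stage": "III",
    "creatinine": "0.8", "LYVE1": "4.2", "REG1B": "150.0", "TFF1": "600.0",
}
features, label = make_example(row)
print(sorted(features), label)
```

<p>Because only allow-listed columns are copied, adding a new feature is a deliberate step rather than an accident of whatever columns happen to be in the file.</p>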
|
|
24 |
<h3>Acknowledgements</h3> |
|
|
25 |
<p>I would like to thank the authors of this paper for graciously sharing their raw data with the research community. </p></div></div></div> |