a/README.md b/README.md
# Diversity in Head and Neck Cancer Clinical Trials

This repository contains code and data for analyzing diversity in head and neck cancer clinical trials, specifically focusing on the inclusion of non-white participants in these studies.

## Project Overview

Head and neck cancer disproportionately affects certain racial and ethnic groups. This analysis aims to understand factors that contribute to higher diversity in clinical trials and identify patterns that could lead to more inclusive research.

### Key Components

1. **Diversity Metric**: The analysis uses a metric defined as the percentage of non-white participants in each study to measure diversity.
   - Score = (# non-white participants) / (# total participants) × 100
   - Total participants = # white participants + # non-white participants

2. **Comparative Analysis**: Studies are categorized into high-diversity (top 20%) and low-diversity (bottom 20%) groups based on this metric.

3. **Factor Identification**: Various factors are examined to understand what contributes to more diverse clinical trials.

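
The metric and the top/bottom 20% split described above can be illustrated with a minimal sketch. This is not the code in `src/`; the function names `diversity_score` and `split_top_bottom`, the tie-handling, and the sample counts are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def diversity_score(n_white: int, n_non_white: int) -> float:
    """Percentage of non-white participants among all participants."""
    total = n_white + n_non_white
    if total == 0:
        raise ValueError("study has no participants with reported race")
    return 100.0 * n_non_white / total


def split_top_bottom(
    scores: Dict[str, float], fraction: float = 0.2
) -> Tuple[List[str], List[str]]:
    """Return (top, bottom) study IDs: the highest and lowest `fraction` by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # at least one study per group
    return ranked[:k], ranked[-k:]


# Hypothetical (white, non-white) counts for ten made-up studies.
counts = {f"S{i}": (100 - 10 * i, 10 * i) for i in range(10)}
scores = {sid: diversity_score(w, nw) for sid, (w, nw) in counts.items()}
top, bottom = split_top_bottom(scores)
print(top, bottom)
```

With ten studies and `fraction=0.2`, each group holds two studies. How ties at the cutoff are resolved is a design choice the README does not specify; this sketch simply keeps the sort order.
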
## Analysis Methodology

The analysis followed these key steps:

1. **Data Collection**: Collected data on all head and neck cancer clinical trials from ClinicalTrials.gov
2. **Diversity Scoring**: Computed a diversity score for each trial based on participant demographics
3. **Stratification**: Identified the top 20th percentile and bottom 20th percentile of trials by diversity
4. **Feature Extraction**: Extracted key features from each clinical trial:
   - **Study Characteristics**: Start/end dates, institutional setting, number of participants, location, etc.
   - **Eligibility Criteria**: Detailed analysis of inclusion/exclusion criteria
5. **Comparative Analysis**: Compared the distribution of features between high-diversity and low-diversity trials

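
As a rough illustration of the final comparison step, the sketch below computes how often a binary restriction flag appears in each group. The helper names `prevalence` and `compare_groups` are hypothetical, the records are invented, and the numbers do not reflect the project's results; the actual analysis may use different statistics.

```python
from typing import Dict, List, Tuple


def prevalence(studies: List[Dict[str, int]], feature: str) -> float:
    """Fraction of studies in which the binary `feature` flag equals 1."""
    if not studies:
        raise ValueError("empty group")
    return sum(s[feature] for s in studies) / len(studies)


def compare_groups(
    high: List[Dict[str, int]], low: List[Dict[str, int]], features: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Map each feature to (prevalence in high group, prevalence in low group)."""
    return {f: (prevalence(high, f), prevalence(low, f)) for f in features}


# Invented records; feature names follow the eligibility table in this README.
high_diversity = [
    {"age_restrict": 0, "lab_values": 1},
    {"age_restrict": 0, "lab_values": 1},
    {"age_restrict": 1, "lab_values": 0},
    {"age_restrict": 0, "lab_values": 1},
]
low_diversity = [
    {"age_restrict": 1, "lab_values": 1},
    {"age_restrict": 1, "lab_values": 1},
    {"age_restrict": 0, "lab_values": 1},
    {"age_restrict": 1, "lab_values": 1},
]
result = compare_groups(high_diversity, low_diversity, ["age_restrict", "lab_values"])
print(result)
```
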
### Eligibility Features Analyzed

The study examined specific eligibility restrictions and their potential impact on diversity:

| Feature | Description |
|---------|-------------|
| age_restrict | 0 if the restriction is age>18, 1 for other restrictions (e.g., 18<age<75) |
| stage_size | Restrictions on the cancer stage and the size of the tumor |
| cancer_site | Restrictions on the cancer site |
| histological_type | Whether the study was limited to SCC (Squamous Cell Carcinoma) or any other type |
| performance_score | Restrictions on performance score (e.g., ECOG performance) |
| comorbidities | Restrictions on comorbidities |
| hx_of_tt | Restrictions on treatment history for cancer |
| lab_values | Restrictions on lab test values |
| pregnancy_or_contraception | Restrictions on pregnancy or particular contraceptives |
| misc | Other restrictions (e.g., smoking status, ethnicity requirements) |
| eligibility_score | Sum of all restriction scores above |

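
A minimal sketch of the composite `eligibility_score`, assuming every restriction feature is coded as 0 (no restriction) or 1 (restriction present). The README states this coding only for `age_restrict`; applying it to the other features, and the function name `eligibility_score`, are assumptions.

```python
from typing import Dict

# Feature names taken from the table above.
RESTRICTION_FEATURES = [
    "age_restrict",
    "stage_size",
    "cancer_site",
    "histological_type",
    "performance_score",
    "comorbidities",
    "hx_of_tt",
    "lab_values",
    "pregnancy_or_contraception",
    "misc",
]


def eligibility_score(study: Dict[str, int]) -> int:
    """Sum the restriction flags; raises KeyError if a feature is missing."""
    return sum(study[name] for name in RESTRICTION_FEATURES)


# Hypothetical study with three restrictions present.
example = {name: 0 for name in RESTRICTION_FEATURES}
example.update({"age_restrict": 1, "performance_score": 1, "lab_values": 1})
print(eligibility_score(example))
```

Under this coding, a higher score means more restrictive eligibility criteria, and the maximum is the number of features (ten here).
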
### General Features Analyzed

The analysis also included general study characteristics:

1. Study start date and end date
2. Single vs. multi-institutional study
3. Stringency in eligibility criteria (composite score)
4. Modality (Drug/Radiation/Biologic/Combination)
5. Number of participants
6. Geographic location
7. Male/female ratio
8. Trial type (Primary/Palliative/Recurrent/Metastatic)

## Repository Structure

```
├── README.md                        # Project documentation
├── src/                             # Source code directory
│   ├── data_processing.py           # Functions for data loading and preprocessing
│   ├── analysis.py                  # Functions for statistical analysis
│   ├── visualization.py             # Functions for creating visualizations
│   └── main.py                      # Main script that orchestrates the analysis
├── plots/                           # Generated visualizations
│   ├── box_plot_eligbility_score_diverse_vs_non_diverse.png
│   ├── box_plot_num_participants_top_vs_bottom.png
│   ├── distribution_age_restrict.png
│   ├── distribution_comorbidities.png
│   ├── distribution_histological_type.png
│   ├── distribution_hx_of_tt.png
│   ├── distribution_is_single_institution.png
│   ├── distribution_lab_values.png
│   ├── distribution_misc.png
│   ├── distribution_num_participants_top_vs_bottom_studies_strat_gender.png
│   ├── distribution_performance_score.png
│   ├── distribution_site.png
│   ├── distribution_stage_size.png
│   └── geo_distribution.png
├── top_20_studies.csv               # Dataset of top 20% diverse studies
├── bottom_20_studies.csv            # Dataset of bottom 20% diverse studies
├── Diversity in head and neck clinical trials - plots (2).pdf # PDF with plot descriptions
├── Analysis.ipynb                   # Jupyter notebook with initial analysis
└── Analysis top20 vs bottom20.ipynb # Jupyter notebook with comparative analysis
```

## Key Findings

### 1. Eligibility Criteria

The analysis of eligibility criteria revealed that more diverse studies tend to have fewer restrictive criteria:

![Eligibility Score Comparison](plots/box_plot_eligbility_score_diverse_vs_non_diverse.png)

*The above plot shows the distribution of eligibility scores for diverse vs. non-diverse studies. Higher scores indicate more restrictive eligibility criteria.*

### 2. Geographic Distribution

The geographic location of studies plays a significant role in diversity:

![Geographic Distribution](plots/geo_distribution.png)

*This map shows the locations of the top and bottom diverse studies, with color indicating the population diversity score of each location.*

### 3. Participant Demographics

Studies with higher diversity had different participant demographics:

![Participant Distribution by Gender](plots/distribution_num_participants_top_vs_bottom_studies_strat_gender.png)

*This plot shows the distribution of male and female participants in top vs. bottom diverse studies.*

### 4. Eligibility Restrictions

Specific eligibility criteria had different prevalence in diverse vs. non-diverse studies:

- **Age Restrictions**:

  ![Age Restrictions](plots/distribution_age_restrict.png)

  *This plot compares the prevalence of age restrictions beyond the standard adult age (18+) between high and low diversity studies.*

- **Histological Type Restrictions**:

  ![Histological Type Restrictions](plots/distribution_histological_type.png)

  *This plot compares the prevalence of restrictions on cancer histological type (e.g., SCC only) between high and low diversity studies.*

- **Performance Score Restrictions**:

  ![Performance Score Restrictions](plots/distribution_performance_score.png)

  *This plot compares the prevalence of ECOG or other performance score restrictions between high and low diversity studies.*

- **Comorbidity Restrictions**:

  ![Comorbidity Restrictions](plots/distribution_comorbidities.png)

  *This plot compares the prevalence of comorbidity restrictions between high and low diversity studies.*

- **Laboratory Value Restrictions**:

  ![Laboratory Value Restrictions](plots/distribution_lab_values.png)

  *This plot compares the prevalence of laboratory value restrictions between high and low diversity studies.*

- **Stage/Size Restrictions**:

  ![Stage/Size Restrictions](plots/distribution_stage_size.png)

  *This plot compares the prevalence of tumor stage or size restrictions between high and low diversity studies.*

- **Site Restrictions**:

  ![Site Restrictions](plots/distribution_site.png)

  *This plot compares the prevalence of cancer site restrictions between high and low diversity studies.*

- **History of Treatment Restrictions**:

  ![History of Treatment Restrictions](plots/distribution_hx_of_tt.png)

  *This plot compares the prevalence of previous treatment history restrictions between high and low diversity studies.*

- **Miscellaneous Restrictions**:

  ![Miscellaneous Restrictions](plots/distribution_misc.png)

  *This plot compares the prevalence of other restrictions (such as smoking status or ethnicity requirements) between high and low diversity studies.*

- **Institutional Setting**:

  ![Single Institution Distribution](plots/distribution_is_single_institution.png)

  *This plot shows the distribution of single-institution vs. multi-institution studies among diverse and non-diverse trials.*

## Data Source

The data for this analysis was extracted from [ClinicalTrials.gov](https://clinicaltrials.gov/), focusing on head and neck cancer clinical trials conducted in the United States. Only studies that reported race information were included in the analysis.

## Conclusions

The analysis identified several factors that are associated with more diverse head and neck cancer clinical trials:

1. **Less restrictive eligibility criteria**: Studies with fewer restrictions tend to have more diverse participation.
   - Specific criteria that appear to impact diversity include age restrictions, performance score requirements, and histological type restrictions.

2. **Geographic location**: Studies in areas with more diverse populations have higher diversity scores.

3. **Institutional setting**: Different types of institutions show varying levels of success in recruiting diverse participants.

4. **Study size**: There is a relationship between the number of participants and diversity.

These findings suggest potential strategies for improving diversity in future clinical trials, such as revisiting eligibility criteria, focusing on inclusive recruitment strategies, and considering geographic factors when planning trial sites.

## Running the Analysis

### Prerequisites

- Python 3.7+
- Required packages: pandas, numpy, plotly, scipy

### Usage

```bash
# Run the main analysis script
python src/main.py
```

Or explore the Jupyter notebooks for an interactive analysis experience:

```bash
jupyter notebook "Analysis.ipynb"
jupyter notebook "Analysis top20 vs bottom20.ipynb"
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgements

This analysis was conducted as part of a research project examining diversity and inclusion in clinical trials for head and neck cancer.