a/README.md b/README.md
### GenoHos: AI-Powered Genomic Analysis with Molecular Phenotyping and RAG Chat Interface

With the rapid advancement of multi-omics technologies, healthcare institutions now generate vast amounts of genomic, proteomic, and metabolomic data. While these datasets hold the key to personalized medicine, they typically reside in disconnected systems, from sequencing machines to EHRs to research databases. Traditionally, integrating this data required teams of bioinformaticians to build complex analysis pipelines, slowing down critical treatment decisions.

Leveraging mutation data has led to the discovery of novel disease variants. For example, a breast cancer patient management system serves as a powerful platform for physicians to discover and analyze novel disease variants by integrating clinical and genomic data. By systematically recording patient demographics, treatment outcomes, and mutation profiles, the system enables:

**Pattern Recognition** - Identifies recurring mutation clusters across specific age groups, ethnicities, and geographic locations, revealing potential founder mutations or environmental risk factors.

**Variant Alert System** - Automatically flags novel variants and compares them against global databases (COSMIC, ClinVar), highlighting mutations with predicted clinical significance.

**Treatment Response Analysis** - Correlates specific mutations with drug efficacy, helping physicians identify biomarkers for treatment resistance or sensitivity.

**Collaborative Research** - Facilitates secure data sharing across institutions, creating a crowdsourced knowledge base for rare variants and atypical presentations.

**Predictive Modeling** - Uses accumulated data to forecast disease progression patterns and suggest personalized therapeutic approaches based on mutation profiles.

The system transforms routine clinical documentation into a dynamic discovery tool, where each new patient record contributes to our understanding of breast cancer heterogeneity. Physicians gain real-time insights into how specific mutations influence:

    1. Metastatic patterns
    2. Disease progression timelines
    3. Survival outcomes
    4. Therapeutic vulnerabilities

By making these correlations visible at the point of care, the platform accelerates the identification of novel disease variants and enables more precise, personalized treatment strategies, bridging the gap between genomic research and clinical practice.

#### Microsoft Fabric revolutionizes this process by providing:

Unified Data Lakehouse for harmonizing sequencing data, clinical records, and research repositories

Low-Code Transformations to clean and standardize omics data without extensive coding

Built-In ML Capabilities for running predictive models directly on Fabric notebooks

#### Members/Contributors
     1. Daniel Muthama (ML and Backend)
     2. Eunice Nduku (Data)
     3. Daniel Muruthi (Frontend)

### Overview

This AI-powered oncology platform analyzes integrated genomic, proteomic, and metabolomic data to predict disease outcomes and generate personalized treatment recommendations. The system utilizes Microsoft Fabric for multimodal data orchestration, Azure AI Search for biomedical evidence retrieval, and Azure OpenAI for clinical insights generation. Specifically designed for breast cancer research, it identifies pathogenic mutation patterns, detects clinically significant genomic variants, and synthesizes comprehensive reports highlighting therapeutic implications derived from multi-omics analysis.

#### Key improvements:

**Precision in Terminology:** Changed "disease recovery" to "disease outcomes" (more clinically accurate)

**Oncology Focus:** Added "specifically designed for breast cancer research"

**Technical Clarity:** Specified "pathogenic mutation patterns" and "clinically significant genomic variants"

**Flow:** Improved logical progression from data types → analysis → clinical outputs

**Professional Tone:** Used terms like "therapeutic implications" and "multi-omics analysis"

### Project Flow

<p align="center">
  <img src="output/x.jpg" alt="High-Level Architecture Diagram" width="1000">
  <br>
  <em>Figure 1: High-level architecture of the bioscience platform</em>
</p>

#### Services

    1. ML Model - Training/Classification/Fine-tuning
    2. RAG System - Bioscience RAG system
    3. Data Collector - breast_cancer_recorder
    4. Visualizations

### Project Structure

    GenomicAnalysisWorkspace/
    ├── BioEventHouse/                     # Eventhouse and KQL Database for genomic events
    │   ├── (Eventhouse data)
    │   └── (KQL Database)
    ├── BioEventHouse_queryset/            # KQL Queryset for querying genomic events
    ├── Biospecimen_RAG_System/            # Notebook for biospecimen RAG (Retrieval-Augmented Generation) system
    ├── Biospecimen_Report_Generator/      # Notebook for generating biospecimen reports
    ├── BiospecimenClassifier/             # Machine learning model for biospecimen classification
    ├── Data_Engineering/                  # Notebook for data engineering tasks
    ├── Genomel_H/                         # Lakehouse for genomic data
    │   ├── (Lakehouse data)
    │   ├── Semantic model
    │   └── SQL analytics endpoint
    ├── GenomicAnalysisPipeline/           # Notebook and experiment for genomic analysis
    │   ├── (Notebook)
    │   └── (Experiment)
    ├── GenomicDataProcessing/             # Notebook for genomic data processing
    └── model_deployment/                  # Notebook and experiment for model deployment
        ├── (Notebook)
        └── (Experiment)

<p align="center">
  <img src="output/x4.png" alt="Diagram 4">
  <br>
  <em>Figure 4: MS Fabric - GenomicAnalysisWorkspace</em>
</p>

<p align="center">
  <img src="output/x5.png" alt="Diagram 5">
  <br>
  <em>Figure 5: MS Fabric - GenomicAnalysisWorkspace2</em>
</p>

#### Breast Cancer Insight Analysis

<p align="center">
  <img src="output/x2.png" alt="Diagram 1" width="1000">
  <br>
  <em>Figure 2: Breast Cancer Recorder</em>
</p>

<p align="center">
  <img src="output/x3.png" alt="Diagram 2">
  <br>
  <em>Figure 3: Breast Cancer Recorder with Records</em>
</p>

##### Downloaded File: "breast_cancer_patients.csv"

#### File Structure

    breast-cancer-rag-app/
    ├── backend/
    │   ├── __pycache__/          # Python cached bytecode
    │   ├── venv/                 # Python virtual environment
    │   ├── .env                  # Environment variables
    │   ├── biospecimen_rag.py    # Core RAG implementation
    │   ├── Dockerfile            # Backend container configuration
    │   ├── main.py               # FastAPI entry point
    │   └── requirements.txt      # Python dependencies
    └── frontend/
        ├── node_modules/         # NPM packages
        ├── public/               # Static assets
        ├── src/
        │   ├── components/
        │   │   ├── QueryInterface.js  # Main query component
        │   │   ├── ResultsViewer.js   # Results display
        │   │   └── StatusIndicator.js # System status UI
        │   ├── services/
        │   │   └── apiService.js      # API communication
        │   ├── App.js           # Root React component
        │   ├── index.js         # React entry point
        │   ├── reportWebVitals.js # Performance tracking
        │   ├── styles.css       # Global styles
        │   └── .gitignore       # Frontend ignore rules
        ├── Dockerfile           # Frontend container config
        ├── package-lock.json    # Exact dependency tree
        └── package.json         # Project metadata (implied)

##### For the backend service to work, update the following environment variables:

            OPENAI_GPT4_DEPLOYMENT="gpt-4"
            OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
            OPENAI_API_KEY="<your-key>"
            OPENAI_ADA_DEPLOYMENT="text-embedding-ada-002"

            KUSTO_URI="https://trd-zdxwqrcu1znbqygpxg.z2.kusto.fabric.microsoft.com"
            KUSTO_DATABASE="BioEventHouse"
            KUSTO_TABLE="biospecimen_embeddings"

            AZURE_TENANT_ID="<tenant-id>"
            AZURE_CLIENT_ID="<client-id>"
            AZURE_CLIENT_SECRET="<client-secret>"

            APP_INSIGHTS_KEY=your-instrumentation-key
            SECRET_KEY=your-secret-key-for-flask

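
As a minimal sketch of how a backend could read these settings at startup, the snippet below checks that every variable named in the backend configuration is present and reports any that are unset or still hold a `<...>` placeholder. The helper name `load_settings` and the fail-fast behaviour are assumptions made for illustration; they are not taken from the repository.

```python
import os

# Variable names are copied from the backend configuration list;
# everything else in this sketch is illustrative.
REQUIRED_VARS = [
    "OPENAI_GPT4_DEPLOYMENT", "OPENAI_ENDPOINT", "OPENAI_API_KEY", "OPENAI_ADA_DEPLOYMENT",
    "KUSTO_URI", "KUSTO_DATABASE", "KUSTO_TABLE",
    "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET",
    "APP_INSIGHTS_KEY", "SECRET_KEY",
]


def load_settings(env=None):
    """Return a dict of required settings, raising if any are unset or placeholders."""
    env = os.environ if env is None else env
    settings, problems = {}, []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip().strip('"')
        # Treat empty values and untouched "<...>" placeholders as not configured.
        if not value or "<" in value:
            problems.append(name)
        else:
            settings[name] = value
    if problems:
        raise RuntimeError("Missing or placeholder settings: " + ", ".join(problems))
    return settings
```

Calling `load_settings()` once when the service starts surfaces a misconfigured `.env` immediately instead of at the first request.
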
<p align="center">
  <img src="output/x10.png" alt="Diagram 5">
  <br>
  <em>Figure 10: MS Fabric - Frontend</em>
</p>

<p align="center">
  <img src="output/x11.png" alt="Diagram 5">
  <br>
  <em>Figure 11: MS Fabric - Backend</em>
</p>

<p align="center">
  <img src="output/x18.png" alt="Diagram 5">
  <br>
  <em>Figure 18: MS Fabric - Frontend for a Backend (service working)</em>
</p>

#### File Structure

    breast-cancer-recorder/
    ├── build/
    ├── static/
    ├── asset-manifest.json
    ├── index.html
    ├── manifest.json
    ├── robots.txt
    ├── node_modules/
    ├── public/
    │   ├── index.html
    │   ├── manifest.json
    │   ├── robots.txt
    │   └── reports/                      # New directory for research reports
    │       └── breast_cancer_report.md   # Added markdown report
    ├── src/
    │   ├── components/
    │   │   ├── DataTable.js
    │   │   └── PatientForm.js
    │   ├── utils/
    │   ├── App.js
    │   ├── index.js
    │   ├── reportWebVitals.js
    │   ├── styles.css
    │   └── report-components/            # New directory for report components
    │       ├── ReportViewer.js           # Component to display the markdown
    │       └── ReportGenerator.js        # Component to generate dynamic reports
    ├── .env
    ├── .gitignore
    ├── package.json
    └── vercel.json

The above React application collects breast cancer patient data including location, age, cancer stage, weight, and email. It features form validation to ensure accurate data entry, with age restricted to 18-120 years, weight validated as a positive number, and proper email formatting. Collected data can be downloaded as CSV for analysis. The app includes error handling and success notifications. The clean interface prioritizes usability while maintaining data integrity, making it suitable for medical professionals to record and manage patient information efficiently.

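
The validation rules described for the recorder form (age 18-120, positive weight, well-formed email) can be sketched as follows. The application itself is written in React, so this Python version only mirrors the rules for use on the analysis side; the function name `validate_patient`, the field names, and the deliberately simple email pattern are assumptions.

```python
import re

# A deliberately simple pattern: something@something.tld with no whitespace.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_patient(record):
    """Return a list of error messages for one form submission (empty if valid)."""
    errors = []
    age = record.get("age")
    if type(age) is not int or not 18 <= age <= 120:
        errors.append("age must be a whole number between 18 and 120")
    weight = record.get("weight")
    if not _is_number(weight) or weight <= 0:
        errors.append("weight must be a positive number")
    if not EMAIL_RE.match(str(record.get("email", ""))):
        errors.append("email is not well formed")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller show every problem with a record at once, which matches the error-notification behaviour described for the form.
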
#### Data Wrangling and Power BI Analysis

This Medallion pipeline ingests raw breast cancer patient data (bronze), cleans and validates it (silver), and enriches it with analytical features (gold) in Fabric's Lakehouse. The process includes data type conversion, email validation, weight normalization, and risk categorization. Power BI connects via Direct Lake for real-time visualization of age groups, cancer stages, and geographic distributions. For long-term storage, gold data is exported to AWS S3 in Parquet format. The pipeline enables specialists to identify patterns and generate personalized prevention advice based on historical correlations between patient demographics and cancer progression. Microsoft Fabric streamlines this workflow with integrated Spark processing, Delta Lake storage, and Power BI analytics in one platform.

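
The bronze, silver, and gold stages can be sketched in plain Python as below. In Fabric this logic would presumably run as PySpark over Delta tables; the column names, the one-decimal weight normalization, the age-group buckets, and the stage-based risk rule are all assumptions chosen for illustration, not the project's actual transformations.

```python
import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def to_silver(bronze_rows):
    """Clean raw string-valued rows: cast types, validate email, drop bad rows."""
    silver = []
    for row in bronze_rows:
        try:
            age = int(row["age"])
            stage = int(row["cancer_stage"])
            weight = round(float(row["weight"]), 1)  # normalize to one decimal (kg assumed)
        except (KeyError, ValueError, TypeError):
            continue  # unparseable rows remain in bronze only
        email = str(row.get("email", "")).strip().lower()
        if weight <= 0 or not EMAIL_RE.match(email):
            continue
        silver.append({"location": row.get("location", ""), "age": age,
                       "cancer_stage": stage, "weight": weight, "email": email})
    return silver


def to_gold(silver_rows):
    """Add analytical features; the risk rule is a placeholder, not clinical guidance."""
    gold = []
    for row in silver_rows:
        decade = (row["age"] // 10) * 10
        enriched = dict(row)
        enriched["age_group"] = f"{decade}-{decade + 9}"
        enriched["risk_category"] = "high" if row["cancer_stage"] >= 3 else "standard"
        gold.append(enriched)
    return gold
```

Keeping each stage as a pure function of the previous stage's rows means the bronze data is never modified, so the silver and gold layers can be rebuilt whenever a rule changes.
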
<p align="center">
  <img src="output/x6.png" alt="Diagram 5">
  <br>
  <em>Figure 6: MS Fabric - Breast_Cancer_LakeHouse</em>
</p>

<p align="center">
  <img src="output/x7.png" alt="Diagram 5">
  <br>
  <em>Figure 7: MS Fabric - Breast_Cancer_LakeHouse</em>
</p>

<p align="center">
  <img src="output/x8.png" alt="Diagram 5">
  <br>
  <em>Figure 8: MS Fabric - Breast_Cancer_LakeHouse</em>
</p>

<p align="center">
  <img src="output/x9.png" alt="Diagram 5">
  <br>
  <em>Figure 9: MS Fabric - Breast_Cancer_LakeHouse</em>
</p>

### Breast Cancer Patient Analytics Report

*Generated from Lakehouse Pipeline – {{date}}*

#### 1. Key Demographics

- **Total Patients:** `{{gold_df.count()}}`
- **Average Age:** `{{stage_analysis_df.select(avg("avg_age")).first()[0]}}` years
- **Weight Distribution:**
  - Mean: `{{stage_analysis_df.select(avg("avg_weight")).first()[0]}}` kg
  - Std Dev: `{{stage_analysis_df.select(stddev("avg_weight")).first()[0]}}` kg

#### 2. Stage Distribution

| Stage | Patients (%) | Avg Age | Top Location |
|-------|-------------|---------|--------------|
| 1     | `{{count_stage1/total*100}}`% | `{{age_stage1}}` | `{{top_loc_stage1}}` |
| 2     | `{{count_stage2/total*100}}`% | `{{age_stage2}}` | `{{top_loc_stage2}}` |
| ...   | ...         | ...     | ...          |

**Insight:** Early-stage (1-2) diagnoses are most prevalent in `{{top_location}}`.

#### 3. Temporal Trends

![Diagnosis Over Time]
- **Peak Diagnoses:** `{{year_with_max_cases}}`
- **Recent Change:** `{{last_3_years_trend}}` (↑/↓)

#### 4. AI-Generated Prevention Insights

**For Stage {{X}} Patients (Age {{Y}}):**
> "Patients at this stage should prioritize {{GPT-4_advice}}..."

#### 5. Data Quality Notes

- **Complete Records:** `{{valid_records/total*100}}`%
- **Missing Data:**
  - `{{null_cancer_stage}}` missing stage labels
  - `{{null_weight}}` missing weight entries

### Methodology

- **Data Source:** `breast_cancer_patients.csv`
- **Pipeline:**
  - **Bronze:** Raw ingestion
  - **Silver:** PII pseudonymization + cleaning
  - **Gold:** Analytics + AI enrichment (GPT-4)
- **Tools:** Microsoft Fabric, Power BI, PySpark

### Key Components

### Data Storage
- **BioEventHouse**: Eventhouse and KQL Database for genomic event data
- **Genomel_H**: Lakehouse for genomic data with semantic model and SQL analytics

### Analysis Tools
- Multiple Jupyter notebooks for various genomic analysis tasks
- Experiment tracking for machine learning workflows

The genomic machine learning pipeline uses MLflow for model management and reproducibility. Key aspects include:

#### Pipeline Structure

**Artifact Tracking:** Model files (model.pkl), environment specifications (conda.yaml, python_env.yaml), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.

**Runtime Metrics:** Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.

#### MLflow Integration

The MLmodel file defines metadata for model deployment, including:

**Dependencies:** Conda/virtualenv environments to replicate training conditions.

**Model Specifications:** Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.

**Version Control:** Explicit library versions (sklearn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.

#### Workflow Efficiency

Unique run_id and experiment IDs enable traceability across genomic analyses.

**Implications:** This setup ensures reproducibility (via environment isolation), scalability (through MLflow’s tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.

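
The model signature described under Model Specifications (21 float64 input features, int64 output labels) can be checked before data reaches the classifier. The sketch below uses only the standard library; the helper names `check_batch` and `check_labels` are hypothetical and are not part of MLflow or of this project's notebooks.

```python
N_FEATURES = 21  # input width stated in the MLmodel description


def check_batch(rows):
    """Coerce each row to 21 floats, raising ValueError on a shape or type mismatch."""
    checked = []
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(f"row {i}: expected {N_FEATURES} features, got {len(row)}")
        try:
            checked.append([float(v) for v in row])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"row {i}: non-numeric feature value") from exc
    return checked


def check_labels(labels):
    """Predicted labels should be plain integers (int64 in the logged schema)."""
    if not all(isinstance(y, int) and not isinstance(y, bool) for y in labels):
        raise ValueError("labels must be integers")
    return list(labels)
```

Rejecting malformed rows at this boundary keeps shape errors out of the model call itself, where they tend to produce less readable messages.
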
290
290
291
#### 3. Temporal Trends  
291
292
292
293
![Diagnosis Over Time]  
293
294
- **Peak Diagnoses:** `{{year_with_max_cases}}`  
294
### Machine Learning
295
- **Recent Change:** `{{last_3_years_trend}}` (↑/↓)  
295
- **BiospecimenClassifier**: ML model for biospecimen classification  
296
296
- Model deployment experiments  
297
#### 4. AI-Generated Prevention Insights  
297
298
298
### Reporting
299
**For Stage {{X}} Patients (Age {{Y}}):**  
299
- **Biospecimen_RAG_System**: Retrieval-Augmented Generation system  
300
> "Patients at this stage should prioritize {{GPT-4_advice}}..."
300
- **Biospecimen_Report_Generator**: Automated report generation  
301
301
302
#### 5. Data Quality Notes  
302
### Setup Instructions
303
303
304
- **Complete Records:** `{{valid_records/total*100}}`%  
304
#### 1. Prerequisites
305
- **Missing Data:**  
305
- **Azure Account**: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
306
  - `{{null_cancer_stage}}` missing stage labels  
306
- **Python 3.8+**: Install Python and required libraries.
307
  - `{{null_weight}}` missing weight entries  
307
- **Power BI Desktop**: For creating visualizations.
308
308
- **Microsoft Fabric Workspace**: With contributor permissions.
309
### Methodology  
309
- **Genomic Datasets**: Access to required genomic data sources.
310
310
311
- **Data Source:** `breast_cancer_patients.csv`  
311
#### 2. Install Dependencies
312
- **Pipeline:**  
312
Install the required Python libraries:
313
  - **Bronze:** Raw ingestion  
313
314
  - **Silver:** PII pseudonymization + cleaning  
314
### 3. Configure Azure Resources
315
  - **Gold:** Analytics + AI enrichment (GPT-4)  
315
#### Microsoft Fabric:
316
- **Tools:** Microsoft Fabric, Power BI, PySpark  
316
317
317
Create a Fabric workspace and set up OneLake.
318
318
319
### Key Components
319
#### Azure OpenAI:
320
320
321
### Data Storage
321
Set up an OpenAI resource and deploy a GPT-4 model.
322
- **BioEventHouse**: Eventhouse and KQL Database for genomic event data  
322
323
- **Genomel_H**: Lakehouse for genomic data with semantic model and SQL analytics  
323
### 4. Update Configuration
324
324
Replace placeholders (e.g., <api_key>, <connection_string>) in the code with your Azure resource details.
325
### Analysis Tools
325
326
- Multiple Jupyter notebooks for various genomic analysis tasks  
326
### Visualization
327
- Experiments tracking for machine learning workflows  
327
Use Power BI to create interactive dashboards for visualizing:
328
328
329
The provided observations outline a genomic machine learning pipeline leveraging MLflow for model management and reproducibility. Key aspects include:
329
Molecular phenotyping profiles.
330
330
331
#### Pipeline Structure
331
Top biomarkers for disease recovery.
332
332
333
Artifact Tracking: Model files (model.pkl), environment specifications (conda.yaml, python_env.yaml), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.
333
Trends in gene expression.
334
334
335
Runtime Metrics: Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.
335
### Contributing
336
336
Contributions are welcome! Please follow these steps:
337
<p align="center">
337
338
  <img src="output/x16.png" alt="Diagram 5">
338
### Fork the repository.
339
  <br>
339
340
  <em>Figure 16: MS Fabric - GenomicAnalysisWorkspace</em>
340
**Repository:** [https://github.com/danielmuthama23/Genomic_Analysis.git](#)  
341
</p>
341
342
342
343
<p align="center">
343
### Summary
344
  <img src="output/x17.png" alt="Diagram 5">
344
345
  <br>
345
#### 1. Genomic Analysis Report: Mutation-Disease Association Detection
346
  <em>Figure 17: MS Fabric - GenomicAnalysisWorkspace</em>
346
347
</p>
347
This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.  
348
348
349
349
**Analysis Methods:**  
350
#### MLflow Integration

The `MLmodel` file defines metadata for model deployment, including:

**Dependencies:** Conda/virtualenv environments to replicate training conditions.

**Model Specifications:** Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.

**Version Control:** Explicit library versions (scikit-learn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.
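
As a minimal sketch of what the schema above implies at inference time, the check below confirms that an input row carries 21 numeric features and that a prediction is an integer label. The function names are hypothetical, and this is plain Python for illustration, not MLflow's own schema enforcement.

```python
from typing import List, Sequence

N_FEATURES = 21  # input width stated in the model specification


def validate_input_row(row: Sequence[float]) -> List[float]:
    """Return the row as floats, or raise ValueError if it does not fit the schema."""
    if len(row) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(row)}")
    try:
        return [float(value) for value in row]
    except (TypeError, ValueError) as exc:
        raise ValueError(f"non-numeric feature value: {exc}") from exc


def validate_label(label: object) -> int:
    """Return the label unchanged if it is an integer; reject booleans and other types."""
    if isinstance(label, bool) or not isinstance(label, int):
        raise ValueError(f"expected an integer label, got {label!r}")
    return label
```

Running a check like this before calling the model surfaces shape mismatches early, before they turn into harder-to-read errors inside the classifier.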

#### Workflow Efficiency

Unique `run_id` and experiment IDs enable traceability across genomic analyses.

**Implications:** This setup supports reproducibility (via environment isolation), scalability (through MLflow's tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.

<p align="center">
  <img src="output/x12.png" alt="Diagram 5">
  <br>
  <em>Figure 12: MS Fabric - Genomic Analysis Pipeline</em>
</p>

<p align="center">
  <img src="output/x13.png" alt="Diagram 5">
  <br>
  <em>Figure 13: MS Fabric - Genomic Analysis Pipeline</em>
</p>

<p align="center">
  <img src="output/x14.png" alt="Diagram 5">
  <br>
  <em>Figure 14: MS Fabric - Genomic Analysis Pipeline</em>
</p>

<p align="center">
  <img src="output/x15.png" alt="Diagram 5">
  <br>
  <em>Figure 15: MS Fabric - Genomic Analysis Pipeline</em>
</p>

### Machine Learning
- **BiospecimenClassifier**: ML model for biospecimen classification  
- Model deployment experiments  

### Reporting
- **Biospecimen_RAG_System**: Retrieval-Augmented Generation system  
- **Biospecimen_Report_Generator**: Automated report generation  

### Setup Instructions

#### 1. Prerequisites
- **Azure Account**: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
- **Python 3.8+**: Install Python and required libraries.
- **Power BI Desktop**: For creating visualizations.
- **Microsoft Fabric Workspace**: With contributor permissions.
- **Genomic Datasets**: Access to required genomic data sources.

#### 2. Install Dependencies
Install the required Python libraries:

#### 3. Configure Azure Resources

**Microsoft Fabric:** Create a Fabric workspace and set up OneLake.

**Azure OpenAI:** Set up an Azure OpenAI resource and deploy a GPT-4 model.

#### 4. Update Configuration
Replace placeholders (e.g., `<api_key>`, `<connection_string>`) in the code with your Azure resource details.
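
One way to avoid committing secrets when filling in these placeholders is to read them from environment variables instead of editing the code. The sketch below assumes two hypothetical variable names (`AZURE_OPENAI_API_KEY` and `AZURE_SEARCH_CONNECTION_STRING`); they are illustrative and not names used by this repository.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical variable names; use whatever names your deployment defines.
REQUIRED_SETTINGS = ("AZURE_OPENAI_API_KEY", "AZURE_SEARCH_CONNECTION_STRING")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect required settings, failing early if any is missing or still a placeholder."""
    source = os.environ if env is None else env
    settings: Dict[str, str] = {}
    missing = []
    for name in REQUIRED_SETTINGS:
        value = source.get(name, "").strip()
        # Treat unreplaced placeholders such as "<api_key>" as missing.
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
        else:
            settings[name] = value
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return settings
```

Calling `load_settings()` at startup makes a forgotten or unreplaced placeholder fail immediately with a clear message, before any Azure request is attempted.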

### Visualization
Use Power BI to create interactive dashboards for visualizing:

- Molecular phenotyping profiles
- Top biomarkers for disease recovery
- Trends in gene expression

### Contributing
Contributions are welcome! Please start by forking the repository.

**Repository:** [https://github.com/danielmuthama23/Genomic_Analysis.git](https://github.com/danielmuthama23/Genomic_Analysis.git)  

### Summary

#### 1. Genomic Analysis Report: Mutation-Disease Association Detection

This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.  

**Analysis Methods:**  

- Mutation frequency analysis of key cancer genes  
- Protein-protein interaction networks to identify functional clusters  
- Metabolic pathway mapping to detect dysregulated processes  

**Key Datasets:**  

- `PDC_biospecimen_manifest_03272025_214257.csv`  
- Embedded mock genomic data for testing and validation  

---

### 2. Key Findings  

#### 2.1 Mutation-Disease Associations 

![Mutation Counts]

**Top Pathogenic Mutations:**  

| Gene   | Mutation Count | Disease-Associated | Percentage |
|--------|----------------|--------------------|------------|
| TP53   | 8              | 8                  | 100%       |
| PIK3CA | 5              | 5                  | 100%       |
| BRCA1  | 4              | 4                  | 100%       |

**Insights:**  

- All 8 **TP53 mutations** were disease-associated (100%), consistent with its role as a primary driver.  
- **PIK3CA** and **BRCA1** mutations were likewise all disease-associated (5 of 5 and 4 of 4).  

---

<p align="center">
  <img src="output/mutation_disease_counts" alt="Diagram 5">
  <br>
  <em>Figure 19: MS Fabric - Mutation</em>
</p>

#### 2.2 Protein Interaction Network  

![Protein Network]

**Critical Hubs (High Connectivity):**  

1. **TP53** (4 interactions)  
2. **BRCA1** (3 interactions)  
3. **PIK3CA** (3 interactions)  

**Key Observations:**  

- Red nodes (PDC-identified proteins) formed central hubs.  
- Green edges (activation) dominated oncogenic pathways (e.g., PIK3CA→AKT1).  

---

<p align="center">
  <img src="output/protein_network" alt="Diagram 5">
  <br>
  <em>Figure 21: MS Fabric - Protein Network</em>
</p>

#### 2.3 Metabolic Pathway Dysregulation  

![Metabolic Pathways]  

**Most Dysregulated Pathways:**  

1. **Glycolysis** (↑ Glucose-6-P, Fructose-1,6-BP)  
2. **TCA Cycle** (↓ Succinyl-CoA, ↑ Acetyl-CoA)  
3. **Fatty Acid Synthesis** (↑ Malonyl-CoA)  

**Top Dysregulated Metabolite:**  

- **Acetyl-CoA** (2.1-fold change, linked to PTEN mutations).  
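
For readers unfamiliar with the metric, a fold change is the ratio of a metabolite's level in the samples of interest to its level in a reference group. The sketch below shows the arithmetic; the abundance values are hypothetical, chosen only to reproduce a 2.1-fold change, and the report does not state which reference group was used.

```python
import math


def fold_change(case_level: float, reference_level: float) -> float:
    """Ratio of the case level to the reference level (both must be positive)."""
    if case_level <= 0 or reference_level <= 0:
        raise ValueError("levels must be positive")
    return case_level / reference_level


def log2_fold_change(case_level: float, reference_level: float) -> float:
    """Log2 of the fold change, so increases and decreases are symmetric around 0."""
    return math.log2(fold_change(case_level, reference_level))


# Hypothetical abundances (arbitrary units), not values from the dataset.
fc = fold_change(4.2, 2.0)
print(f"fold change = {fc:.1f}, log2 fold change = {math.log2(fc):.2f}")
```

Reporting the log2 value alongside the raw ratio makes up- and down-regulation directly comparable: a doubling is +1 and a halving is -1.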

---

<p align="center">
  <img src="output/metabolic_pathways" alt="Diagram 5">
  <br>
  <em>Figure 20: MS Fabric - Metabolic Pathway</em>
</p>

### 3. Disease Detection Methodology  

#### 3.1 Mutation-Based Detection 

- **Thresholds:** Genes in which more than 70% of mutations are disease-associated are flagged as high-risk.  
- **Validation:** Flagged genes were cross-referenced with the COSMIC database.  
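
The threshold rule can be sketched as follows, using the counts from the Top Pathogenic Mutations table. The function name and data layout are illustrative assumptions, not the notebook's actual implementation, and the COSMIC cross-reference step is not shown.

```python
from typing import Dict, List, Tuple

HIGH_RISK_THRESHOLD = 0.70  # fraction of mutations that are disease-associated

# gene -> (total mutations, disease-associated mutations), from the table in section 2.1
MUTATION_COUNTS: Dict[str, Tuple[int, int]] = {
    "TP53": (8, 8),
    "PIK3CA": (5, 5),
    "BRCA1": (4, 4),
}


def flag_high_risk(
    counts: Dict[str, Tuple[int, int]],
    threshold: float = HIGH_RISK_THRESHOLD,
) -> List[str]:
    """Return genes whose disease-associated fraction is strictly above the threshold."""
    flagged = []
    for gene, (total, associated) in counts.items():
        if total > 0 and associated / total > threshold:
            flagged.append(gene)
    return flagged


print(flag_high_risk(MUTATION_COUNTS))  # ['TP53', 'PIK3CA', 'BRCA1']
```

With counts this small (4 to 8 mutations per gene), every gene in the table clears the threshold; a larger cohort would be needed for the cutoff to discriminate between genes.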

#### 3.2 Network Analysis  

- Prioritized **hub genes** (e.g., TP53) as biomarkers.  
- **Inhibition edges** (red) highlighted drug targets (e.g., PTEN→AKT1).  
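
A minimal sketch of both steps, ranking nodes by connectivity and picking out inhibition edges, is shown below. The edge list contains only the two interactions named in this report, so it is deliberately partial and does not reproduce the hub counts listed in section 2.2; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source, target, interaction type)

# Only the two edges named in the report; the full network is not reproduced here.
EDGES: List[Edge] = [
    ("PIK3CA", "AKT1", "activation"),
    ("PTEN", "AKT1", "inhibition"),
]


def rank_by_degree(edges: List[Edge]) -> List[Tuple[str, int]]:
    """Count interactions per node and sort by descending degree, then by name."""
    degree: Counter = Counter()
    for source, target, _ in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


def inhibition_edges(edges: List[Edge]) -> List[Tuple[str, str]]:
    """Return (source, target) pairs for edges labeled as inhibition."""
    return [(source, target) for source, target, kind in edges if kind == "inhibition"]
```

Degree is the simplest hub measure; other centrality measures could be substituted if the analysis calls for them.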

#### 3.3 Metabolic Insights  

- Glycolysis/TCA cycle disruptions correlated with TP53/PIK3CA mutations.  
- High Acetyl-CoA suggests vulnerability to metabolic inhibitors.  

---

### 4. Conclusions & Recommendations  

From the analysis, we can conclude:

**Diagnostic Markers:**  

- **TP53 mutations** as universal biomarkers.  
- **PIK3CA activation** signals aggressive subtypes.  

**Therapeutic Targets:**  

- Target **PIK3CA-AKT1 interactions**.  
- Explore **metabolic inhibitors** for Acetyl-CoA-overproducing tumors.  

**Future Work:**  

- Validate with clinical outcomes data.  
- Expand analysis to RNA-seq.  

---

### 5. Files Generated  

| File                          | Description                                  |
|-------------------------------|----------------------------------------------|
| `mutation_disease_counts.png` | Top mutated genes with disease associations. |
| `protein_network.png`         | Protein interaction network with PDC hubs.   |
| `metabolic_pathways.png`      | Dysregulated metabolic pathways.             |

---

**Prepared by:** Daniel Muthama  
**Date:** April 2, 2025  
**Contact:** [danielmuthama23@gmail.com](mailto:danielmuthama23@gmail.com)  

---

### How to Use This Report  

- **Clinicians:** Focus on TP53/PIK3CA status for patient stratification.  
- **Researchers:** Explore metabolic pathways for novel drug combinations.  
- **Data Teams:** Replicate the pipeline using `DataEngineering.tex`.  

### License

This project is licensed under the MIT License. See the LICENSE file for details.

### Contact

For questions or feedback, please contact Daniel Muthama at [danielmuthama23@gmail.com](mailto:danielmuthama23@gmail.com).

#### Acknowledgments

- Microsoft Fabric for data orchestration.
- Azure AI Search for retrieval.
- Azure OpenAI for natural language generation.