### GenoHos: AI-Powered Genomic Analysis with Molecular Phenotyping and RAG Chat Interface
2
3
With the rapid advancement of multi-omics technologies, healthcare institutions now generate vast amounts of genomic, proteomic, and metabolomic data. While these datasets hold the key to personalized medicine, they typically reside in disconnected systems - from sequencing machines to EHRs to research databases. Traditionally, integrating this data required teams of bioinformaticians to build complex analysis pipelines, slowing down critical treatment decisions.
4
5
Leveraging mutation data has led to the discovery of novel disease variants. For example, a breast cancer patient management system gives physicians a platform to discover and analyze novel disease variants by integrating clinical and genomic data. By systematically recording patient demographics, treatment outcomes, and mutation profiles, the system enables:
6
7
**Pattern Recognition** - Identifies recurring mutation clusters across specific age groups, ethnicities, and geographic locations, revealing potential founder mutations or environmental risk factors.
8
9
**Variant Alert System** - Automatically flags novel variants and compares them against global databases (COSMIC, ClinVar), highlighting mutations with predicted clinical significance.
10
11
**Treatment Response Analysis** - Correlates specific mutations with drug efficacy, helping physicians identify biomarkers for treatment resistance or sensitivity.
12
13
**Collaborative Research** - Facilitates secure data sharing across institutions, creating a crowdsourced knowledge base for rare variants and atypical presentations.
14
15
**Predictive Modeling** - Uses accumulated data to forecast disease progression patterns and suggest personalized therapeutic approaches based on mutation profiles.
16
17
The system transforms routine clinical documentation into a dynamic discovery tool, where each new patient record contributes to our understanding of breast cancer heterogeneity. Physicians gain real-time insights into how specific mutations influence:
18
19
    1. Metastatic patterns
20
21
    2. Disease progression timelines
22
23
    3. Survival outcomes
24
25
    4. Therapeutic vulnerabilities
26
27
By making these correlations visible at the point of care, the platform accelerates the identification of novel disease variants and enables more precise, personalized treatment strategies - bridging the gap between genomic research and clinical practice.
28
29
#### Microsoft Fabric revolutionizes this process by providing:
30
31
- **Unified Data Lakehouse** for harmonizing sequencing data, clinical records, and research repositories
- **Low-Code Transformations** to clean and standardize omics data without extensive coding
- **Built-In ML Capabilities** for running predictive models directly in Fabric notebooks
36
37
#### Members/Contributors
38
     1. Daniel Muthama (ML and Backend)
39
     2. Eunice Nduku (Data)
40
     3. Daniel Muruthi (Frontend)
41
42
43
### Overview
44
45
This AI-powered oncology platform analyzes integrated genomic, proteomic, and metabolomic data to predict disease outcomes and generate personalized treatment recommendations. The system utilizes Microsoft Fabric for multimodal data orchestration, Azure AI Search for biomedical evidence retrieval, and Azure OpenAI for clinical insights generation. Specifically designed for breast cancer research, it identifies pathogenic mutation patterns, detects clinically significant genomic variants, and synthesizes comprehensive reports highlighting therapeutic implications derived from multi-omics analysis.
46
47
#### Key improvements:
48
49
**Precision in Terminology:** Changed "disease recovery" to "disease outcomes" (more clinically accurate)
50
51
**Oncology Focus:** Added "specifically designed for breast cancer research"
52
53
**Technical Clarity:** Specified "pathogenic mutation patterns" and "clinically significant genomic variants"
54
55
**Flow:** Improved logical progression from data types → analysis → clinical outputs
56
57
**Professional Tone:** Used terms like "therapeutic implications" and "multi-omics analysis"
58
59
### Project Flow
60
61
<p align="center">
62
  <img src="output/x.jpg" alt="High-Level Architecture Diagram" width="1000">
63
  <br>
64
  <em>Figure 1: High-level architecture of the bioscience platform</em>
65
</p>
66
67
#### Services

    1. ML Model - Training / Classification / Fine-tuning
    2. RAG System - Bioscience RAG system
    3. Data Collector - breast_cancer_recorder
    4. Visualizations

### Project Structure

    GenomicAnalysisWorkspace/
    ├── BioEventHouse/                     # Eventhouse and KQL Database for genomic events
    │   ├── (Eventhouse data)
    │   └── (KQL Database)
    ├── BioEventHouse_queryset/            # KQL Queryset for querying genomic events
    ├── Biospecimen_RAG_System/            # Notebook for biospecimen RAG (Retrieval-Augmented Generation) system
    ├── Biospecimen_Report_Generator/      # Notebook for generating biospecimen reports
    ├── BiospecimenClassifier/             # Machine learning model for biospecimen classification
    ├── Data_Engineering/                  # Notebook for data engineering tasks
    ├── Genomel_H/                         # Lakehouse for genomic data
    │   ├── (Lakehouse data)
    │   ├── Semantic model
    │   └── SQL analytics endpoint
    ├── GenomicAnalysisPipeline/           # Notebook and experiment for genomic analysis
    │   ├── (Notebook)
    │   └── (Experiment)
    ├── GenomicDataProcessing/             # Notebook for genomic data processing
    └── model_deployment/                  # Notebook and experiment for model deployment
        ├── (Notebook)
        └── (Experiment)

<p align="center">
108
  <img src="output/x4.png" alt="Diagram 4">
109
  <br>
110
  <em>Figure 4: MS Fabric - GenomicAnalysisWorkspace</em>
111
</p>
112
113
<p align="center">
114
  <img src="output/x5.png" alt="Diagram 5">
115
  <br>
116
  <em>Figure 5: MS Fabric - GenomicAnalysisWorkspace2</em>
117
</p>
118
119
120
#### Breast Cancer Insight Analysis
121
<p align="center">
122
  <img src="output/x2.png" alt="Diagram 1" width="1000">
123
  <br>
124
  <em>Figure 2: Breast Cancer Recorder</em>
125
</p>
126
127
<p align="center">
128
  <img src="output/x3.png" alt="Diagram 2">
129
  <br>
130
  <em>Figure 3: Breast Cancer Recorder with Records</em>
131
</p>
132
133
##### Downloaded File: "breast_cancer_patients.csv"
134
135
#### File Structure

    breast-cancer-rag-app/
    ├── backend/
    │   ├── __pycache__/          # Python cached bytecode
    │   ├── venv/                 # Python virtual environment
    │   ├── .env                  # Environment variables
    │   ├── biospecimen_rag.py    # Core RAG implementation
    │   ├── Dockerfile            # Backend container configuration
    │   ├── main.py               # FastAPI entry point
    │   └── requirements.txt      # Python dependencies
    └── frontend/
        ├── node_modules/         # NPM packages
        ├── public/               # Static assets
        ├── src/
        │   ├── components/
        │   │   ├── QueryInterface.js  # Main query component
        │   │   ├── ResultsViewer.js   # Results display
        │   │   └── StatusIndicator.js # System status UI
        │   ├── services/
        │   │   └── apiService.js      # API communication
        │   ├── App.js           # Root React component
        │   ├── index.js         # React entry point
        │   ├── reportWebVitals.js # Performance tracking
        │   ├── styles.css       # Global styles
        │   └── .gitignore       # Frontend ignore rules
        ├── Dockerfile           # Frontend container config
        ├── package-lock.json    # Exact dependency tree
        └── package.json         # Project metadata (implied)

<p align="center">
167
  <img src="output/x10.png" alt="Diagram 5">
168
  <br>
169
  <em>Figure 10: MS Fabric - Frontend</em>
170
</p>
171
172
<p align="center">
173
  <img src="output/x11.png" alt="Diagram 5">
174
  <br>
175
  <em>Figure 11: MS Fabric - Backend</em>
176
</p>
177
178
<p align="center">
179
  <img src="output/x18.png" alt="Diagram 5">
180
  <br>
181
  <em>Figure 18: MS Fabric - Frontend for a Backend(service working)</em>
182
</p>
183
184
##### For the backend service to work, update the following settings:

            OPENAI_GPT4_DEPLOYMENT="gpt-4"
            OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
            OPENAI_API_KEY="<your-key>"
            OPENAI_ADA_DEPLOYMENT="text-embedding-ada-002"

            KUSTO_URI="https://trd-zdxwqrcu1znbqygpxg.z2.kusto.fabric.microsoft.com"
            KUSTO_DATABASE="BioEventHouse"
            KUSTO_TABLE="biospecimen_embeddings"

            AZURE_TENANT_ID="<tenant-id>"
            AZURE_CLIENT_ID="<client-id>"
            AZURE_CLIENT_SECRET="<client-secret>"

            APP_INSIGHTS_KEY=your-instrumentation-key
            SECRET_KEY=your-secret-key-for-flask
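
A backend could fail fast when one of the settings above is missing or still holds a placeholder. The following is a minimal sketch, not the project's actual startup code. It assumes the settings are exposed as environment variables (for example via `backend/.env`) and that the OpenAI, Kusto, and Azure identity values are all required. The names `REQUIRED_SETTINGS` and `load_settings` are hypothetical.

```python
import os

# Setting names come from the list above. Treating every one of them as
# required is an assumption made for this sketch.
REQUIRED_SETTINGS = [
    "OPENAI_GPT4_DEPLOYMENT", "OPENAI_ENDPOINT", "OPENAI_API_KEY",
    "OPENAI_ADA_DEPLOYMENT",
    "KUSTO_URI", "KUSTO_DATABASE", "KUSTO_TABLE",
    "AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET",
]


def load_settings(env=None):
    """Return the required settings as a dict, or raise naming what is missing."""
    env = os.environ if env is None else env
    # A value that is empty or still looks like "<your-key>" counts as missing.
    missing = [
        name for name in REQUIRED_SETTINGS
        if not env.get(name) or env[name].startswith("<")
    ]
    if missing:
        raise RuntimeError("Missing or placeholder settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_SETTINGS}
```

Calling `load_settings()` once at startup turns a misconfigured deployment into an immediate, readable error instead of a failure on the first query.
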
201
202
#### File Structure

    breast-cancer-recorder/
    ├── build/
    ├── static/
    ├── asset-manifest.json
    ├── index.html
    ├── manifest.json
    ├── robots.txt
    ├── node_modules/
    ├── public/
    │   ├── index.html
    │   ├── manifest.json
    │   ├── robots.txt
    │   └── reports/                      # New directory for research reports
    │       └── breast_cancer_report.md   # Added markdown report
    ├── src/
    │   ├── components/
    │   │   ├── DataTable.js
    │   │   └── PatientForm.js
    │   ├── utils/
    │   ├── App.js
    │   ├── index.js
    │   ├── reportWebVitals.js
    │   ├── styles.css
    │   └── report-components/            # New directory for report components
    │       ├── ReportViewer.js           # Component to display the markdown
    │       └── ReportGenerator.js        # Component to generate dynamic reports
    ├── .env
    ├── .gitignore
    ├── package.json
    └── vercel.json

The React application above collects breast cancer patient data, including location, age, cancer stage, weight, and email. Form validation ensures accurate data entry: age is restricted to 18-120 years, weight must be a positive number, and email addresses must be properly formatted. Collected data can be downloaded as CSV for analysis. The app includes error handling and success notifications. The clean interface prioritizes usability while maintaining data integrity, making it suitable for medical professionals to record and manage patient information efficiently.
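
The validation rules described above can also be expressed outside the form, for example to re-check a downloaded CSV. The sketch below is written in Python rather than the app's React code and is not taken from the project. The field names (`age`, `weight`, `email`), the helper `validate_patient`, and the simplified email pattern are assumptions for illustration.

```python
import math
import re

# Simplified email pattern for illustration only; the pattern the React form
# actually uses is not shown in this README.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_patient(record):
    """Return a list of validation errors for one record (empty if valid)."""
    errors = []

    # Age must be a whole number between 18 and 120.
    try:
        age = int(record.get("age", ""))
        if not 18 <= age <= 120:
            errors.append("age must be between 18 and 120")
    except (TypeError, ValueError):
        errors.append("age must be a whole number")

    # Weight must be a finite, positive number.
    try:
        weight = float(record.get("weight", ""))
        if not (math.isfinite(weight) and weight > 0):
            errors.append("weight must be a positive number")
    except (TypeError, ValueError):
        errors.append("weight must be a positive number")

    # Email must look like name@domain.tld.
    if not EMAIL_PATTERN.match(str(record.get("email", ""))):
        errors.append("email is not well formed")

    return errors
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid field at once, which is how form validation usually behaves.
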
237
238
239
#### Data Wrangling and PowerBI Analysis
240
241
This Medallion pipeline ingests raw breast cancer patient data (bronze), cleans and validates it (silver), and enriches it with analytical features (gold) in Fabric's Lakehouse. The process includes data type conversion, email validation, weight normalization, and risk categorization. Power BI connects via Direct Lake for real-time visualization of age groups, cancer stages, and geographic distributions. For long-term storage, gold data is exported to AWS S3 in Parquet format. The pipeline enables specialists to identify patterns and generate personalized prevention advice based on historical correlations between patient demographics and cancer progression. Microsoft Fabric streamlines this workflow with integrated Spark processing, Delta Lake storage, and Power BI analytics in one platform.
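
The bronze, silver, and gold steps can be illustrated with a small plain-Python sketch. It is a stand-in for the PySpark notebooks, not a copy of them. The column names, the rounding used for weight normalization, the hashed `patient_key`, the age-group buckets, and the stage-based risk rule are all assumptions. The sample rows are made up.

```python
import csv
import hashlib
import io


def to_silver(bronze_rows):
    """Convert types, validate email, and pseudonymize; skip unusable rows."""
    silver = []
    for row in bronze_rows:
        try:
            age = int(row["age"])
            weight = round(float(row["weight"]), 1)  # assumed normalization: kg, 1 decimal
            stage = int(row["cancer_stage"])
        except (KeyError, TypeError, ValueError):
            continue
        email = (row.get("email") or "").strip().lower()
        if "@" not in email:
            continue
        silver.append({
            # The email is replaced by a hash so the raw address is not carried forward.
            "patient_key": hashlib.sha256(email.encode("utf-8")).hexdigest()[:16],
            "location": (row.get("location") or "").strip(),
            "age": age,
            "weight_kg": weight,
            "cancer_stage": stage,
        })
    return silver


def to_gold(silver_rows):
    """Add analytical features: a ten-year age group and a coarse risk category."""
    gold = []
    for row in silver_rows:
        decade = (row["age"] // 10) * 10
        gold.append({
            **row,
            "age_group": f"{decade}-{decade + 9}",
            # Assumed rule for illustration: stage 3 or above is "high".
            "risk_category": "high" if row["cancer_stage"] >= 3 else "standard",
        })
    return gold


# Made-up sample rows standing in for breast_cancer_patients.csv.
RAW_CSV = """location,age,cancer_stage,weight,email
CityA,45,2,70.46,a@example.org
CityB,abc,1,60,b@example.org
CityC,63,3,82,c@example.org
"""

bronze = list(csv.DictReader(io.StringIO(RAW_CSV)))
gold = to_gold(to_silver(bronze))
```

The second sample row has a non-numeric age, so it is dropped at the silver step and only two rows reach gold.
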
242
243
<p align="center">
244
  <img src="output/x6.png" alt="Diagram 5">
245
  <br>
246
  <em>Figure 6: MS Fabric - Breast_Cancer_LakeHouse</em>
247
</p>
248
249
<p align="center">
250
  <img src="output/x7.png" alt="Diagram 5">
251
  <br>
252
  <em>Figure 7: MS Fabric - Breast_Cancer_LakeHouse</em>
253
</p>
254
255
<p align="center">
256
  <img src="output/x8.png" alt="Diagram 5">
257
  <br>
258
  <em>Figure 8: MS Fabric - Breast_Cancer_LakeHouse</em>
259
</p>
260
261
<p align="center">
262
  <img src="output/x9.png" alt="Diagram 5">
263
  <br>
264
  <em>Figure 9: MS Fabric - Breast_Cancer_LakeHouse</em>
265
</p>
266
267
268
### Breast Cancer Patient Analytics Report  
269
270
*Generated from Lakehouse Pipeline – {{date}}*  
271
272
#### 1. Key Demographics  
273
274
- **Total Patients:** `{{gold_df.count()}}`  
275
- **Average Age:** `{{stage_analysis_df.select(avg("avg_age")).first()[0]}}` years  
276
- **Weight Distribution:**  
277
278
  - Mean: `{{stage_analysis_df.select(avg("avg_weight")).first()[0]}}` kg  
279
  - Std Dev: `{{stage_analysis_df.select(stddev("avg_weight")).first()[0]}}` kg  
280
281
#### 2. Stage Distribution  
282
283
| Stage | Patients (%) | Avg Age | Top Location |  
284
|-------|-------------|---------|--------------|  
285
| 1     | `{{count_stage1/total*100}}`% | `{{age_stage1}}` | `{{top_loc_stage1}}` |  
286
| 2     | `{{count_stage2/total*100}}`% | `{{age_stage2}}` | `{{top_loc_stage2}}` |  
287
| ...   | ...         | ...     | ...          |  
288
289
**Insight:** Early-stage (1-2) diagnoses are most prevalent in `{{top_location}}`.  
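
The per-stage values in the table above could be computed along these lines. This is a plain-Python sketch, not the PySpark used in the Fabric notebooks. The field names `cancer_stage`, `age`, and `location` and the function `stage_distribution` are assumptions.

```python
from collections import Counter, defaultdict


def stage_distribution(rows):
    """Summarize share of patients, average age, and top location per stage."""
    by_stage = defaultdict(list)
    for row in rows:
        by_stage[row["cancer_stage"]].append(row)

    total = len(rows)
    summary = {}
    for stage, members in sorted(by_stage.items()):
        # most_common(1) returns the single most frequent location and its count.
        top_location, _ = Counter(m["location"] for m in members).most_common(1)[0]
        summary[stage] = {
            "patients_pct": round(100 * len(members) / total, 1),
            "avg_age": round(sum(m["age"] for m in members) / len(members), 1),
            "top_location": top_location,
        }
    return summary
```

Each entry of the returned dictionary corresponds to one row of the stage table.
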
290
291
#### 3. Temporal Trends  
292
293
![Diagnosis Over Time]  
294
- **Peak Diagnoses:** `{{year_with_max_cases}}`  
295
- **Recent Change:** `{{last_3_years_trend}}` (↑/↓)  
296
297
#### 4. AI-Generated Prevention Insights  
298
299
**For Stage {{X}} Patients (Age {{Y}}):**  
300
> "Patients at this stage should prioritize {{GPT-4_advice}}..."
301
302
#### 5. Data Quality Notes  
303
304
- **Complete Records:** `{{valid_records/total*100}}`%  
305
- **Missing Data:**  
306
  - `{{null_cancer_stage}}` missing stage labels  
307
  - `{{null_weight}}` missing weight entries  
308
309
### Methodology  
310
311
- **Data Source:** `breast_cancer_patients.csv`  
312
- **Pipeline:**  
313
  - **Bronze:** Raw ingestion  
314
  - **Silver:** PII pseudonymization + cleaning  
315
  - **Gold:** Analytics + AI enrichment (GPT-4)  
316
- **Tools:** Microsoft Fabric, Power BI, PySpark  
317
318
319
### Key Components
320
321
### Data Storage
322
- **BioEventHouse**: Eventhouse and KQL Database for genomic event data  
323
- **Genomel_H**: Lakehouse for genomic data with semantic model and SQL analytics  
324
325
### Analysis Tools
326
- Multiple Jupyter notebooks for various genomic analysis tasks  
327
- Experiments tracking for machine learning workflows  
328
329
The genomic machine learning pipeline uses MLflow for model management and reproducibility. Key aspects include:
330
331
#### Pipeline Structure
332
333
**Artifact Tracking:** Model files (model.pkl), environment specifications (conda.yaml, python_env.yaml), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.

**Runtime Metrics:** Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.
336
337
<p align="center">
338
  <img src="output/x16.png" alt="Diagram 5">
339
  <br>
340
  <em>Figure 16: MS Fabric - GenomicAnalysisWorkspace</em>
341
</p>
342
343
<p align="center">
344
  <img src="output/x17.png" alt="Diagram 5">
345
  <br>
346
  <em>Figure 17: MS Fabric - GenomicAnalysisWorkspace</em>
347
</p>
348
349
350
#### MLflow Integration
351
352
The MLmodel file defines metadata for model deployment, including:
353
354
**Dependencies:** Conda/virtualenv environments to replicate training conditions.
355
356
**Model Specifications:** Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.
357
358
**Version Control:** Explicit library versions (sklearn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.
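
Because the input schema is fixed at 21 float64 features, a caller may want to check rows before sending them to the deployed model. The sketch below does that in plain Python and does not call MLflow. The helper name `check_input_row` is hypothetical, and only the feature count and float type come from the schema described above.

```python
EXPECTED_FEATURES = 21  # feature count from the input schema described above


def check_input_row(row):
    """Coerce one feature row to floats and verify its length against the schema."""
    values = [float(v) for v in row]  # raises ValueError on non-numeric entries
    if len(values) != EXPECTED_FEATURES:
        raise ValueError(
            f"expected {EXPECTED_FEATURES} features, got {len(values)}"
        )
    return values
```

A check like this surfaces a shape mismatch as a clear message on the caller's side, before the request reaches the model.
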
359
360
#### Workflow Efficiency
361
362
Unique run_id and experiment IDs enable traceability across genomic analyses.
363
364
**Implications:** This setup ensures reproducibility (via environment isolation), scalability (through MLflow’s tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.
365
366
<p align="center">
367
  <img src="output/x12.png" alt="Diagram 5">
368
  <br>
369
  <em>Figure 12: MS Fabric - Genomic Analysis Pipeline</em>
370
</p>
371
372
<p align="center">
373
  <img src="output/x13.png" alt="Diagram 5">
374
  <br>
375
  <em>Figure 13: MS Fabric - Genomic Analysis Pipeline</em>
376
</p>
377
378
<p align="center">
379
  <img src="output/x14.png" alt="Diagram 5">
380
  <br>
381
  <em>Figure 14: MS Fabric - Genomic Analysis Pipeline</em>
382
</p>
383
384
<p align="center">
385
  <img src="output/x15.png" alt="Diagram 5">
386
  <br>
387
  <em>Figure 15: MS Fabric - Genomic Analysis Pipeline</em>
388
</p>
389
390
391
### Machine Learning
392
- **BiospecimenClassifier**: ML model for biospecimen classification  
393
- Model deployment experiments  
394
395
### Reporting
396
- **Biospecimen_RAG_System**: Retrieval-Augmented Generation system  
397
- **Biospecimen_Report_Generator**: Automated report generation  
398
399
### Setup Instructions
400
401
#### 1. Prerequisites
402
- **Azure Account**: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
403
- **Python 3.8+**: Install Python and required libraries.
404
- **Power BI Desktop**: For creating visualizations.
405
- **Microsoft Fabric Workspace**: With contributor permissions.
406
- **Genomic Datasets**: Access to required genomic data sources.
407
408
#### 2. Install Dependencies
409
Install the required Python libraries; the backend's dependencies are listed in `backend/requirements.txt`.
410
411
#### 3. Configure Azure Resources

##### Microsoft Fabric

Create a Fabric workspace and set up OneLake.

##### Azure OpenAI

Set up an Azure OpenAI resource and deploy a GPT-4 model.
419
420
#### 4. Update Configuration

Replace placeholders (e.g., `<api_key>`, `<connection_string>`) in the code with your Azure resource details.
422
423
### Visualization
424
Use Power BI to create interactive dashboards for visualizing:
425
426
- Molecular phenotyping profiles.
- Top biomarkers for disease outcomes.
- Trends in gene expression.
431
432
### Contributing
433
Contributions are welcome! Please follow these steps:

1. Fork the repository.

**Repository:** [https://github.com/danielmuthama23/Genomic_Analysis.git](https://github.com/danielmuthama23/Genomic_Analysis.git)  
438
439
440
### Summary
441
442
#### 1. Genomic Analysis Report: Mutation-Disease Association Detection
443
444
This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.  
445
446
**Analysis Methods:**  
447
448
- Mutation frequency analysis of key cancer genes  
449
- Protein-protein interaction networks to identify functional clusters  
450
- Metabolic pathway mapping to detect dysregulated processes  
451
452
**Key Datasets:**  
453
454
- `PDC_biospecimen_manifest_03272025_214257.csv`  
455
- Embedded mock genomic data for test and validation  
456
457
---
458
459
### 2. Key Findings  
460
461
#### 2.1 Mutation-Disease Associations 
462
463
![Mutation Counts]
464
465
**Top Pathogenic Mutations:**  
466
467
| Gene    | Mutation Count | Disease-Associated | Percentage |  
468
|---------|---------------|--------------------|------------|  
469
| TP53    | 8             | 8                  | 100%       |  
470
| PIK3CA  | 5             | 5                  | 100%       |  
471
| BRCA1   | 4             | 4                  | 100%       |  
472
473
**Insights:**  
474
475
- All observed **TP53 mutations** were disease-linked (8 of 8), consistent with a role as a primary driver.  
- **PIK3CA** and **BRCA1** mutations also showed strong disease associations.  
477
478
---
479
480
<p align="center">
481
  <img src="output/mutation_disease_counts.png" alt="Mutation counts by gene with disease associations">
482
  <br>
483
  <em>Figure 19: MS Fabric - Mutation</em>
484
</p>
485
486
487
#### 2.2 Protein Interaction Network  
488
489
![Protein Network]
490
491
**Critical Hubs (High Connectivity):**  
492
493
1. **TP53** (4 interactions)  
494
2. **BRCA1** (3 interactions)  
495
3. **PIK3CA** (3 interactions)  
496
497
**Key Observations:**  
498
499
- Red nodes (PDC-identified proteins) formed central hubs.  
500
- Green edges (activation) dominated oncogenic pathways (e.g., PIK3CA→AKT1).  
501
502
---
503
504
<p align="center">
505
  <img src="output/protein_network.png" alt="Protein interaction network">
506
  <br>
507
  <em>Figure 21: MS Fabric - Protein Interaction Network</em>
508
</p>
509
510
#### 2.3 Metabolic Pathway Dysregulation  
511
512
![Metabolic Pathways]  
513
514
**Most Dysregulated Pathways:**  
515
516
1. **Glycolysis** (↑ Glucose-6-P, Fructose-1,6-BP)  
517
2. **TCA Cycle** (↓ Succinyl-CoA, ↑ Acetyl-CoA)  
518
3. **Fatty Acid Synthesis** (↑ Malonyl-CoA)  
519
520
**Top Dysregulated Metabolite:**  
521
522
- **Acetyl-CoA** (2.1-fold change, linked to PTEN mutations).  
523
524
---
525
526
<p align="center">
527
  <img src="output/metabolic_pathways.png" alt="Dysregulated metabolic pathways">
528
  <br>
529
  <em>Figure 20: MS Fabric - Metabolic Pathway</em>
530
</p>
531
532
### 3. Disease Detection Methodology  
533
534
#### 3.1 Mutation-Based Detection 
535
536
- **Thresholds:** Genes with >70% disease-associated mutations flagged as high-risk.  
537
- **Validation:** Cross-referenced with COSMIC database.  
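
The 70% threshold can be applied to the counts from the "Top Pathogenic Mutations" table as follows. This is a minimal sketch: the counts are taken from that table, while the function name `flag_high_risk` and the tuple layout are assumptions.

```python
# (gene, mutation_count, disease_associated) rows from the
# "Top Pathogenic Mutations" table.
MUTATION_COUNTS = [("TP53", 8, 8), ("PIK3CA", 5, 5), ("BRCA1", 4, 4)]
HIGH_RISK_THRESHOLD = 0.70


def flag_high_risk(rows, threshold=HIGH_RISK_THRESHOLD):
    """Return genes whose disease-associated share is strictly above the threshold."""
    flagged = []
    for gene, total, associated in rows:
        if total > 0 and associated / total > threshold:
            flagged.append(gene)
    return flagged


print(flag_high_risk(MUTATION_COUNTS))  # ['TP53', 'PIK3CA', 'BRCA1']
```

All three genes are flagged because each has a 100% disease-associated share. A gene at exactly 70% would not be flagged, since the rule is "greater than 70%".
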
538
539
#### 3.2 Network Analysis  
540
541
- Prioritized **hub genes** (e.g., TP53) as biomarkers.  
542
- **Inhibition edges** (red) highlighted drug targets (e.g., PTEN→AKT1).  
543
544
#### 3.3 Metabolic Insights  
545
546
- Glycolysis/TCA cycle disruptions correlated with TP53/PIK3CA mutations.  
547
- High Acetyl-CoA suggests vulnerability to metabolic inhibitors.  
548
549
---
550
551
### 4. Conclusions & Recommendations  
552
553
From the analysis, we can conclude the following:
554
555
**Diagnostic Markers:**  
556
557
- **TP53 mutations** as universal biomarkers.  
558
- **PIK3CA activation** signals aggressive subtypes.  
559
560
**Therapeutic Targets:**  
561
562
- Target **PIK3CA-AKT1 interactions**.  
563
- Explore **metabolic inhibitors** for Acetyl-CoA-overproducing tumors.  
564
565
**Future Work:**  
566
567
- Validate with clinical outcomes data.  
568
- Expand analysis to RNA-seq.  
569
570
---
571
572
### 5. Files Generated  
573
574
| File                          | Description                                  |  
575
|-------------------------------|----------------------------------------------|  
576
| `mutation_disease_counts.png` | Top mutated genes with disease associations. |  
577
| `protein_network.png`         | Protein interaction network with PDC hubs.   |  
578
| `metabolic_pathways.png`      | Dysregulated metabolic pathways.             |  
579
580
581
---
582
583
**Prepared by:** Daniel Muthama  
**Date:** April 2, 2025  
**Contact:** [danielmuthama23@gmail.com](mailto:danielmuthama23@gmail.com)  
586
587
588
---
589
590
### How to Use This Report  
591
592
- **Clinicians:** Focus on TP53/PIK3CA status for patient stratification.  
593
- **Researchers:** Explore metabolic pathways for novel drug combinations.  
594
- **Data Teams:** Replicate pipeline using `DataEngineering.tex`.  
595
596
### License
597
598
This project is licensed under the MIT License. See the LICENSE file for details.
599
600
### Contact
601
602
For questions or feedback, please contact Daniel Muthama at danielmuthama23@gmail.com.
603
604
#### Acknowledgments
605
606
- Microsoft Fabric for data orchestration.
- Azure AI Search for retrieval.
- Azure OpenAI for natural language generation.