-a/README.md
+b/README.md
 ### GenoHos: AI-Powered Genomic Analysis with Molecular Phenotyping and RAG Chat Interface
 With the rapid advancement of multi-omics technologies, healthcare institutions now generate vast amounts of genomic, proteomic, and metabolomic data. While these datasets hold the key to personalized medicine, they typically reside in disconnected systems - from sequencing machines to EHRs to research databases. Traditionally, integrating this data required teams of bioinformaticians to build complex analysis pipelines, slowing down critical treatment decisions.
 Leveraging mutation data has led to the discovery of novel disease variants. For example, a breast cancer patient management system serves as a platform for physicians to discover and analyze novel disease variants by integrating clinical and genomic data. By systematically recording patient demographics, treatment outcomes, and mutation profiles, the system enables:
 **Pattern Recognition** - Identifies recurring mutation clusters across specific age groups, ethnicities, and geographic locations, revealing potential founder mutations or environmental risk factors.
 **Variant Alert System** - Automatically flags novel variants and compares them against global databases (COSMIC, ClinVar), highlighting mutations with predicted clinical significance.
 **Treatment Response Analysis** - Correlates specific mutations with drug efficacy, helping physicians identify biomarkers for treatment resistance or sensitivity.
 **Collaborative Research** - Facilitates secure data sharing across institutions, creating a crowdsourced knowledge base for rare variants and atypical presentations.
 **Predictive Modeling** - Uses accumulated data to forecast disease progression patterns and suggest personalized therapeutic approaches based on mutation profiles.
 The system transforms routine clinical documentation into a dynamic discovery tool, where each new patient record contributes to our understanding of breast cancer heterogeneity. Physicians gain real-time insights into how specific mutations influence:
 . Metastatic patterns
 . Disease progression timelines
 . Survival outcomes
 . Therapeutic vulnerabilities
 By making these correlations visible at the point of care, the platform accelerates the identification of novel disease variants and enables more precise, personalized treatment strategies - bridging the gap between genomic research and clinical practice.
 #### Microsoft Fabric revolutionizes this process by providing:
 - Unified Data Lakehouse for harmonizing sequencing data, clinical records, and research repositories
 - Low-Code Transformations to clean and standardize omics data without extensive coding
 - Built-In ML Capabilities for running predictive models directly on Fabric notebooks
 #### Members/Contributors
 . Daniel Muthama (ML and Backend)
 . Eunice Nduku (Data)
 . Daniel Muruthi (Frontend)
 ### Overview
 This AI-powered oncology platform analyzes integrated genomic, proteomic, and metabolomic data to predict disease outcomes and generate personalized treatment recommendations. The system utilizes Microsoft Fabric for multimodal data orchestration, Azure AI Search for biomedical evidence retrieval, and Azure OpenAI for clinical insights generation. Specifically designed for breast cancer research, it identifies pathogenic mutation patterns, detects clinically significant genomic variants, and synthesizes comprehensive reports highlighting therapeutic implications derived from multi-omics analysis.
 #### Key improvements:
 **Precision in Terminology:** Changed "disease recovery" to "disease outcomes" (more clinically accurate)
 **Oncology Focus:** Added "specifically designed for breast cancer research"
 **Technical Clarity:** Specified "pathogenic mutation patterns" and "clinically significant genomic variants"
 **Flow:** Improved logical progression from data types → analysis → clinical outputs
 **Professional Tone:** Used terms like "therapeutic implications" and "multi-omics analysis"
-### Project Flow
-<p align="center">
+#### Services
-  <img src="output/x.jpg" alt="High-Level Architecture Diagram" width="1000">
-  <br>
+. ML Model - Training/ Classification/ Fine-tuning
-  <em>Figure 1: High-level architecture of the bioscience platform</em>
+. RAG System - Bioscience RAG system
-</p>
+. Data Collector - breast_cancer_recorder
+. Visualizations
-#### Services
+### Project Structure
-. ML Model - Training/ Classification/ Fine-tuning
-. RAG System - Bioscince RAG system
+    GenomicAnalysisWorkspace/
-. Data Collector - breast_cancer_recorder
+    │
-. Visualizations
+    ├── BioEventHouse/                     # Eventhouse and KQL Database for genomic events
+    │   ├── (Eventhouse data)
-### Project Structure
+    │   └── (KQL Database)
+    │
-    GenomicAnalysisWorkspace/
+    ├── BioEventHouse_queryset/            # KQL Queryset for querying genomic events
     │
-    ├── BioEventHouse/                     # Eventhouse and KQL Database for genomic events
+    ├── Biospecimen_RAG_System/            # Notebook for biospecimen RAG (Retrieval-Augmented Generation) system
-    │   ├── (Eventhouse data)
+    │
-    │   └── (KQL Database)
+    ├── Biospecimen_Report_Generator/      # Notebook for generating biospecimen reports
     │
-    ├── BioEventHouse_queryset/            # KQL Queryset for querying genomic events
+    ├── BiospecimenClassifier/             # Machine learning model for biospecimen classification
     │
-    ├── Biospecimen_RAG_System/            # Notebook for biospecimen RAG (Retrieval-Augmented Generation) system
+    ├── Data_Engineering/                  # Notebook for data engineering tasks
     │
-    ├── Biospecimen_Report_Generator/      # Notebook for generating biospecimen reports
+    ├── Genomel_H/                         # Lakehouse for genomic data
-    │
+    │   ├── (Lakehouse data)
-    ├── BiospecimenClassifier/             # Machine learning model for biospecimen classification
+    │   ├── Semantic model
-    │
+    │   └── SQL analytics endpoint
-    ├── Data_Engineering/                  # Notebook for data engineering tasks
+    │
-    │
+    ├── GenomicAnalysisPipeline/           # Notebook and experiment for genomic analysis
-    ├── Genomel_H/                         # Lakehouse for genomic data
+    │   ├── (Notebook)
-    │   ├── (Lakehouse data)
+    │   └── (Experiment)
-    │   ├── Semantic model
+    │
-    │   └── SQL analytics endpoint
+    ├── GenomicDataProcessing/             # Notebook for genomic data processing
     │
-    ├── GenomicAnalysisPipeline/           # Notebook and experiment for genomic analysis
+    └── model_deployment/                  # Notebook and experiment for model deployment
-    │   ├── (Notebook)
+        ├── (Notebook)
-    │   └── (Experiment)
+        └── (Experiment)
-    │
-    ├── GenomicDataProcessing/             # Notebook for genomic data processing
-    │
-    └── model_deployment/                  # Notebook and experiment for model deployment
-        ├── (Notebook)
+##### Downloaded File: "breast_cancer_patients.csv"
-        └── (Experiment)
+#### File Structure
-<p align="center">
-  <img src="output/x4.png" alt="Diagram 4">
+    breast-cancer-rag-app/
-  <br>
+    ├── backend/
-  <em>Figure 4: MS Fabric - GenomicAnalysisWorkspace</em>
+    │   ├── __pycache__/          # Python cached bytecode
-</p>
+    │   ├── venv/                 # Python virtual environment
+    │   ├── .env                  # Environment variables
-<p align="center">
+    │   ├── biospecimen_rag.py    # Core RAG implementation
-  <img src="output/x5.png" alt="Diagram 5">
+    │   ├── Dockerfile            # Backend container configuration
-  <br>
+    │   ├── main.py               # FastAPI entry point
-  <em>Figure 5: MS Fabric - GenomicAnalysisWorkspace2</em>
+    │   └── requirements.txt      # Python dependencies
-</p>
+    │
+    └── frontend/
+        ├── node_modules/         # NPM packages
-#### Breast Cancer Insight Analysis
+        ├── public/               # Static assets
-<p align="center">
+        ├── src/
-  <img src="output/x2.png" alt="Diagram 1" width="1000">
+        │   ├── components/
-  <br>
+        │   │   ├── QueryInterface.js  # Main query component
-  <em>Figure 2: Breast Cancer Recorder</em>
+        │   │   ├── ResultsViewer.js   # Results display
-</p>
+        │   │   └── StatusIndicator.js # System status UI
+        │   ├── services/
-<p align="center">
+        │   │   └── apiService.js      # API communication
-  <img src="output/x3.png" alt="Diagram 2">
+        │   ├── App.js           # Root React component
-  <br>
+        │   ├── index.js         # React entry point
-  <em>Figure 3: Breast Cancer Recorder with Records</em>
+        │   ├── reportWebVitals.js # Performance tracking
-</p>
+        │   ├── styles.css       # Global styles
+        │   └── .gitignore       # Frontend ignore rules
-##### Downloaded File: "breast_cancer_patients.csv"
+        ├── Dockerfile           # Frontend container config
+        ├── package-lock.json    # Exact dependency tree
-#### File Structure
+        └── package.json         # Project metadata (implied)
-    breast-cancer-rag-app/
-    ├── backend/
-    │   ├── __pycache__/          # Python cached bytecode
-    │   ├── venv/                 # Python virtual environment
-    │   ├── .env                  # Environment variables
+##### For the backend service to work, update these environment variables:
-    │   ├── biospecimen_rag.py    # Core RAG implementation
-    │   ├── Dockerfile            # Backend container configuration
+            OPENAI_GPT4_DEPLOYMENT="gpt-4"
-    │   ├── main.py               # FastAPI entry point
+            OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
-    │   └── requirements.txt      # Python dependencies
+            OPENAI_API_KEY="<your-key>"
-    │
+            OPENAI_ADA_DEPLOYMENT="text-embedding-ada-002"
-    └── frontend/
-        ├── node_modules/         # NPM packages
+            KUSTO_URI="https://trd-zdxwqrcu1znbqygpxg.z2.kusto.fabric.microsoft.com"
-        ├── public/               # Static assets
+            KUSTO_DATABASE="BioEventHouse"
-        ├── src/
+            KUSTO_TABLE="biospecimen_embeddings"
-        │   ├── components/
-        │   │   ├── QueryInterface.js  # Main query component
+            AZURE_TENANT_ID="<tenant-id>"
-        │   │   ├── ResultsViewer.js   # Results display
+            AZURE_CLIENT_ID="<client-id>"
-        │   │   └── StatusIndicator.js # System status UI
+            AZURE_CLIENT_SECRET="<client-secret>"
-        │   ├── services/
-        │   │   └── apiService.js      # API communication
+            APP_INSIGHTS_KEY=your-instrumentation-key
-        │   ├── App.js           # Root React component
+            SECRET_KEY=your-secret-key-for-flask
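A minimal sketch of how the backend could fail fast when one of the variables above is unset. The variable names come from the list above; the function name `load_settings` and the choice of which variables count as required are assumptions of this sketch, not something the repository confirms.

```python
import os

# Names taken from the variable list above. Treating all of them as
# required is an assumption of this sketch.
REQUIRED_VARS = [
    "OPENAI_GPT4_DEPLOYMENT",
    "OPENAI_ENDPOINT",
    "OPENAI_API_KEY",
    "OPENAI_ADA_DEPLOYMENT",
    "KUSTO_URI",
    "KUSTO_DATABASE",
    "KUSTO_TABLE",
    "AZURE_TENANT_ID",
    "AZURE_CLIENT_ID",
    "AZURE_CLIENT_SECRET",
]


def load_settings(env=None):
    """Return the required settings as a dict, or raise if any are unset."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_settings()` once at startup would surface a missing key immediately instead of at the first request that needs it.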
-        │   ├── index.js         # React entry point
-        │   ├── reportWebVitals.js # Performance tracking
+#### File Structure
-        │   ├── styles.css       # Global styles
-        │   └── .gitignore       # Frontend ignore rules
-        ├── Dockerfile           # Frontend container config
+    breast-cancer-recorder/
-        ├── package-lock.json    # Exact dependency tree
+    ├── build/
-        └── package.json         # Project metadata (implied)
+    ├── static/
+    ├── asset-manifest.json
-<p align="center">
+    ├── index.html
-  <img src="output/x10.png" alt="Diagram 5">
+    ├── manifest.json
-  <br>
+    ├── robots.txt
-  <em>Figure 10: MS Fabric - Frontend</em>
+    ├── node_modules/
-</p>
+    ├── public/
+    │   ├── index.html
-<p align="center">
+    │   ├── manifest.json
-  <img src="output/x11.png" alt="Diagram 5">
+    │   ├── robots.txt
-  <br>
+    │   └── reports/                      # New directory for research reports
-  <em>Figure 11: MS Fabric - Backend</em>
+    │       └── breast_cancer_report.md   # Added markdown report
-</p>
+    ├── src/
+    │   ├── components/
-<p align="center">
+    │   │   ├── DataTable.js
-  <img src="output/x18.png" alt="Diagram 5">
+    │   │   └── PatientForm.js
-  <br>
+    │   ├── utils/
-  <em>Figure 18: MS Fabric - Frontend for a Backend(service working)</em>
+    │   ├── App.js
-</p>
+    │   ├── index.js
+    │   ├── reportWebVitals.js
-##### For Backend service to work need to update: -
+    │   ├── styles.css
+    │   └── report-components/            # New directory for report components
-            OPENAI_GPT4_DEPLOYMENT="gpt-4"
+    │       ├── ReportViewer.js           # Component to display the markdown
-            OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com"
+    │       └── ReportGenerator.js        # Component to generate dynamic reports
-            OPENAI_API_KEY="<your-key>"
+    ├── .env
-            OPENAI_ADA_DEPLOYMENT="text-embedding-ada-002"
+    ├── .gitignore
+    ├── package.json
-            KUSTO_URI="https://trd-zdxwqrcu1znbqygpxg.z2.kusto.fabric.microsoft.com"
+    └── vercel.json
-            KUSTO_DATABASE="BioEventHouse"
-            KUSTO_TABLE="biospecimen_embeddings"
+The above React application collects breast cancer patient data including location, age, cancer stage, weight, and email. It features form validation to ensure accurate data entry, with age restricted to 18-120 years, weight validated as a positive number, and proper email formatting. Collected data can be downloaded as CSV for analysis. The app includes error handling and success notifications. The clean interface prioritizes usability while maintaining data integrity, making it suitable for medical professionals to record and manage patient information efficiently.
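The validation rules described above can be sketched in Python, for example to re-check the downloaded CSV on the backend. This is an illustration only: the form itself is a React component whose code is not shown here, and the field names (`age`, `weight`, `email`), the function name, and the email pattern are all assumptions.

```python
import re

# Deliberately simple pattern: something@something.tld. The app's actual
# email check is not shown in the README, so this regex is an assumption.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_patient(record):
    """Return a list of validation errors for one patient record (empty if valid)."""
    errors = []
    age = record.get("age")
    if not isinstance(age, (int, float)) or not 18 <= age <= 120:
        errors.append("age must be between 18 and 120")
    weight = record.get("weight")
    if not isinstance(weight, (int, float)) or weight <= 0:
        errors.append("weight must be a positive number")
    email = record.get("email", "")
    if not isinstance(email, str) or not EMAIL_RE.match(email):
        errors.append("email is not well formed")
    return errors
```

Returning a list of messages rather than raising on the first failure lets a caller report every problem in a record at once.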
-            AZURE_TENANT_ID="<tenant-id>"
-            AZURE_CLIENT_ID="<client-id>"
+#### Data Wrangling and Power BI Analysis
-            AZURE_CLIENT_SECRET="<client-secret>"
+This Medallion pipeline ingests raw breast cancer patient data (bronze), cleans and validates it (silver), and enriches it with analytical features (gold) in Fabric's Lakehouse. The process includes data type conversion, email validation, weight normalization, and risk categorization. Power BI connects via Direct Lake for real-time visualization of age groups, cancer stages, and geographic distributions. For long-term storage, gold data is exported to AWS S3 in Parquet format. The pipeline enables specialists to identify patterns and generate personalized prevention advice based on historical correlations between patient demographics and cancer progression. Microsoft Fabric streamlines this workflow with integrated Spark processing, Delta Lake storage, and Power BI analytics in one platform.
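The bronze, silver, and gold steps can be illustrated with a plain-Python stand-in. The actual pipeline runs on Spark in Fabric and its code is not shown here, so the function names, the column names (`age`, `weight`), and the 40/60 age cut-offs below are hypothetical.

```python
def to_silver(bronze_rows):
    """Keep rows with a parseable age and a positive weight; convert types."""
    silver = []
    for row in bronze_rows:
        try:
            age = int(row["age"])
            weight = float(row["weight"])
        except (KeyError, TypeError, ValueError):
            continue  # drop rows that cannot be converted
        if weight > 0:
            silver.append({**row, "age": age, "weight": weight})
    return silver


def to_gold(silver_rows):
    """Add an age_group feature; the 40/60 cut-offs are hypothetical."""
    gold = []
    for row in silver_rows:
        if row["age"] < 40:
            group = "under_40"
        elif row["age"] < 60:
            group = "40_to_59"
        else:
            group = "60_plus"
        gold.append({**row, "age_group": group})
    return gold
```

Keeping each layer as a separate function mirrors the Medallion idea that every stage reads only the output of the previous one.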
-            APP_INSIGHTS_KEY=your-instrumentation-key
-            SECRET_KEY=your-secret-key-for-flask
-#### File Structure
+### Breast Cancer Patient Analytics Report
-    breast-cancer-recorder/
-    ├── build/
+*Generated from Lakehouse Pipeline – {{date}}*
-    ├── static/
-    ├── asset-manifest.json
+#### 1. Key Demographics
-    ├── index.html
-    ├── manifest.json
+- **Total Patients:** `{{gold_df.count()}}`
-    ├── robots.txt
+- **Average Age:** `{{stage_analysis_df.select(avg("avg_age")).first()[0]}}` years
-    ├── node_modules/
+- **Weight Distribution:**
-    ├── public/
-    │   ├── index.html
+  - Mean: `{{stage_analysis_df.select(avg("avg_weight")).first()[0]}}` kg
-    │   ├── manifest.json
+  - Std Dev: `{{stage_analysis_df.select(stddev("avg_weight")).first()[0]}}` kg
-    │   ├── robots.txt
-    │   └── reports/                      # New directory for research reports
+#### 2. Stage Distribution
-    │       └── breast_cancer_report.md   # Added markdown report
-    ├── src/
+| Stage | Patients (%) | Avg Age | Top Location |
-    │   ├── components/
+|-------|-------------|---------|--------------|
-    │   │   ├── DataTable.js
+| 1     | `{{count_stage1/total*100}}`% | `{{age_stage1}}` | `{{top_loc_stage1}}` |
-    │   │   └── PatientForm.js
+| 2     | `{{count_stage2/total*100}}`% | `{{age_stage2}}` | `{{top_loc_stage2}}` |
-    │   ├── utils/
+| ...   | ...         | ...     | ...          |
-    │   ├── App.js
-    │   ├── index.js
+**Insight:** Early-stage (1-2) diagnoses are most prevalent in `{{top_location}}`.
-    │   ├── reportWebVitals.js
-    │   ├── styles.css
+#### 3. Temporal Trends
-    │   └── report-components/            # New directory for report components
-    │       ├── ReportViewer.js           # Component to display the markdown
+![Diagnosis Over Time]
-    │       └── ReportGenerator.js        # Component to generate dynamic reports
+- **Peak Diagnoses:** `{{year_with_max_cases}}`
-    ├── .env
+- **Recent Change:** `{{last_3_years_trend}}` (↑/↓)
-    ├── .gitignore
-    ├── package.json
+#### 4. AI-Generated Prevention Insights
-    └── vercel.json
+**For Stage {{X}} Patients (Age {{Y}}):**
-The above React application collects breast cancer patient data including location, age, cancer stage, weight, and email. It features form validation to ensure accurate data entry, with age restricted to 18-120 years, weight validated as positive numbers, and proper email formatting. . Collected data can be downloaded as CSV for analysis. The app includes error handling and success notifications. The clean interface prioritizes usability while maintaining data integrity, making it suitable for medical professionals to record and manage patient information efficiently.
+> "Patients at this stage should prioritize {{GPT-4_advice}}..."
+#### 5. Data Quality Notes
-#### Data Wrangling and PowerBI Analysis
+- **Complete Records:** `{{valid_records/total*100}}`%
-This Medallion pipeline ingests raw breast cancer patient data (bronze), cleans/validates it (silver), and enriches with analytical features (gold) in Fabric's Lakehouse. The process includes data type conversion, email validation, weight normalization, and risk categorization. PowerBI connects via Direct Lake for real-time visualization of age groups, cancer stages, and geographic distributions. For long-term storage, gold data exports to AWS S3 in Parquet format. The pipeline enables specialists to identify patterns and generate personalized prevention advice based on historical correlations between patient demographics and cancer progression. Microsoft Fabric streamlines this workflow with integrated Spark processing, Delta Lake storage, and PowerBI analytics in one platform.
+- **Missing Data:**
+  - `{{null_cancer_stage}}` missing stage labels
-<p align="center">
+  - `{{null_weight}}` missing weight entries
-  <img src="output/x6.png" alt="Diagram 5">
-  <br>
+### Methodology
-  <em>Figure 6: MS Fabric - Breast_Cancer_LakeHouse</em>
-</p>
+- **Data Source:** `breast_cancer_patients.csv`
+- **Pipeline:**
-<p align="center">
+  - **Bronze:** Raw ingestion
-  <img src="output/x7.png" alt="Diagram 5">
+  - **Silver:** PII pseudonymization + cleaning
-  <br>
+  - **Gold:** Analytics + AI enrichment (GPT-4)
-  <em>Figure 7: MS Fabric - Breast_Cancer_LakeHouse</em>
+- **Tools:** Microsoft Fabric, Power BI, PySpark
-</p>
-<p align="center">
+### Key Components
-  <img src="output/x8.png" alt="Diagram 5">
-  <br>
+### Data Storage
-  <em>Figure 8: MS Fabric - Breast_Cancer_LakeHouse</em>
+- **BioEventHouse**: Eventhouse and KQL Database for genomic event data
-</p>
+- **Genomel_H**: Lakehouse for genomic data with semantic model and SQL analytics
-<p align="center">
+### Analysis Tools
-  <img src="output/x9.png" alt="Diagram 5">
+- Multiple Jupyter notebooks for various genomic analysis tasks
-  <br>
+- Experiment tracking for machine learning workflows
-  <em>Figure 9: MS Fabric - Breast_Cancer_LakeHouse</em>
-</p>
+The genomic machine learning pipeline uses MLflow for model management and reproducibility. Key aspects include:
+#### Pipeline Structure
-### Breast Cancer Patient Analytics Report
+**Artifact Tracking:** Model files (`model.pkl`), environment specifications (`conda.yaml`, `python_env.yaml`), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.
-*Generated from Lakehouse Pipeline – {{date}}*
+**Runtime Metrics:** Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.
-#### 1. Key Demographics
-- **Total Patients:** `{{gold_df.count()}}`
-- **Average Age:** `{{stage_analysis_df.select(avg("avg_age")).first()[0]}}` years
+#### MLflow Integration
-- **Weight Distribution:**
+The MLmodel file defines metadata for model deployment, including:
-  - Mean: `{{stage_analysis_df.select(avg("avg_weight")).first()[0]}}` kg
-  - Std Dev: `{{stage_analysis_df.select(stddev("avg_weight")).first()[0]}}` kg
+**Dependencies:** Conda/virtualenv environments to replicate training conditions.
-#### 2. Stage Distribution
+**Model Specifications:** Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.
-| Stage | Patients (%) | Avg Age | Top Location |
+**Version Control:** Explicit library versions (sklearn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.
-|-------|-------------|---------|--------------|
-| 1     | `{{count_stage1/total*100}}`% | `{{age_stage1}}` | `{{top_loc_stage1}}` |
+#### Workflow Efficiency
-| 2     | `{{count_stage2/total*100}}`% | `{{age_stage2}}` | `{{top_loc_stage2}}` |
-| ...   | ...         | ...     | ...          |
+Unique run_id and experiment IDs enable traceability across genomic analyses.
-**Insight:** Early-stage (1-2) diagnoses are most prevalent in `{{top_location}}`.
+**Implications:** This setup ensures reproducibility (via environment isolation), scalability (through MLflow’s tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.
-#### 3. Temporal Trends
-![Diagnosis Over Time]
-- **Peak Diagnoses:** `{{year_with_max_cases}}`
+### Machine Learning
-- **Recent Change:** `{{last_3_years_trend}}` (↑/↓)
+- **BiospecimenClassifier**: ML model for biospecimen classification
+- Model deployment experiments
-#### 4. AI-Generated Prevention Insights
+### Reporting
-**For Stage {{X}} Patients (Age {{Y}}):**
+- **Biospecimen_RAG_System**: Retrieval-Augmented Generation system
-> "Patients at this stage should prioritize {{GPT-4_advice}}..."
+- **Biospecimen_Report_Generator**: Automated report generation
-#### 5. Data Quality Notes
+### Setup Instructions
-- **Complete Records:** `{{valid_records/total*100}}`%
+#### 1. Prerequisites
-- **Missing Data:**
+- **Azure Account**: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
-  - `{{null_cancer_stage}}` missing stage labels
+- **Python 3.8+**: Install Python and required libraries.
-  - `{{null_weight}}` missing weight entries
+- **Power BI Desktop**: For creating visualizations.
+- **Microsoft Fabric Workspace**: With contributor permissions.
-### Methodology
+- **Genomic Datasets**: Access to required genomic data sources.
-- **Data Source:** `breast_cancer_patients.csv`
+#### 2. Install Dependencies
-- **Pipeline:**
+Install the required Python libraries:
-  - **Bronze:** Raw ingestion
-  - **Silver:** PII pseudonymization + cleaning
+#### 3. Configure Azure Resources
-  - **Gold:** Analytics + AI enrichment (GPT-4)
+##### Microsoft Fabric:
-- **Tools:** Microsoft Fabric, Power BI, PySpark
+Create a Fabric workspace and set up OneLake.
-### Key Components
+##### Azure OpenAI:
-### Data Storage
+Set up an OpenAI resource and deploy a GPT-4 model.
-- **BioEventHouse**: Eventhouse and KQL Database for genomic event data
-- **Genomel_H**: Lakehouse for genomic data with semantic model and SQL analytics
+#### 4. Update Configuration
+Replace placeholders (e.g., `<api_key>`, `<connection_string>`) in the code with your Azure resource details.
-### Analysis Tools
-- Multiple Jupyter notebooks for various genomic analysis tasks
+### Visualization
-- Experiments tracking for machine learning workflows
+Use Power BI to create interactive dashboards for visualizing:
-The provided observations outline a genomic machine learning pipeline leveraging MLflow for model management and reproducibility. Key aspects include:
+- Molecular phenotyping profiles.
-#### Pipeline Structure
+- Top biomarkers for disease recovery.
-Artifact Tracking: Model files (model.pkl), environment specifications (conda.yaml, python_env.yaml), and evaluation metrics (ROC curves, confusion matrices) are systematically logged, ensuring reproducibility.
+- Trends in gene expression.
-Runtime Metrics: Training metrics (accuracy, F1-score, recall) are tracked, emphasizing model performance validation for genomic data classification tasks.
+### Contributing
+Contributions are welcome! Please follow these steps:
-<p align="center">
-  <img src="output/x16.png" alt="Diagram 5">
+- Fork the repository.
-  <br>
-  <em>Figure 16: MS Fabric - GenomicAnalysisWorkspace</em>
+**Repository:** [https://github.com/danielmuthama23/Genomic_Analysis.git](https://github.com/danielmuthama23/Genomic_Analysis.git)
-</p>
-<p align="center">
+### Summary
-  <img src="output/x17.png" alt="Diagram 5">
-  <br>
+### 1. Genomic Analysis Report: Mutation-Disease Association Detection
-  <em>Figure 17: MS Fabric - GenomicAnalysisWorkspace</em>
-</p>
+This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.
+**Analysis Methods:**
-#### MLflow Integration
+- Mutation frequency analysis of key cancer genes
-The MLmodel file defines metadata for model deployment, including:
+- Protein-protein interaction networks to identify functional clusters
+- Metabolic pathway mapping to detect dysregulated processes
-**Dependencies:** Conda/virtualenv environments to replicate training conditions.
+**Key Datasets:**
-**Model Specifications:** Scikit-learn flavor with input (21 features as float64) and output (int64 labels) schemas, tailored for genomic datasets.
+- `PDC_biospecimen_manifest_03272025_214257.csv`
-**Version Control:** Explicit library versions (sklearn 1.2.2, MLflow 2.12.2) prevent dependency conflicts.
+- Embedded mock genomic data for testing and validation
-#### Workflow Efficiency
+---
-Unique run_id and experiment IDs enable traceability across genomic analyses.
+### 2. Key Findings
-**Implications:** This setup ensures reproducibility (via environment isolation), scalability (through MLflow’s tracking), and interpretability (via visualized metrics), addressing common challenges in genomic ML workflows. The focus on structured metadata and standardized evaluation aligns with best practices for translational bioinformatics.
+#### 2.1 Mutation-Disease Associations
-<p align="center">
+![Mutation Counts]
-  <img src="output/x12.png" alt="Diagram 5">
-  <br>
+**Top Pathogenic Mutations:**
-  <em>Figure 12: MS Fabric - Genomic Analysis Pipeline</em>
-</p>
+| Gene    | Mutation Count | Disease-Associated | Percentage |
+|---------|---------------|--------------------|------------|
-<p align="center">
+| TP53    | 8             | 8                  | 100%       |
-  <img src="output/x13.png" alt="Diagram 5">
+| PIK3CA  | 5             | 5                  | 100%       |
-  <br>
+| BRCA1   | 4             | 4                  | 100%       |
-  <em>Figure 13: MS Fabric - Genomic Analysis Pipeline</em>
-</p>
+**Insights:**
-<p align="center">
+- **TP53 mutations**: all 8 observed mutations were disease-associated (100%), consistent with its role as a primary driver.
-  <img src="output/x14.png" alt="Diagram 5">
+- **PIK3CA** (5 of 5) and **BRCA1** (4 of 4) mutations were likewise all disease-associated.
-  <br>
-  <em>Figure 14: MS Fabric - Genomic Analysi.s Pipeline</em>
+---
-</p>
-<p align="center">
-  <img src="output/x15.png" alt="Diagram 5">
+#### 2.2 Protein Interaction Network
-  <br>
-  <em>Figure 15: MS Fabric - Genomic Analysis Pipeline</em>
+![Protein Network]
-</p>
+**Critical Hubs (High Connectivity):**
-### Machine Learning
+. **TP53** (4 interactions)
-- **BiospecimenClassifier**: ML model for biospecimen classification
+. **BRCA1** (3 interactions)
-- Model deployment experiments
+. **PIK3CA** (3 interactions)
-### Reporting
+**Key Observations:**
-- **Biospecimen_RAG_System**: Retrieval-Augmented Generation system
-- **Biospecimen_Report_Generator**: Automated report generation
+- Red nodes (PDC-identified proteins) formed central hubs.
+- Green edges (activation) dominated oncogenic pathways (e.g., PIK3CA→AKT1).
-### Setup Instructions
+---
-#### 1. Prerequisites
-- **Azure Account**: Access to Microsoft Fabric, Azure AI Search, and Azure OpenAI.
-- **Python 3.8+**: Install Python and required libraries.
-- **Power BI Desktop**: For creating visualizations.
+#### 2.3 Metabolic Pathway Dysregulation
-- **Microsoft Fabric Workspace**: With contributor permissions.
-- **Genomic Datasets**: Access to required genomic data sources.
+![Metabolic Pathways]
-#### 2. Install Dependencies
+**Most Dysregulated Pathways:**
-Install the required Python libraries:
+. **Glycolysis** (↑ Glucose-6-P, Fructose-1,6-BP)
-### 3. Configure Azure Resources
+. **TCA Cycle** (↓ Succinyl-CoA, ↑ Acetyl-CoA)
-#### Microsoft Fabric:
+. **Fatty Acid Synthesis** (↑ Malonyl-CoA)
-Create a Fabric workspace and set up OneLake.
+**Top Dysregulated Metabolite:**
-#### Azure OpenAI:
+- **Acetyl-CoA** (2.1-fold change, linked to PTEN mutations).
-Set up an OpenAI resource and deploy a GPT-4 model.
+---
-### 4. Update Configuration
-Replace placeholders (e.g., <api_key>, <connection_string>) in the code with your Azure resource details.
+### 3. Disease Detection Methodology
-### Visualization
-Use Power BI to create interactive dashboards for visualizing:
+#### 3.1 Mutation-Based Detection
-Molecular phenotyping profiles.
+- **Thresholds:** Genes with >70% disease-associated mutations flagged as high-risk.
+- **Validation:** Cross-referenced with COSMIC database.
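The threshold rule above can be sketched as follows, using the counts from the table in section 2.1. The function and constant names are hypothetical, and the COSMIC cross-reference step is not reproduced here.

```python
# (gene, mutation_count, disease_associated_count) from the table in section 2.1
MUTATION_COUNTS = [
    ("TP53", 8, 8),
    ("PIK3CA", 5, 5),
    ("BRCA1", 4, 4),
]

HIGH_RISK_THRESHOLD = 0.70  # ">70% disease-associated" per section 3.1


def flag_high_risk(counts, threshold=HIGH_RISK_THRESHOLD):
    """Return genes whose disease-associated fraction exceeds the threshold."""
    flagged = []
    for gene, total, associated in counts:
        # Skip genes with no recorded mutations to avoid dividing by zero.
        if total > 0 and associated / total > threshold:
            flagged.append(gene)
    return flagged
```

With the table's counts, all three genes exceed the threshold; a gene at exactly 70% would not be flagged because the comparison is strict.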
-Top biomarkers for disease recovery.
+#### 3.2 Network Analysis
-Trends in gene expression.
+- Prioritized **hub genes** (e.g., TP53) as biomarkers.
-### Contributing
+- **Inhibition edges** (red) highlighted drug targets (e.g., PTEN→AKT1).
-Contributions are welcome! Please follow these steps:
+#### 3.3 Metabolic Insights
-### Fork the repository.
+- Glycolysis/TCA cycle disruptions correlated with TP53/PIK3CA mutations.
-**Repository:** [https://github.com/danielmuthama23/Genomic_Analysis.git](#)
+- High Acetyl-CoA suggests vulnerability to metabolic inhibitors.
+---
-### Summary
+### 4. Conclusions & Recommendations
-#### 1. Genomic Analysis Report: Mutation-Disease Association Detection
+From the analysis, we can conclude:
-This report summarizes findings from genomic data analysis, focusing on detecting disease associations through mutation patterns in breast cancer samples.
+**Diagnostic Markers:**
-**Analysis Methods:**
+- **TP53 mutations** as universal biomarkers.
-- Mutation frequency analysis of key cancer genes
+- **PIK3CA activation** signals aggressive subtypes.
-- Protein-protein interaction networks to identify functional clusters
-- Metabolic pathway mapping to detect dysregulated processes
+**Therapeutic Targets:**
-**Key Datasets:**
+- Target **PIK3CA-AKT1 interactions**.
+- Explore **metabolic inhibitors** for Acetyl-CoA-overproducing tumors.
-- `PDC_biospecimen_manifest_03272025_214257.csv`
-- Embedded mock genomic data for test and validation
+**Future Work:**
----
+- Validate with clinical outcomes data.
+- Expand analysis to RNA-seq.
-### 2. Key Findings
+---
-#### 2.1 Mutation-Disease Associations
+### 5. Files Generated
-![Mutation Counts]
+| File                          | Description                                  |
-**Top Pathogenic Mutations:**
+|-------------------------------|----------------------------------------------|
+| `mutation_disease_counts.png` | Top mutated genes with disease associations. |
-| Gene    | Mutation Count | Disease-Associated | Percentage |
+| `protein_network.png`         | Protein interaction network with PDC hubs.   |
-|---------|---------------|--------------------|------------|
+| `metabolic_pathways.png`      | Dysregulated metabolic pathways.             |
-| TP53    | 8             | 8                  | 100%       |
-| PIK3CA  | 5             | 5                  | 100%       |
-| BRCA1   | 4             | 4                  | 100%       |
+---
-**Insights:**
+**Prepared by:** Daniel Muthama
+**Date:** April 2, 2025
-- **TP53 mutations** were ubiquitous (100% disease-linked), indicating its role as a primary driver.
+**Contact:** [danielmuthama23@gmail.com](mailto:danielmuthama23@gmail.com)
-- **PIK3CA** and **BRCA1/2** mutations showed strong disease associations.
 ---
-<p align="center">
+### How to Use This Report
-  <img src="output/mutation_disease_counts" alt="Diagram 5">
-  <br>
+- **Clinicians:** Focus on TP53/PIK3CA status for patient stratification.
-  <em>Figure 19: MS Fabric - Mutation</em>
+- **Researchers:** Explore metabolic pathways for novel drug combinations.
-</p>
+- **Data Teams:** Replicate the pipeline using `DataEngineering.tex`.
+### License
-#### 2.2 Protein Interaction Network
+This project is licensed under the MIT License. See the LICENSE file for details.
-![Protein Network]
+### Contact
-**Critical Hubs (High Connectivity):**
+For questions or feedback, please contact Daniel Muthama at [danielmuthama23@gmail.com](mailto:danielmuthama23@gmail.com).
-. **TP53** (4 interactions)
-. **BRCA1** (3 interactions)
+#### Acknowledgments
-. **PIK3CA** (3 interactions)
+Microsoft Fabric for data orchestration.
-**Key Observations:**
+    Azure AI Search for retrieval.
-- Red nodes (PDC-identified proteins) formed central hubs.
-- Green edges (activation) dominated oncogenic pathways (e.g., PIK3CA→AKT1).
----
-<p align="center">
-  <img src="output/protein_network" alt="Diagram 5">
-  <br>
-  <em>Figure 21: MS Fabric - Mutation</em>
-</p>
-#### 2.3 Metabolic Pathway Dysregulation
-![Metabolic Pathways]
-**Most Dysregulated Pathways:**
-. **Glycolysis** (↑ Glucose-6-P, Fructose-1,6-BP)
-. **TCA Cycle** (↓ Succinyl-CoA, ↑ Acetyl-CoA)
-. **Fatty Acid Synthesis** (↑ Malonyl-CoA)
-**Top Dysregulated Metabolite:**
-- **Acetyl-CoA** (2.1-fold change, linked to PTEN mutations).
----
-<p align="center">
-  <img src="output/metabolic_pathways" alt="Diagram 5">
-  <br>
-  <em>Figure 20: MS Fabric - Metabolic Pathway</em>
-</p>
-### 3. Disease Detection Methodology
-#### 3.1 Mutation-Based Detection
-- **Thresholds:** Genes with >70% disease-associated mutations flagged as high-risk.
-- **Validation:** Cross-referenced with COSMIC database.
-#### 3.2 Network Analysis
-- Prioritized **hub genes** (e.g., TP53) as biomarkers.
-- **Inhibition edges** (red) highlighted drug targets (e.g., PTEN→AKT1).
-#### 3.3 Metabolic Insights
-- Glycolysis/TCA cycle disruptions correlated with TP53/PIK3CA mutations.
-- High Acetyl-CoA suggests vulnerability to metabolic inhibitors.
----
-### 4. Conclusions & Recommendations
-From the analysis we can conclude:-
-**Diagnostic Markers:**
-- **TP53 mutations** as universal biomarkers.
-- **PIK3CA activation** signals aggressive subtypes.
-**Therapeutic Targets:**
-- Target **PIK3CA-AKT1 interactions**.
-- Explore **metabolic inhibitors** for Acetyl-CoA-overproducing tumors.
-**Future Works:**
-- Validate with clinical outcomes data.
-- Expand analysis to RNA-seq.
----
-### 5. Files Generated
-| File                          | Description                                  |
-|-------------------------------|----------------------------------------------|
-| `mutation_disease_counts.png` | Top mutated genes with disease associations. |
-| `protein_network.png`         | Protein interaction network with PDC hubs.   |
-| `metabolic_pathways.png`      | Dysregulated metabolic pathways.             |
----
-**Prepared by:** Daniel Muthama
-**Date:** April 2, 2025
-**Contact:** (mailto:danielmuthama23@gmail.com)
----
-### How to Use This Report
-- **Clinicians:** Focus on TP53/PIK3CA status for patient stratification.
-- **Researchers:** Explore metabolic pathways for novel drug combinations.
-- **Data Teams:** Replicate pipeline using `DataEngineering.tex`.
-### License
-This project is licensed under the MIT License. See the LICENSE file for details.
-### Contact
-For questions or feedback, please contact:
-#### Acknowledgments
-Microsoft Fabric for data orchestration.
-    Azure AI Search for retrieval.
     Azure OpenAI for natural language generation.