# AI4All_Genomics

AI4All@Princeton is a summer camp that aims to promote diversity in computer science by teaching AI to young students of diverse backgrounds (https://ai4all.princeton.edu). In this module, we investigate human genomic diversity by exploring natural genomic variation among world populations.

Relevant Links:
- IGSR home: http://www.internationalgenome.org/home
- Phase 3 1000 Genomes Results: https://www.nature.com/articles/nature15393#abstract
- Phase 3 1000 Genomes Data: https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
- Phase 3 1000 Genomes Processing Example: https://bitbucket.org/remills/1000gp_sv_phase3/src/master/
  - *** see gwas_sv_ld_filt_af.txt

Data Preprocessing:
- Run GATK software to subset SNPs: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_variantutils_SelectVariants.php
  - run `./gatk/gatk SelectVariants -V input.vcf -O output.vcf --keep-ids gwas_sv_ld_RSIDs.list`
  - example output.vcf: chrX_filtered.txt
- Concatenate the filtered VCF files using VCFtools
  - https://vcftools.github.io/man_latest.html
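To make the concatenation step concrete, here is a minimal pure-Python sketch of what joining VCF files involves. It is an illustration only, not a replacement for VCFtools: the function name `concat_vcf_lines` and the toy records are hypothetical, and the sketch assumes every input file lists the same samples in the same order and that the files are passed in the desired output order.

```python
from typing import Iterable, List


def concat_vcf_lines(vcfs: Iterable[Iterable[str]]) -> List[str]:
    """Concatenate VCF files given as iterables of text lines.

    Header lines (starting with '#') are kept from the first file only.
    Every later file must have an identical '#CHROM' column line,
    i.e. the same samples in the same order.
    """
    merged: List[str] = []
    column_line = None
    for index, lines in enumerate(vcfs):
        for raw in lines:
            line = raw.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line.startswith("#"):
                if line.startswith("#CHROM"):
                    if column_line is None:
                        column_line = line
                    elif line != column_line:
                        raise ValueError(f"input {index} has different sample columns")
                if index == 0:
                    merged.append(line)  # keep the header once
            else:
                merged.append(line)  # variant record
    return merged


# Toy input with made-up sample names and variants (not real data).
HEADER = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B",
]
chr1 = HEADER + ["1\t100\trs_toy1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1"]
chr2 = HEADER + ["2\t200\trs_toy2\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1"]

merged = concat_vcf_lines([chr1, chr2])
print("\n".join(merged))
```

With real files, each input can be an open file handle, since iterating over a handle yields lines.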

Input Data:
- "chr01-22_filtered.vcf"
  - to download: https://drive.google.com/drive/folders/1O7cRyGbEHrkjiCAkaKODN0V2HYTDXjss
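As a rough sketch of how a filtered VCF like this could be read into a form suitable for analysis, the example below converts genotype calls into alternate-allele counts (0, 1, or 2 for diploid calls). The helper names `genotype_to_dosage` and `read_dosages` are hypothetical, the toy records are invented, and any non-reference allele is counted as "alternate", which is a simplification for multi-allelic sites.

```python
from typing import Iterable, List, Optional, Tuple


def genotype_to_dosage(genotype: str) -> Optional[int]:
    """Convert a GT string such as '0|1' or '1/1' to an alternate-allele count."""
    alleles = genotype.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return None  # missing call
    return sum(1 for allele in alleles if allele != "0")


def read_dosages(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[List[Optional[int]]]]:
    """Return (sample names, variant IDs, variants-by-samples dosage matrix)."""
    samples: List[str] = []
    variant_ids: List[str] = []
    matrix: List[List[Optional[int]]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns follow the 9 fixed columns
            continue
        gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
        variant_ids.append(fields[2])
        matrix.append(
            [genotype_to_dosage(call.split(":")[gt_index]) for call in fields[9:]]
        )
    return samples, variant_ids, matrix


# Toy input with made-up sample names and variants (not real data).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B",
    "1\t100\trs_toy1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1",
    "1\t250\trs_toy2\tC\tT\t.\tPASS\t.\tGT\t0/0\t./.",
]

samples, variant_ids, matrix = read_dosages(toy_vcf)
print(samples)
print(variant_ids)
print(matrix)
```

For the real file, `read_dosages(open(path))` would stream the lines; for large inputs a dedicated VCF library is likely a better choice than this sketch.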

To Do:
- Outline the first two weeks
- Implement notebooks for the first two weeks (goals)
- Start the first 3 mini lectures

- Plan (and test) which ML algorithms to introduce to students (clustering, standard prediction, etc.)
- Brainstorm intro slides on SNPs
- Explore the data with Python and benchmark runtimes

Files:
- 1000genomes_dataExploration.ipynb: preliminary exploration of 1000 Genomes data (PCA, SVM, data cleanup and filtering)
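As an illustration of one common filtering step, the sketch below drops variants with a low minor allele frequency (MAF). This is an assumption about what "filtering" might involve, not a description of what the notebook actually does. It assumes genotypes are already encoded as alternate-allele counts (0, 1, 2, or `None` for a missing call) in a variants-by-samples list of lists; the function names and the `min_maf` default are hypothetical.

```python
from typing import List, Optional, Sequence


def alt_allele_frequency(
    dosages: Sequence[Optional[int]], ploidy: int = 2
) -> Optional[float]:
    """Alternate-allele frequency for one variant, ignoring missing calls."""
    called = [d for d in dosages if d is not None]
    if not called:
        return None  # no usable genotype calls
    return sum(called) / (ploidy * len(called))


def filter_by_maf(
    matrix: Sequence[Sequence[Optional[int]]], min_maf: float = 0.05
) -> List[int]:
    """Return the row indices of variants whose MAF is at least min_maf."""
    kept: List[int] = []
    for row_index, row in enumerate(matrix):
        freq = alt_allele_frequency(row)
        if freq is None:
            continue
        maf = min(freq, 1.0 - freq)  # the minor allele is the rarer one
        if maf >= min_maf:
            kept.append(row_index)
    return kept


# Toy dosage matrix: 4 variants (rows) by 4 samples (columns), invented values.
toy_matrix = [
    [0, 0, 0, 0],     # monomorphic reference, MAF 0
    [1, 2, 0, None],  # 3 alt alleles out of 6 called, MAF 0.5
    [2, 2, 2, 2],     # monomorphic alternate, MAF 0
    [0, 1, 0, 0],     # 1 alt allele out of 8, MAF 0.125
]

kept_rows = filter_by_maf(toy_matrix, min_maf=0.05)
print(kept_rows)  # [1, 3]
```

Monomorphic variants carry no information for methods such as PCA or SVM, which is one reason a frequency filter is often applied before modeling.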