Diff of /README.md [000000] .. [1a0ad7]

--- a
+++ b/README.md
@@ -0,0 +1,33 @@
+# AI4All_Genomics
+
+AI4All@Princeton is a summer camp that aims to promote diversity in computer science by teaching AI to young students from diverse backgrounds (https://ai4all.princeton.edu). In this module, we investigate human genomic diversity by exploring natural genomic variation among world populations.
+
+Relevant Links:
+- IGSR home: http://www.internationalgenome.org/home
+- Phase 3 1000 Genomes Results: https://www.nature.com/articles/nature15393#abstract
+- Phase 3 1000 Genomes Data: https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
+- Phase 3 1000 Genomes Processing Example: https://bitbucket.org/remills/1000gp_sv_phase3/src/master/
+  - Note: see gwas_sv_ld_filt_af.txt
+
+Data Preprocessing:
+- Run GATK SelectVariants to subset SNPs: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_variantutils_SelectVariants.php
+  - Run ./gatk/gatk SelectVariants -V input.vcf -O output.vcf --keep-ids gwas_sv_ld_RSIDs.list
+    - Example output.vcf: chrX_filtered.txt
+- Concatenate the filtered VCF files using VCFtools
+  - https://vcftools.github.io/man_latest.html
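The two preprocessing steps above could be scripted roughly as follows. This is a minimal sketch, not the exact pipeline used for this module: the helper names (`select_variants_cmd`, `run_select_variants`, `concat_vcfs`) are hypothetical, the GATK path and flags are copied from the command above, and the pure-Python concatenation is a stand-in for VCFtools that assumes uncompressed VCF files with identical sample columns.

```python
import subprocess


def select_variants_cmd(input_vcf, output_vcf,
                        ids_file="gwas_sv_ld_RSIDs.list",
                        gatk="./gatk/gatk"):
    """Build the SelectVariants command shown above as an argument list."""
    return [gatk, "SelectVariants",
            "-V", str(input_vcf),
            "-O", str(output_vcf),
            "--keep-ids", str(ids_file)]


def run_select_variants(input_vcf, output_vcf, **kwargs):
    """Run SelectVariants; raises CalledProcessError if GATK exits non-zero."""
    subprocess.run(select_variants_cmd(input_vcf, output_vcf, **kwargs),
                   check=True)


def concat_vcfs(inputs, output):
    """Concatenate uncompressed VCF files into one file.

    Header lines (starting with '#') are kept from the first file only.
    Assumption: every input lists the same samples in the same order.
    """
    with open(output, "w") as out:
        for index, path in enumerate(inputs):
            with open(path) as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    if index > 0 and line.startswith("#"):
                        continue  # drop repeated headers
                    if not line.endswith("\n"):
                        line += "\n"  # guard against a missing final newline
                    out.write(line)
```

For example, `run_select_variants("input.vcf", "output.vcf")` runs the command listed above, and `concat_vcfs(list_of_filtered_files, "chr01-22_filtered.vcf")` merges the per-file results in the order given.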
+
+Input Data:
+- "chr01-22_filtered.vcf"
+  - To download: https://drive.google.com/drive/folders/1O7cRyGbEHrkjiCAkaKODN0V2HYTDXjss
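Once downloaded, the VCF can be turned into a genotype matrix for the notebooks. The sketch below is one possible reader, not the loader used in this repository: `read_genotypes` is a hypothetical helper that assumes an uncompressed, tab-separated VCF whose FORMAT column contains a GT field, and it counts any non-reference allele as 1.

```python
def read_genotypes(path):
    """Parse a VCF into (sample_ids, variant_ids, matrix).

    matrix[i][j] is the number of non-reference alleles that sample j
    carries at variant i, or None when the genotype is missing.
    """
    samples, variant_ids, matrix = [], [], []
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("##"):
                continue  # skip blank lines and meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample IDs follow the 9 fixed columns
                continue
            gt_index = fields[8].split(":").index("GT")
            row = []
            for call in fields[9:]:
                genotype = call.split(":")[gt_index]
                # Treat phased ('|') and unphased ('/') calls the same way.
                alleles = genotype.replace("|", "/").split("/")
                if "." in alleles:
                    row.append(None)
                else:
                    row.append(sum(1 for allele in alleles if allele != "0"))
            variant_ids.append(fields[2])
            matrix.append(row)
    return samples, variant_ids, matrix
```

Calling `read_genotypes("chr01-22_filtered.vcf")` returns the sample IDs, the variant IDs (the VCF ID column), and one row of allele counts per variant.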
+
+To Do:
+- Outline the first two weeks
+- Implement notebooks for the first two weeks (goals)
+- Start the first three mini-lectures
+
+- Plan (and test) which ML algorithms to introduce to students (clustering, standard prediction, etc.)
+- Brainstorm intro slides on SNPs
+- Explore the data with Python and benchmark runtimes
+
+Files:
+- 1000genomes_dataExploration.ipynb: preliminary exploration of 1000 Genomes data (PCA, SVM, data cleanup and filtering)
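To give a feel for the PCA step mentioned above, here is a small dependency-free sketch that finds the first principal component by power iteration. It is an illustration only and does not reproduce the notebook's code: `first_principal_component` is a hypothetical helper, and it assumes one row of numeric features (for example, allele counts) per sample with no missing values.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (component, scores) for the first principal component.

    rows: one list of numeric features per sample.
    component: unit-length direction of greatest variance (sign is arbitrary).
    scores: projection of each mean-centered sample onto that direction.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Random start so the vector is almost surely not orthogonal to the answer.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(w * w for w in v))
    v = [w / norm for w in v]

    for _ in range(iterations):
        # Multiply by X^T X in two steps: scores = X v, then v_new = X^T scores.
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v_new = [sum(scores[i] * centered[i][j] for i in range(n))
                 for j in range(d)]
        norm = math.sqrt(sum(w * w for w in v_new))
        if norm == 0.0:
            break  # no variance in the data
        v = [w / norm for w in v_new]

    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores
```

Samples with similar genotypes get similar scores, so plotting the scores (or coloring them by population) is a simple first look at population structure. A library implementation of PCA would normally be preferred for data at the scale of the full VCF.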