Diff of /README.md [000000] .. [1a0ad7]

# AI4All_Genomics

AI4All@Princeton is a summer camp that aims to promote diversity in computer science by teaching AI to young students of diverse backgrounds (https://ai4all.princeton.edu). In this module, we investigate human genomic diversity by exploring natural genomic variation between world populations.

Relevant Links:
- IGSR home: http://www.internationalgenome.org/home
- Phase 3 1000 Genomes Results: https://www.nature.com/articles/nature15393#abstract
- Phase 3 1000 Genomes Data: https://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/
- Phase 3 1000 Genomes Processing Example: https://bitbucket.org/remills/1000gp_sv_phase3/src/master/
  - see gwas_sv_ld_filt_af.txt

Data Preprocessing:
- Run GATK SelectVariants to subset SNPs: https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.0.0/org_broadinstitute_hellbender_tools_walkers_variantutils_SelectVariants.php
  - run ./gatk/gatk SelectVariants -V input.vcf -O output.vcf --keep-ids gwas_sv_ld_RSIDs.list
    - example output.vcf: chrX_filtered.txt
- Concatenate the per-chromosome VCF files using VCFtools
  - https://vcftools.github.io/man_latest.html
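
To make the two preprocessing steps above concrete, here is a minimal pure-Python sketch of what they do conceptually: keep only records whose ID is in a list, then join per-chromosome files under a single header. It is not a replacement for GATK or VCFtools. The function names `filter_vcf_by_ids` and `concat_vcfs` and the toy records are hypothetical, and the sketch assumes tab-separated VCF text in which all files share the same sample columns.

```python
from typing import Iterable, Iterator, List, Set


def filter_vcf_by_ids(lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield header lines plus records whose ID column matches keep_ids."""
    for line in lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines pass through
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) is the variant ID; this sketch treats it as a
        # ';'-separated list and keeps the record if any ID matches.
        if len(fields) > 2 and keep_ids.intersection(fields[2].split(";")):
            yield line


def concat_vcfs(vcfs: Iterable[Iterable[str]]) -> List[str]:
    """Keep the header of the first VCF, then append records from every VCF."""
    out: List[str] = []
    for i, lines in enumerate(vcfs):
        for line in lines:
            if line.startswith("#") and i > 0:
                continue  # drop repeated headers from later files
            out.append(line)
    return out


# Toy data (made up for illustration, not taken from the 1000 Genomes files)
header = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
chr1 = header + ["1\t100\trs1\tA\tG\t.\tPASS\t.\n", "1\t200\trs2\tC\tT\t.\tPASS\t.\n"]
chr2 = header + ["2\t150\trs3\tG\tA\t.\tPASS\t.\n", "2\t300\trs4\tT\tC\t.\tPASS\t.\n"]

keep = {"rs1", "rs3"}
filtered = [list(filter_vcf_by_ids(v, keep)) for v in (chr1, chr2)]
merged = concat_vcfs(filtered)
```

In practice the ID list would be read from a file such as gwas_sv_ld_RSIDs.list (one ID per line is assumed here), and the real tools should be preferred for full-size data.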

Input Data:
- "chr01-22_filtered.vcf"
  - to download: https://drive.google.com/drive/folders/1O7cRyGbEHrkjiCAkaKODN0V2HYTDXjss
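
As a starting point for exploring the input file, the sketch below shows one possible way to turn VCF genotype calls into a variants-by-samples matrix of alternate-allele counts (0, 1, or 2 for diploid calls), a common input for PCA or clustering. The helper names `gt_to_dosage` and `vcf_to_matrix` and the toy records are hypothetical. The sketch assumes GT is the first FORMAT subfield and that the file fits in memory.

```python
from typing import Iterable, List, Optional, Tuple


def gt_to_dosage(sample_field: str) -> Optional[int]:
    """Count non-reference alleles in a GT value such as '0|1'; None if missing."""
    gt = sample_field.split(":")[0]  # assumes GT is the first FORMAT subfield
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_to_matrix(
    lines: Iterable[str],
) -> Tuple[List[str], List[str], List[List[Optional[int]]]]:
    """Return (sample names, variant IDs, dosage rows) from VCF text lines."""
    samples: List[str] = []
    ids: List[str] = []
    rows: List[List[Optional[int]]] = []
    for line in lines:
        if line.startswith("##"):
            continue  # skip meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        ids.append(fields[2])
        rows.append([gt_to_dosage(f) for f in fields[9:]])
    return samples, ids, rows


# Toy VCF text (made up for illustration)
toy = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0|0\t.|.\n",
]
samples, ids, rows = vcf_to_matrix(toy)
```

The same function could be applied to an open file handle for chr01-22_filtered.vcf once it has been downloaded. For large files, a dedicated VCF library or a NumPy array would likely be a better fit than nested lists.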

To Do:
- Outline the first two weeks
- Implement notebooks for the first two weeks (goals)
- Start the first 3 mini lectures

- Plan (and test) which ML algorithms to introduce to students (clustering, standard prediction, etc.)
- Brainstorm intro slides on SNPs
- Explore the data with Python and benchmark runtimes

Files:
- 1000genomes_dataExploration.ipynb: preliminary exploration of 1000 Genomes data (PCA, SVM, data cleanup and filtering)