# Common instructions

- Aim to attempt all the mandatory questions (marked with a *) in the problem set. 


- Only attempt optional questions after you have attempted all the mandatory questions.


- More credit will be given if you have successfully attempted all the mandatory questions, even if you do not attempt a single optional question, as opposed to missing even one mandatory question while attempting all the optional questions.


- With the above caveat, attempt as many questions as possible within the time period. Partially attempted questions will get partial credit.


- Normally, you should work through the problem set in ascending order (Q1 -> Q4).


- Clean, labeled plots and clear data interpretation will boost your score. So will the use of functions, meaningful variable names, and readable code.


- You have a maximum of 2 days to work on the assignment. We will not consider assignments submitted after the deadline. You are free to search the internet, but you must not discuss the assignment with others in any way, on pain of immediate disqualification.


- Report the websites you used to obtain help. Before the deadline, create a single .zip file containing all your code and submit it via the submission link provided to you in the email. DO NOT include data in your zip file.


- You can use any programming language of your choice to solve all or part of the questions, preferably in a notebook environment such as Jupyter Notebook or Google Colab. We should be able to execute your program(s) to generate the required data and plots.


- If you are unable to complete some parts, clearly describe how you would approach the task and what steps you would try.


# Input data

- The TSV file `SampleData.tsv` has the following columns

    - <b>Sample</b>: Sample IDs (S01, S02, S03...)
    - <b>Treatment</b>: Information on sample type 
    
        - <i>HF+</i>: Blood plasma samples collected after major surgery from coronary disease patients who had heart failure within 3 years of surgery
        - <i>HF-</i>: Blood plasma samples collected after major surgery from coronary disease patients who recovered without heart failure
        - <i>HVOL</i>: Blood plasma samples collected from individuals without any discernible coronary disease
        
        
    
- The gzipped file `GSE208194_RawTPM.csv.gz` contains gene expression information for the samples listed in `SampleData.tsv`. The file structure looks like this:

ENSEMBL ID        |S01      |S02      |S03      |S04
:-----------------|:--------|:--------|:--------|:--------
ENSG00000000419.12|2.398878 |12.157726|1.40211  |7.667875
ENSG00000000938.13|3.324077 |13.971038|1.917631 |10.225812
ENSG00000001629.10|12.037059|1.453811 |12.596614|15.799738
ENSG00000001631.15|1.287932 |5.842868 |1.412257 |1.526812

Each row is a feature/gene (n=4150) and each column is a sample in which the features are measured.


In [1]:
#Reading the SampleData file 
import pandas as pd
path1='/raid/home/ankitsingh1/StrandAi/SampleData.tsv'
SampleData_df = pd.read_csv(path1, sep='\t') 
SampleData_df

Unnamed: 0,sample,treatment
0,S01,HF-
1,S02,HF-
2,S03,HF-
3,S04,HF-
4,S05,HF-
...,...,...
87,S88,HVOL
88,S89,HVOL
89,S90,HVOL
90,S95,HF-


In [2]:
#Reading the Gene Expression
path2="/raid/home/ankitsingh1/StrandAi/GSE208194_RawTPM.csv"
Gene_exp_df= pd.read_csv(path2) 
Gene_exp_df

Unnamed: 0,ENSEMBL ID,S01,S02,S03,S04,S05,S06,S07,S08,S09,...,S83,S84,S85,S86,S87,S88,S89,S90,S95,S96
0,ENSG00000000419.12,2.398878,12.157726,1.402110,7.667875,4.198525,4.069709,5.573672,1.263016,0.000000,...,13.014835,4.824927,2.999915,7.365408,5.487723,25.222543,17.734630,5.057123,0.968332,3.757043
1,ENSG00000000938.13,3.324077,13.971038,1.917631,10.225812,2.847621,1.744719,5.592130,7.549874,8.696941,...,13.342512,1.015803,9.047389,4.224286,5.467746,17.525907,10.903512,1.257853,1.025641,8.876226
2,ENSG00000001629.10,12.037059,1.453811,12.596614,15.799738,17.246205,7.376407,7.310507,7.439008,23.898468,...,15.474715,24.929106,15.789732,22.172867,15.370133,89.995748,32.951226,12.263058,5.968981,14.964376
3,ENSG00000001631.15,1.287932,5.842868,1.412257,1.526812,7.261625,4.530184,0.000000,5.598395,5.140090,...,4.337378,4.804729,3.221380,3.440060,10.331736,7.899247,11.006974,6.447497,2.786651,5.343678
4,ENSG00000002549.12,2.914606,32.566404,2.231871,35.444900,9.212083,5.219885,4.155680,8.677364,29.230862,...,14.737366,8.268794,7.478753,6.658792,9.647899,28.691720,33.123814,6.709470,1.214385,3.023686
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4145,ENSG00000286834.1,1.446729,101.312666,17.106360,109.490990,33.096990,4.787487,30.199864,17.304912,29.437844,...,42.816479,6.306396,11.959351,4.344150,13.036536,23.370109,34.169966,5.444215,4.262073,13.917467
4146,ENSG00000287080.1,8.543461,48.482781,11.492971,40.763492,23.798352,8.696239,37.818536,18.158117,34.491123,...,21.921759,6.095591,1.362975,5.928143,0.704842,2.688918,9.531408,2.823952,0.107894,12.110910
4147,ENSG00000287160.1,3.392143,57.089139,7.379778,67.437197,17.519094,2.840147,10.407136,8.292886,5.282977,...,45.515759,3.203770,2.802123,1.286441,7.848437,23.094887,19.040575,1.324381,4.329463,9.970802
4148,ENSG00000287825.1,8.543410,26.198111,9.731645,5.939315,13.911370,14.807156,12.884268,0.353449,16.483560,...,0.000000,30.601166,18.953867,19.686259,20.540434,7.114563,4.664343,20.376778,1.601362,9.273306
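
Because the expression matrix is genes x samples while most modelling tools expect samples x features, and because the sample IDs in the two files do not fully overlap (for example, `S96` appears in the expression columns), the tables need to be transposed and aligned before any modelling. The following is a minimal sketch of that step on toy frames. The column names `sample` and `treatment` follow the output shown above; the toy values and the variable names `X` and `y` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for SampleData_df and Gene_exp_df (a few values copied from above).
sample_data = pd.DataFrame(
    {"sample": ["S01", "S02", "S03", "S88"],
     "treatment": ["HF-", "HF-", "HF-", "HVOL"]}
)
gene_exp = pd.DataFrame(
    {"ENSEMBL ID": ["ENSG00000000419.12", "ENSG00000000938.13"],
     "S01": [2.398878, 3.324077],
     "S02": [12.157726, 13.971038],
     "S03": [1.402110, 1.917631],
     "S04": [7.667875, 10.225812]}
)

# Transpose so that rows are samples and columns are genes.
X = gene_exp.set_index("ENSEMBL ID").T

# Look up the label of each sample by its ID.
labels = sample_data.set_index("sample")["treatment"]

# Keep only samples present in both tables, in expression-matrix order.
shared = X.index[X.index.isin(labels.index)]
X = X.loc[shared]
y = labels.loc[shared]

print(X.shape)
print(y.value_counts())
```

Samples that appear in only one of the two tables (here `S04` and `S88`) are dropped by this alignment, so it is worth reporting how many were lost.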


# Feature filter*

- Remove genes/features with very low expression values (< 1) in 70% of samples (even if they are good features, they cannot be used for diagnostics)
- Remove one member of each highly correlated feature pair
- Provide statistics on the removals

In [4]:

# Print every row (gene) of the expression table; `row[]` is a syntax error
for index, row in Gene_exp_df.iterrows():
    print(row)

ENSEMBL ID    ENSG00000000419.12
S01                     2.398878
S02                    12.157726
S03                      1.40211
S04                     7.667875
                     ...        
S88                    25.222543
S89                     17.73463
S90                     5.057123
S95                     0.968332
S96                     3.757043
Name: 0, Length: 90, dtype: object
ENSEMBL ID    ENSG00000000938.13
S01                     3.324077
S02                    13.971038
S03                     1.917631
S04                    10.225812
                     ...        
S88                    17.525907
S89                    10.903512
S90                     1.257853
S95                     1.025641
S96                     8.876226
Name: 1, Length: 90, dtype: object
ENSEMBL ID    ENSG00000001629.10
S01                    12.037059
S02                     1.453811
S03                    12.596614
S04                    15.799738
                     ...        
S88   

In [3]:
# Place your code here

genes=[]
genes_cols= Gene_exp_df.columns.values.tolist()

print(genes_cols)
genes_cols.remove("ENSEMBL ID")
print(genes_cols)

['ENSEMBL ID', 'S01', 'S02', 'S03', 'S04', 'S05', 'S06', 'S07', 'S08', 'S09', 'S10', 'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19', 'S20', 'S21', 'S22', 'S23', 'S24', 'S26', 'S27', 'S28', 'S29', 'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37', 'S38', 'S39', 'S40', 'S41', 'S42', 'S44', 'S45', 'S46', 'S47', 'S48', 'S49', 'S50', 'S51', 'S52', 'S53', 'S54', 'S55', 'S56', 'S57', 'S58', 'S59', 'S60', 'S61', 'S62', 'S63', 'S64', 'S65', 'S66', 'S67', 'S68', 'S69', 'S70', 'S71', 'S72', 'S73', 'S74', 'S75', 'S77', 'S78', 'S79', 'S80', 'S81', 'S82', 'S83', 'S84', 'S85', 'S86', 'S87', 'S88', 'S89', 'S90', 'S95', 'S96']
['S01', 'S02', 'S03', 'S04', 'S05', 'S06', 'S07', 'S08', 'S09', 'S10', 'S11', 'S12', 'S13', 'S14', 'S15', 'S16', 'S17', 'S18', 'S19', 'S20', 'S21', 'S22', 'S23', 'S24', 'S26', 'S27', 'S28', 'S29', 'S30', 'S31', 'S32', 'S33', 'S34', 'S35', 'S36', 'S37', 'S38', 'S39', 'S40', 'S41', 'S42', 'S44', 'S45', 'S46', 'S47', 'S48', 'S49', 'S50', 'S51', 'S52', 'S53', 'S54

In [4]:
Gene_exp_df['S14']

0        3.852019
1        1.473123
2       14.680768
3        1.117000
4        4.570469
          ...    
4145     6.152377
4146    24.942787
4147     3.686996
4148    18.347209
4149     4.514240
Name: S14, Length: 4150, dtype: float64

In [5]:
Gene_exp_df.shape[0]

4150

In [6]:
# Gene_exp_df.shape[0] = 4150
threshold = 2905  # 70% of 4150 (total rows)
remove_genes = []
new_genes = []
# Note: genes_cols holds sample IDs, so `count` is the number of genes with
# expression < 1 in each sample column, not the number of samples per gene.
for g in genes_cols:
    count = int((Gene_exp_df[g] < 1).sum())
    if count >= threshold:
        remove_genes.append(g)
    else:
        new_genes.append(g)
    print(count)

107
63
59
21
62
16
90
38
290
52
52
33
104
48
80
74
109
6
320
80
132
41
158
74
35
99
80
40
67
28
29
13
129
91
120
47
84
129
47
238
39
69
45
174
95
127
28
132
38
85
162
65
41
207
13
57
320
91
246
3
8
64
42
15
9
12
106
18
15
36
13
11
8
3
9
24
36
8
154
118
22
15
8
3
45
23
73
456
18


In [7]:

remove_genes

[]

In [8]:
len(new_genes)

89
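
The loop above counts low values per sample column, which is why nothing is removed. The task statement asks for the count per gene across samples. A minimal sketch of a gene-wise filter, followed by correlation pruning and removal statistics, is given below. It runs on a small hand-made matrix; the function names, the "at least 70%" reading of the rule, and the 0.9 correlation cutoff are all assumptions, not values from the assignment.

```python
import numpy as np
import pandas as pd


def filter_low_expression(expr, min_value=1.0, max_low_fraction=0.7):
    """Drop genes whose value is < min_value in at least max_low_fraction of samples.

    `expr` is a genes x samples DataFrame indexed by gene ID.
    """
    low_fraction = (expr < min_value).mean(axis=1)  # fraction of low samples per gene
    return expr.loc[low_fraction < max_low_fraction]


def drop_correlated(expr, max_abs_corr=0.9):
    """For each gene pair with |Pearson r| > max_abs_corr, drop the later gene."""
    corr = expr.T.corr().abs()  # gene x gene correlation matrix
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [gene for gene in upper.columns if (upper[gene] > max_abs_corr).any()]
    return expr.drop(index=to_drop)


# Hand-made toy matrix: g2 is mostly zero, g4 is a linear copy of g1.
g1 = [5, 8, 3, 9, 4, 7, 6, 2, 10, 5]
toy = pd.DataFrame(
    {"g1": g1,
     "g2": [0, 0, 0, 0, 0, 0, 0, 0, 3, 4],
     "g3": [9, 2, 7, 3, 8, 5, 2, 9, 4, 6],
     "g4": [2 * v + 1 for v in g1]},
    index=[f"S{i:02d}" for i in range(1, 11)],
).T.astype(float)

expressed = filter_low_expression(toy)
pruned = drop_correlated(expressed)

print(f"genes at start:               {len(toy)}")
print(f"removed for low expression:   {len(toy) - len(expressed)}")
print(f"removed for high correlation: {len(expressed) - len(pruned)}")
print(f"genes kept:                   {len(pruned)}")
```

On the real data, `expr` would be `Gene_exp_df.set_index("ENSEMBL ID")`. Ideally the correlation step is computed on training samples only, so that the test set does not influence which features survive.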

# Feature elimination*

From the feature matrix `GSE208194_RawTPM.csv.gz`, split the samples into training and test sets. On the training set:
- Using recursive feature elimination, identify the optimal number of features out of the 4150 that can predict the status of an individual as one of:
    - Likely to have coronary disease that can lead to heart failure (HF+)
    - Likely to have coronary disease that may not lead to heart failure (HF-)
    - Healthy individual (HVOL)
    
- Plot the accuracy of the model against the number of features used.

In [None]:
# Place your code here

# Model building*

With the N optimal features identified, use a feature selection method of your choice to select the top N features, build a multi-class classifier with cross-validation, and report its accuracy and any other metric deemed suitable.


In [None]:
# Place your code here
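
A minimal sketch of one way to do this is a scikit-learn `Pipeline` that selects the top N features with `SelectKBest` and then fits a classifier, evaluated with `cross_validate`. Putting the selection inside the pipeline means it is refit on each training fold, which avoids leaking information from the validation fold. The data are synthetic, and `N_FEATURES = 5`, the ANOVA F-score, and the random forest are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline

N_FEATURES = 5  # placeholder for the optimal N found by feature elimination

# Synthetic stand-in for the training samples and their 3 class labels.
X_train, y_train = make_classification(
    n_samples=90, n_features=30, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

model = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=N_FEATURES)),
    ("classify", RandomForestClassifier(n_estimators=200, random_state=0)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    model, X_train, y_train, cv=cv, scoring=["accuracy", "f1_macro"]
)

print("accuracy per fold:", scores["test_accuracy"].round(3))
print("mean accuracy:    ", scores["test_accuracy"].mean().round(3))
print("mean macro F1:    ", scores["test_f1_macro"].mean().round(3))
```

Macro-averaged F1 is reported alongside accuracy because it weights the three classes equally, which matters if the classes are imbalanced.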

# Model evaluation*

- Apply the model to the test set and report the accuracy and per-class sensitivity, i.e. the total number of samples in a given class within the test set and how many of them were predicted correctly.
- Also report, as a binary confusion matrix, how many samples were correctly identified as HF+/HF- and HVOL

|    |      HF+/-      |  HVOL |
|----------|:-------------:|------:|
| HF+/- |  . | . |
| HVOL |    .   | . |


In [None]:
# Place your code here
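
The requested metrics can be sketched as follows. Per-class sensitivity is the recall of each class, and the binary matrix is obtained by collapsing HF+ and HF- into a single label before building a confusion matrix. The `y_true` and `y_pred` arrays here are hand-made toy labels for illustration only, not results from any model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

classes = ["HF+", "HF-", "HVOL"]

# Toy labels for illustration; replace with y_test and model.predict(X_test).
y_true = np.array(["HF+", "HF+", "HF-", "HF-", "HF-", "HVOL", "HVOL", "HVOL"])
y_pred = np.array(["HF+", "HF-", "HF-", "HF-", "HVOL", "HVOL", "HVOL", "HF+"])

accuracy = accuracy_score(y_true, y_pred)

# Sensitivity per class = correctly predicted samples / total samples of that class.
sensitivity = recall_score(y_true, y_pred, labels=classes, average=None)
per_class = pd.DataFrame(
    {"total": [(y_true == c).sum() for c in classes],
     "correct": [((y_true == c) & (y_pred == c)).sum() for c in classes],
     "sensitivity": sensitivity},
    index=classes,
)


def to_binary(labels):
    """Collapse HF+ and HF- into one 'HF+/-' label; everything else is HVOL."""
    return np.where(np.isin(labels, ["HF+", "HF-"]), "HF+/-", "HVOL")


binary_classes = ["HF+/-", "HVOL"]
binary_matrix = pd.DataFrame(
    confusion_matrix(to_binary(y_true), to_binary(y_pred), labels=binary_classes),
    index=binary_classes,      # rows: true label
    columns=binary_classes,    # columns: predicted label
)

print(f"accuracy: {accuracy:.3f}")
print(per_class)
print(binary_matrix)
```

In the binary matrix, rows are true labels and columns are predictions, matching the layout of the table in the task statement.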

# Biological significance of features

- Take the top 5 features and search for their gene names in [Gene Cards](https://www.genecards.org/) after removing the version suffix, for example:
  ENSG00000000419.12 -> ENSG00000000419
  
- Take the gene names identified and search [PubMed](https://pubmed.ncbi.nlm.nih.gov/) for 
  "Gene name" AND "coronary disease"
  
- Report anything of interest found by reading the abstracts of the top hits in PubMed.
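
The ID clean-up and query construction described above can be sketched with the standard library alone. The helper names are hypothetical, the gene symbol passed in is a placeholder to be replaced with the name found on Gene Cards, and the `?term=` search URL is assumed to be PubMed's usual query form.

```python
from urllib.parse import quote_plus


def strip_version(ensembl_id):
    """Remove the version suffix: 'ENSG00000000419.12' -> 'ENSG00000000419'."""
    return ensembl_id.split(".", 1)[0]


def pubmed_query(gene_name, topic="coronary disease"):
    """Build the search string '"Gene name" AND "coronary disease"'."""
    return f'"{gene_name}" AND "{topic}"'


def pubmed_url(gene_name):
    """URL-encode the query for the PubMed search page (assumed ?term= form)."""
    return "https://pubmed.ncbi.nlm.nih.gov/?term=" + quote_plus(pubmed_query(gene_name))


top_features = ["ENSG00000000419.12", "ENSG00000000938.13"]  # IDs from the table above
print([strip_version(f) for f in top_features])

# GENE_SYMBOL is a placeholder, not a real lookup result.
print(pubmed_query("GENE_SYMBOL"))
print(pubmed_url("GENE_SYMBOL"))
```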