Compare revisions

AOUAD Mohamed, Jad · e77fa008
--- a/.ipynb_checkpoints/draft_kidney-checkpoint.ipynb
+++ b/.ipynb_checkpoints/draft_kidney-checkpoint.ipynb
--- a/binary_classification_workflow.py
+++ b/binary_classification_workflow.py
@@ -3,7 +3,7 @@
 Spyder Editor

 This file contains the preprocessing functions needed to clean 
-and prepare the data. We first consider the data related to kidney diseases.
+and prepare the data.
 """

 import seaborn as sns
@@ -24,6 +24,7 @@ from binary_classification_workflow import *


 """
+kidney data
 data description : 25 features  ( 11  numeric ,14  nominal)
    Numerical Data (11):
        1. age: Age in years
@@ -400,7 +401,8 @@ def split(df, target,alpha=0.2,n=5):

 def convert_categorical_feats(df, categorical_cols):
    """
-    Encode the categorical features of the dataset using OrdinalEncoder and OneHotEncoder.
+    Encode the categorical features of the dataset using OrdinalEncoder 
+    and OneHotEncoder.

    Parameters:
    ----------

--- a/draft_kidney.ipynb
+++ b/draft_kidney.ipynb
 %% Cell type:code id: tags:
  
 ``` python
 import pandas as pd
 import matplotlib.pyplot as plt
 from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
 from sklearn.linear_model import LogisticRegression
 from binary_classification_workflow import *
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney = pd.read_csv('./data/kidney_disease.csv')
 df_kidney.info()
  
 nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
 print(f"Number of rows : {len(df_kidney)}")
 print(f"Number of rows with at least one NAN value: {nan_count}")
 print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
      f" missing value")
 ```
  
 %% Output
  
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 400 entries, 0 to 399
    Data columns (total 26 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              400 non-null    int64
     1   age             391 non-null    float64
     2   bp              388 non-null    float64
     3   sg              353 non-null    float64
     4   al              354 non-null    float64
     5   su              351 non-null    float64
     6   rbc             248 non-null    object
     7   pc              335 non-null    object
     8   pcc             396 non-null    object
     9   ba              396 non-null    object
     10  bgr             356 non-null    float64
     11  bu              381 non-null    float64
     12  sc              383 non-null    float64
     13  sod             313 non-null    float64
     14  pot             312 non-null    float64
     15  hemo            348 non-null    float64
     16  pcv             330 non-null    object
     17  wc              295 non-null    object
     18  rc              270 non-null    object
     19  htn             398 non-null    object
     20  dm              398 non-null    object
     21  cad             398 non-null    object
     22  appet           399 non-null    object
     23  pe              399 non-null    object
     24  ane             399 non-null    object
     25  classification  400 non-null    object
    dtypes: float64(11), int64(1), object(14)
    memory usage: 81.4+ KB
    Number of rows : 400
    Number of rows with at least one NAN value: 242
    60% of our rows have at least one missing value
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.sample(5)
 ```
  
 %% Output
  
          id   age     bp     sg   al   su       rbc        pc         pcc  \
    367  367  68.0   60.0  1.025  0.0  0.0    normal    normal  notpresent
    100  100  34.0   70.0  1.015  4.0  0.0  abnormal  abnormal  notpresent
    67    67  45.0   80.0  1.020  3.0  0.0    normal  abnormal  notpresent
    76    76  48.0   80.0  1.005  4.0  0.0  abnormal  abnormal  notpresent
    133  133  70.0  100.0  1.015  4.0  0.0    normal    normal  notpresent
    
                 ba  ...  pcv      wc   rc  htn   dm  cad appet   pe ane  \
    367  notpresent  ...   50    6700  6.1   no   no   no  good   no  no
    100  notpresent  ...  NaN     NaN  NaN   no   no   no  good  yes  no
    67   notpresent  ...  NaN     NaN  NaN   no   no   no  poor   no  no
    76      present  ...   36  \t6200    4   no  yes   no  good  yes  no
    133  notpresent  ...   37  \t8400  8.0  yes   no   no  good   no  no
    
        classification
    367         notckd
    100            ckd
    67             ckd
    76             ckd
    133            ckd
    
    [5 rows x 26 columns]
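
In the sample above, `wc` holds values such as `\t6200` with a stray tab, which may be why pandas reads `pcv`, `wc` and `rc` as `object` columns. The sketch below is not part of the notebook; it shows one way such cells could be coerced to numbers. The helper name `parse_number` and the `"?"` placeholder value are assumptions made for illustration.

```python
def parse_number(raw):
    """Convert a raw cell to float; return None when it cannot be parsed."""
    if raw is None:
        return None
    try:
        # strip() removes stray tabs and spaces around the digits
        return float(str(raw).strip())
    except ValueError:
        return None


# "\t6200", "4" and "4.0" appear in the notebook output; "?" is a made-up bad value.
print([parse_number(v) for v in ["\t6200", "4", "4.0", None, "?"]])
# [6200.0, 4.0, 4.0, None, None]
```

With pandas, a comparable result could likely be obtained with `pd.to_numeric(df_kidney["wc"].str.strip(), errors="coerce")`, which turns unparseable entries into NaN.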
  
 %% Cell type:code id: tags:
  
 ``` python
 numerical_columns = get_numerical_columns(df_kidney)
 nominal_columns = get_categorical_columns(df_kidney)
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 ##
 print(numerical_columns,
 nominal_columns)
 ```
  
 %% Output
  
    ['id', 'age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo'] ['rbc', 'pc', 'pcc', 'ba', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'classification']
  
 %% Cell type:code id: tags:
  
 ``` python
 # visualise_numerical_data(df_kidney)
 visualise_numerical_data(df_kidney,columns=numerical_columns)
 ```
  
 %% Output
  
    [plots of the numerical columns; images not preserved]
  
 %% Cell type:code id: tags:
  
 ``` python
 fill_categorical_kidney(df_kidney,nominal_columns)
 df_kidney.info()
 nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
 print(f"Number of rows : {len(df_kidney)}")
 print(f"Number of rows with at least one NaN value: {nan_count}")
 print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
      f" missing value")
 ```
  
 %% Output
  
    Going through each categorical feature...: 100%|██████████| 14/14 [00:00<00:00, 353.70it/s]
  
    
    Processing column: rbc
    Possible categories and their frequencies:
    rbc
    normal      0.810484
    abnormal    0.189516
    Name: proportion, dtype: float64
    
    Processing column: pc
    Possible categories and their frequencies:
    pc
    normal      0.773134
    abnormal    0.226866
    Name: proportion, dtype: float64
    
    Processing column: pcc
    Possible categories and their frequencies:
    pcc
    notpresent    0.893939
    present       0.106061
    Name: proportion, dtype: float64
    
    Processing column: ba
    Possible categories and their frequencies:
    ba
    notpresent    0.944444
    present       0.055556
    Name: proportion, dtype: float64
    
    Processing column: pcv
    Possible categories and their frequencies:
    pcv
    41    0.063830
    52    0.063830
    44    0.057751
    48    0.057751
    40    0.048632
    43    0.045593
    42    0.039514
    45    0.039514
    32    0.036474
    50    0.036474
    36    0.036474
    33    0.036474
    28    0.036474
    34    0.033435
    37    0.033435
    30    0.027356
    29    0.027356
    35    0.027356
    46    0.027356
    31    0.024316
    24    0.021277
    39    0.021277
    26    0.018237
    38    0.015198
    53    0.012158
    51    0.012158
    49    0.012158
    47    0.012158
    54    0.012158
    25    0.009119
    27    0.009119
    22    0.009119
    19    0.006079
    23    0.006079
    15    0.003040
    21    0.003040
    20    0.003040
    17    0.003040
    9     0.003040
    18    0.003040
    14    0.003040
    16    0.003040
    Name: proportion, dtype: float64
    
    Processing column: wc
    Possible categories and their frequencies:
    wc
    9800     0.037415
    6700     0.034014
    9600     0.030612
    7200     0.030612
    9200     0.030612
               ...
    19100    0.003401
    12300    0.003401
    16700    0.003401
    14900    0.003401
    2600     0.003401
    Name: proportion, Length: 89, dtype: float64
    
    Processing column: rc
    Possible categories and their frequencies:
    rc
    5.2    0.066914
    4.5    0.059480
    4.9    0.052045
    4.7    0.040892
    4.8    0.037175
    3.9    0.037175
    4.6    0.033457
    3.4    0.033457
    5.9    0.029740
    5.5    0.029740
    6.1    0.029740
    5.0    0.029740
    3.7    0.029740
    5.3    0.026022
    5.8    0.026022
    5.4    0.026022
    3.8    0.026022
    5.6    0.022305
    4.3    0.022305
    4.2    0.022305
    3.2    0.018587
    4.4    0.018587
    5.7    0.018587
    6.4    0.018587
    5.1    0.018587
    6.2    0.018587
    6.5    0.018587
    4.1    0.018587
    3.6    0.014870
    6.0    0.014870
    6.3    0.014870
    4.0    0.011152
    3.5    0.011152
    3.3    0.011152
    4      0.011152
    5      0.007435
    3.1    0.007435
    2.6    0.007435
    2.1    0.007435
    2.9    0.007435
    2.5    0.007435
    3.0    0.007435
    2.7    0.007435
    2.8    0.007435
    2.3    0.003717
    2.4    0.003717
    3      0.003717
    8.0    0.003717
    Name: proportion, dtype: float64
    
    Processing column: htn
    Possible categories and their frequencies:
    htn
    no     0.630653
    yes    0.369347
    Name: proportion, dtype: float64
    
    Processing column: dm
    Possible categories and their frequencies:
    dm
    no     0.655779
    yes    0.344221
    Name: proportion, dtype: float64
    
    Processing column: cad
    Possible categories and their frequencies:
    cad
    no     0.914573
    yes    0.085427
    Name: proportion, dtype: float64
    
    Processing column: appet
    Possible categories and their frequencies:
    appet
    good    0.794486
    poor    0.205514
    Name: proportion, dtype: float64
    
    Processing column: pe
    Possible categories and their frequencies:
    pe
    no     0.809524
    yes    0.190476
    Name: proportion, dtype: float64
    
    Processing column: ane
    Possible categories and their frequencies:
    ane
    no     0.849624
    yes    0.150376
    Name: proportion, dtype: float64
    
    Processing column: classification
    Possible categories and their frequencies:
    classification
    ckd       0.625
    notckd    0.375
    Name: proportion, dtype: float64
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 400 entries, 0 to 399
    Data columns (total 26 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              400 non-null    int64
     1   age             391 non-null    float64
     2   bp              388 non-null    float64
     3   sg              353 non-null    float64
     4   al              354 non-null    float64
     5   su              351 non-null    float64
     6   rbc             400 non-null    object
     7   pc              400 non-null    object
     8   pcc             400 non-null    object
     9   ba              400 non-null    object
     10  bgr             356 non-null    float64
     11  bu              381 non-null    float64
     12  sc              383 non-null    float64
     13  sod             313 non-null    float64
     14  pot             312 non-null    float64
     15  hemo            348 non-null    float64
     16  pcv             400 non-null    object
     17  wc              400 non-null    object
     18  rc              400 non-null    object
     19  htn             400 non-null    object
     20  dm              400 non-null    object
     21  cad             400 non-null    object
     22  appet           400 non-null    object
     23  pe              400 non-null    object
     24  ane             400 non-null    object
     25  classification  400 non-null    object
    dtypes: float64(11), int64(1), object(14)
    memory usage: 81.4+ KB
    Number of rows : 400
    Number of rows with at least one NaN value: 172
    43% of our rows have at least one missing value
  
    
  
 %% Cell type:code id: tags:
  
 ``` python
 # Example usage
 scale_normalize(df_kidney,numerical_columns)
 ```
  
 %% Output
  
    #######BEFORE SCALING AND NORMALIZING########
                   id         age          bp          sg          al          su  \
    count  400.000000  391.000000  388.000000  353.000000  354.000000  351.000000
    mean   199.500000   51.483376   76.469072    1.017408    1.016949    0.450142
    std    115.614301   17.169714   13.683637    0.005717    1.352679    1.099191
    min      0.000000    2.000000   50.000000    1.005000    0.000000    0.000000
    25%     99.750000   42.000000   70.000000    1.010000    0.000000    0.000000
    50%    199.500000   55.000000   80.000000    1.020000    0.000000    0.000000
    75%    299.250000   64.500000   80.000000    1.020000    2.000000    0.000000
    max    399.000000   90.000000  180.000000    1.025000    5.000000    5.000000
    
                  bgr          bu          sc         sod         pot        hemo
    count  356.000000  381.000000  383.000000  313.000000  312.000000  348.000000
    mean   148.036517   57.425722    3.072454  137.528754    4.627244   12.526437
    std     79.281714   50.503006    5.741126   10.408752    3.193904    2.912587
    min     22.000000    1.500000    0.400000    4.500000    2.500000    3.100000
    25%     99.000000   27.000000    0.900000  135.000000    3.800000   10.300000
    50%    121.000000   42.000000    1.300000  138.000000    4.400000   12.650000
    75%    163.000000   66.000000    2.800000  142.000000    4.900000   15.000000
    max    490.000000  391.000000   76.000000  163.000000   47.000000   17.800000
    #######AFTER SCALING AND NORMALIZING########
                     id           age            bp            sg            al  \
    count  4.000000e+02  3.910000e+02  3.880000e+02  3.530000e+02  3.540000e+02
    mean  -1.421085e-16  1.272071e-16  2.197555e-16  3.220590e-16  8.028731e-17
    std    1.001252e+00  1.001281e+00  1.001291e+00  1.001419e+00  1.001415e+00
    min   -1.727726e+00 -2.885708e+00 -1.936857e+00 -2.173584e+00 -7.528679e-01
    25%   -8.638630e-01 -5.530393e-01 -4.733701e-01 -1.297699e+00 -7.528679e-01
    50%   -9.540979e-17  2.050779e-01  2.583733e-01  4.540705e-01 -7.528679e-01
    75%    8.638630e-01  7.590867e-01  2.583733e-01  4.540705e-01  7.277723e-01
    max    1.727726e+00  2.246163e+00  7.575807e+00  1.329955e+00  2.948733e+00
    
                     su           bgr            bu            sc           sod  \
    count  3.510000e+02  3.560000e+02  3.810000e+02  3.830000e+02  3.130000e+02
    mean   2.024338e-17  1.596725e-16  5.594825e-17  1.855203e-17 -1.021547e-15
    std    1.001428e+00  1.001407e+00  1.001315e+00  1.001308e+00  1.001601e+00
    min   -4.101061e-01 -1.591967e+00 -1.108830e+00 -4.661019e-01 -1.280094e+01
    25%   -4.101061e-01 -6.193803e-01 -6.032459e-01 -3.788971e-01 -2.433340e-01
    50%   -4.101061e-01 -3.414983e-01 -3.058433e-01 -3.091332e-01  4.534651e-02
    75%   -4.101061e-01  1.890038e-01  1.700008e-01 -4.751867e-02  4.302539e-01
    max    4.145186e+00  4.319341e+00  6.613723e+00  1.271927e+01  2.451017e+00
    
                    pot          hemo
    count  3.120000e+02  3.480000e+02
    mean  -4.554761e-17 -2.858505e-16
    std    1.001606e+00  1.001440e+00
    min   -6.671023e-01 -3.241109e+00
    25%   -2.594231e-01 -7.655198e-01
    50%   -7.126345e-02  4.248496e-02
    75%    8.553625e-02  8.504897e-01
    max    1.328807e+01  1.813219e+00
  
 %% Cell type:code id: tags:
  
 ``` python
 nominal_columns = get_categorical_columns(df_kidney)
 df_kidney = convert_categorical_feats(df_kidney, nominal_columns)
 ```
  
 %% Output
  
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
  
 %% Cell type:code id: tags:
  
 ``` python
 fill_numerical_columns(df_kidney, skew_threshold=0.5)
 ```
  
 %% Output
  
    id done !
    age done !
    bp done !
    sg done !
    al done !
    su done !
    rbc done !
    pc done !
    pcc done !
    ba done !
    bgr done !
    bu done !
    sc done !
    sod done !
    pot done !
    hemo done !
    htn done !
    dm done !
    cad done !
    appet done !
    pe done !
    ane done !
    classification done !
    pcv_14 done !
    pcv_15 done !
    pcv_16 done !
    pcv_17 done !
    pcv_18 done !
    pcv_19 done !
    pcv_20 done !
    pcv_21 done !
    pcv_22 done !
    pcv_23 done !
    pcv_24 done !
    pcv_25 done !
    pcv_26 done !
    pcv_27 done !
    pcv_28 done !
    pcv_29 done !
    pcv_30 done !
    pcv_31 done !
    pcv_32 done !
    pcv_33 done !
    pcv_34 done !
    pcv_35 done !
    pcv_36 done !
    pcv_37 done !
    pcv_38 done !
    pcv_39 done !
    pcv_40 done !
    pcv_41 done !
    pcv_42 done !
    pcv_43 done !
    pcv_44 done !
    pcv_45 done !
    pcv_46 done !
    pcv_47 done !
    pcv_48 done !
    pcv_49 done !
    pcv_50 done !
    pcv_51 done !
    pcv_52 done !
    pcv_53 done !
    pcv_54 done !
    pcv_9 done !
    wc_10200 done !
    wc_10300 done !
    wc_10400 done !
    wc_10500 done !
    wc_10700 done !
    wc_10800 done !
    wc_10900 done !
    wc_11000 done !
    wc_11200 done !
    wc_11300 done !
    wc_11400 done !
    wc_11500 done !
    wc_11800 done !
    wc_11900 done !
    wc_12000 done !
    wc_12100 done !
    wc_12200 done !
    wc_12300 done !
    wc_12400 done !
    wc_12500 done !
    wc_12700 done !
    wc_12800 done !
    wc_13200 done !
    wc_13600 done !
    wc_14600 done !
    wc_14900 done !
    wc_15200 done !
    wc_15700 done !
    wc_16300 done !
    wc_16700 done !
    wc_18900 done !
    wc_19100 done !
    wc_21600 done !
    wc_2200 done !
    wc_2600 done !
    wc_26400 done !
    wc_3800 done !
    wc_4100 done !
    wc_4200 done !
    wc_4300 done !
    wc_4500 done !
    wc_4700 done !
    wc_4900 done !
    wc_5000 done !
    wc_5100 done !
    wc_5200 done !
    wc_5300 done !
    wc_5400 done !
    wc_5500 done !
    wc_5600 done !
    wc_5700 done !
    wc_5800 done !
    wc_5900 done !
    wc_6000 done !
    wc_6200 done !
    wc_6300 done !
    wc_6400 done !
    wc_6500 done !
    wc_6600 done !
    wc_6700 done !
    wc_6800 done !
    wc_6900 done !
    wc_7000 done !
    wc_7100 done !
    wc_7200 done !
    wc_7300 done !
    wc_7400 done !
    wc_7500 done !
    wc_7700 done !
    wc_7800 done !
    wc_7900 done !
    wc_8000 done !
    wc_8100 done !
    wc_8200 done !
    wc_8300 done !
    wc_8400 done !
    wc_8500 done !
    wc_8600 done !
    wc_8800 done !
    wc_9000 done !
    wc_9100 done !
    wc_9200 done !
    wc_9300 done !
    wc_9400 done !
    wc_9500 done !
    wc_9600 done !
    wc_9700 done !
    wc_9800 done !
    wc_9900 done !
    rc_2.1 done !
    rc_2.3 done !
    rc_2.4 done !
    rc_2.5 done !
    rc_2.6 done !
    rc_2.7 done !
    rc_2.8 done !
    rc_2.9 done !
    rc_3 done !
    rc_3.0 done !
    rc_3.1 done !
    rc_3.2 done !
    rc_3.3 done !
    rc_3.4 done !
    rc_3.5 done !
    rc_3.6 done !
    rc_3.7 done !
    rc_3.8 done !
    rc_3.9 done !
    rc_4 done !
    rc_4.0 done !
    rc_4.1 done !
    rc_4.2 done !
    rc_4.3 done !
    rc_4.4 done !
    rc_4.5 done !
    rc_4.6 done !
    rc_4.7 done !
    rc_4.8 done !
    rc_4.9 done !
    rc_5 done !
    rc_5.0 done !
    rc_5.1 done !
    rc_5.2 done !
    rc_5.3 done !
    rc_5.4 done !
    rc_5.5 done !
    rc_5.6 done !
    rc_5.7 done !
    rc_5.8 done !
    rc_5.9 done !
    rc_6.0 done !
    rc_6.1 done !
    rc_6.2 done !
    rc_6.3 done !
    rc_6.4 done !
    rc_6.5 done !
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.sample()
 ```
  
 %% Output
  
              id       age       bp        sg        al        su  rbc  pc  pcc  \
    84 -1.000262  0.438345 -0.47337 -1.297699  1.468092 -0.410106    1   0    0
    
        ba  ...  rc_5.7  rc_5.8  rc_5.9  rc_6.0  rc_6.1  rc_6.2  rc_6.3  rc_6.4  \
    84   0  ...       0       0       0       0       0       0       0       0
    
        rc_6.5  rc_8.0
    84       0       0
    
    [1 rows x 202 columns]
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.isna().sum()
 ```
  
 %% Output
  
    id        0
    age       0
    bp        0
    sg        0
    al        0
             ..
    rc_6.2    0
    rc_6.3    0
    rc_6.4    0
    rc_6.5    0
    rc_8.0    0
    Length: 202, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 ##
 df_kidney.sample()
 ```
  
 %% Output
  
               id       age        bp        sg        al        su  rbc  pc  pcc  \
    354  1.338013 -1.136206 -1.205114  1.329955 -0.752868 -0.410106    1   1    0
    
         ba  ...  rc_5.7  rc_5.8  rc_5.9  rc_6.0  rc_6.1  rc_6.2  rc_6.3  rc_6.4  \
    354   0  ...       0       0       0       0       0       0       0       0
    
         rc_6.5  rc_8.0
    354       0       0
    
    [1 rows x 202 columns]
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney['classification'].value_counts()
 ```
  
 %% Output
  
    classification
    0    250
    1    150
    Name: count, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 result_df, explainable_ratios= feature_selection(df_kidney, 'classification', threshold_variance_ratio=0.90)
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 result_df['classification'].value_counts()
 ```
  
 %% Output
  
    classification
    0    250
    1    150
    Name: count, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 X_train, X_test, y_train, y_test, cv = split(result_df, 'classification',alpha=0.2,n=5)
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 fig, axes = plt.subplots(1, 2, figsize=(10, 4))
 axes[0].scatter(result_df.loc[result_df['classification'] == 0, 'PCA1'], result_df.loc[result_df['classification'] == 0, 'PCA2'], color='red', label="CKD")
 axes[0].scatter(result_df.loc[result_df['classification'] == 1, 'PCA1'], result_df.loc[result_df['classification'] == 1, 'PCA2'], color='green', label="NOT CKD")
 axes[0].set_xlabel("PCA1: First component")
 axes[0].set_ylabel("PCA2: Second component")
 axes[0].legend()
  
 axes[1].plot(range(1,len(explainable_ratios)+1), explainable_ratios)
 axes[1].axhline(y=0.90, linestyle='--', color='red', label='Threshold Ratio')
 axes[1].set_xlabel('Number of components')
 axes[1].set_ylabel('Cumulative explained variance')
 axes[1].legend()  # display the 'Threshold Ratio' label set on the horizontal line
  
 plt.tight_layout()
 plt.subplots_adjust(wspace=0.4)
 plt.show()
  
 # The two classes are separable when the feature space is projected onto the first two principal components.
 ```
  
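 %% Output
  
    [figure: left, scatter of PCA1 vs PCA2 by class; right, cumulative explained variance vs number of components with the 0.90 threshold line]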
  
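The `feature_selection` helper is not shown in this diff, so the following is only a sketch of how a cumulative explained-variance threshold such as 0.90 could determine the number of components to keep. It assumes per-component ratios like those exposed by scikit-learn's `PCA.explained_variance_ratio_`; the function name `components_for_threshold` and the ratio values are made up for illustration.

```python
from itertools import accumulate


def components_for_threshold(variance_ratios, threshold=0.90):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches the threshold."""
    for count, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold:
            return count
    # Threshold never reached: keep every component.
    return len(variance_ratios)


# Hypothetical ratios, sorted in decreasing order as PCA would return them.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(components_for_threshold(ratios))  # 4
```

The running sums here are 0.5, 0.75, 0.875 and 0.9375, so the fourth component is the first to cross the 0.90 line drawn in the right-hand plot.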
 %% Cell type:code id: tags:
  
 ``` python
 dict_models = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'param_grid': {
            'n_estimators': [50, 100, 150, 200],
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 10, 20],
            'bootstrap': [True, False]
        }
    },
    'Logistic Regression': {
        'model': LogisticRegression(),
        'param_grid': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'fit_intercept': [True, False],
            'intercept_scaling': [1, 10, 100]
        }
    },
    'AdaBoostClassifier': {
        'model': AdaBoostClassifier(),
        'param_grid': {
            'n_estimators': [50, 100, 150, 200],
            'algorithm': ['SAMME', 'SAMME.R'],
            'learning_rate': [0.01, 0.1, 0.5, 1]
        }
    }
 }
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 display_results(dict_models, X_train, y_train, X_test, y_test, cv, 'f1 scoring on Kidney data(%)')
 ```
  
 %% Output
  
    Going through each model defined in the dictionnary...:   0%|          | 0/3 [00:00<?, ?it/s]
  
    Fitting 5 folds for each of 48 candidates, totalling 240 fits
  
    /Users/ilyaschahed/git/mini-projet-intro-ml/binary_classification_workflow.py:500: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
      df_results = pd.concat([df_results, pd.DataFrame([new_row])], ignore_index=True)
  

  
    Model: RandomForestClassifier
    Accuracy: 0.975
    Precision: 1.0
    Recall: 0.9333333333333333
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...:  33%|███▎      | 1/3 [00:46<01:32, 46.18s/it]
  
    Fitting 5 folds for each of 36 candidates, totalling 180 fits
  

  
    Model: Logistic Regression
    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...:  67%|██████▋   | 2/3 [00:48<00:20, 20.11s/it]
  
    Fitting 5 folds for each of 32 candidates, totalling 160 fits
  

  
    Model: AdaBoostClassifier
    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...: 100%|██████████| 3/3 [00:49<00:00, 16.64s/it]
  
    <pandas.io.formats.style.Styler at 0x12e47c450>
  
-%% Cell type:code id: tags:
+%% Cell type:markdown id: tags:
  
-``` python
-```
+# Good coding practices
+
+%% Cell type:markdown id: tags:
+
+Programming is not just about writing code that works; it is also about writing code that is maintainable, readable, and efficient. Good programming practices make code easier to understand, modify, and collaborate on. Below are the practices we tried to follow in our work.
+
+%% Cell type:markdown id: tags:
+
+### Code readability
+
+%% Cell type:markdown id: tags:
+
+- Use meaningful variable and function names: choose names that clearly convey the purpose of the variable or function.
+- Maintain a consistent line length: avoid lines longer than 80 to 120 characters (PEP 8 recommends a limit of 79).
+
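As an illustration of the two points above, here is a small sketch that contrasts a cryptic one-liner with a more readable equivalent. Both functions are hypothetical and are not taken from `binary_classification_workflow.py`.

```python
# Hard to read: cryptic names and one very long line.
def f(d, t):
    return round(sum(1 for r in d if any(v is None for v in r.values())) / len(d) * 100) if d else t


# Easier to read: descriptive names and short lines.
def percent_rows_with_missing(rows, default=0):
    """Return the rounded percentage of rows containing at least one None."""
    if not rows:
        return default
    incomplete = sum(
        1 for row in rows
        if any(value is None for value in row.values())
    )
    return round(incomplete / len(rows) * 100)


rows = [{"age": 48, "bp": 80}, {"age": None, "bp": 70}]
print(percent_rows_with_missing(rows))  # 50
```

Both versions compute the same value; the second can be understood without tracing the expression term by term.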
+%% Cell type:markdown id: tags:
+
+### Modularity
+
+%% Cell type:markdown id: tags:
+
+- Break code into functions or classes: divide your code into small units that each do one thing. This promotes reuse and makes the code easier to understand and test.
+
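A minimal sketch of this idea, assuming a simple "fill missing values with the most frequent one" step. The function names are illustrative, and this is not necessarily how the notebook's `fill_categorical_kidney` works.

```python
from collections import Counter


def most_frequent(values):
    """Return the most common non-missing value."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0]


def fill_missing(values, replacement):
    """Return a copy of values with every None replaced."""
    return [replacement if v is None else v for v in values]


def fill_with_mode(values):
    """Compose the two helpers above into one cleaning step."""
    return fill_missing(values, most_frequent(values))


appetite = ["good", None, "poor", "good"]
print(fill_with_mode(appetite))  # ['good', 'good', 'poor', 'good']
```

Each helper can be tested and reused on its own, for example `fill_missing` with a constant chosen by hand.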
+%% Cell type:markdown id: tags:
+
+### Comments and documentation
+
+%% Cell type:markdown id: tags:
+
+- Write clear comments: use comments to explain complex logic, assumptions, or any non-obvious aspects of your code.
+- Provide documentation: include docstrings that describe the purpose, parameters, and return values of functions and methods.
+
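The docstrings in `binary_classification_workflow.py` use a `Parameters` section underlined with dashes. A sketch in that spirit, using a hypothetical `standardize` function, could look like this:

```python
from statistics import mean, pstdev


def standardize(values):
    """
    Scale values to zero mean and unit variance.

    Parameters
    ----------
    values : list of float
        Observations to scale; must contain at least two distinct values.

    Returns
    -------
    list of float
        The standardized observations, in the same order.
    """
    center = mean(values)
    spread = pstdev(values)  # population standard deviation
    return [(v - center) / spread for v in values]


print(standardize([1.0, 2.0, 3.0]))
```

The inline comment records a choice that is not obvious from the call itself (population rather than sample standard deviation), which is the kind of detail comments are most useful for.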
+%% Cell type:markdown id: tags:
+
+### Other good practices
+
+%% Cell type:markdown id: tags:
+
+- Implement proper error handling: anticipate and handle exceptions gracefully to prevent unexpected crashes.
+- Use a version control system (e.g., Git): keep track of changes, collaborate with others, and revert to previous versions if needed.
+- Optimize when necessary: identify bottlenecks and optimize the critical sections of your code, but prefer readability to premature optimization.
+
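As a sketch of the error-handling point, the snippet below wraps a file read so that a missing dataset produces an explicit message instead of a bare traceback. The `load_rows` helper and the file path are assumptions for illustration; the notebook itself loads data with `pd.read_csv`.

```python
import csv


def load_rows(path):
    """Read a CSV file into a list of dicts, with a clear error if it is missing."""
    try:
        with open(path, newline="") as handle:
            return list(csv.DictReader(handle))
    except FileNotFoundError as error:
        # Re-raise with a message that tells the user what to check.
        raise FileNotFoundError(
            f"Dataset not found at {path!r}; check the working directory."
        ) from error


try:
    load_rows("./data/does_not_exist.csv")  # hypothetical missing file
except FileNotFoundError as error:
    print(error)
```

Only the exception that can be acted on is caught; other errors still propagate so that real bugs are not hidden.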
+%% Cell type:markdown id: tags:
+
+### Conclusion
+
+%% Cell type:markdown id: tags:
+
+Adhering to good programming practices is crucial for writing code that is not only functional but also maintainable, scalable, and easy to collaborate on. Following these practices helps produce high-quality software that stands the test of time. Writing code is not just about solving a problem; it is about solving it in a way that others can read, trust, and extend.

 %% Cell type:code id: tags:
  
 ``` python
 import pandas as pd
 import matplotlib.pyplot as plt
 from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
 from sklearn.linear_model import LogisticRegression
 from binary_classification_workflow import *
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney = pd.read_csv('./data/kidney_disease.csv')
 df_kidney.info()
  
 nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
 print(f"Number of rows : {len(df_kidney)}")
 print(f"Number of rows with at least one NAN value: {nan_count}")
 print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
      f" missing value")
 ```
  
 %% Output
  
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 400 entries, 0 to 399
    Data columns (total 26 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              400 non-null    int64
     1   age             391 non-null    float64
     2   bp              388 non-null    float64
     3   sg              353 non-null    float64
     4   al              354 non-null    float64
     5   su              351 non-null    float64
     6   rbc             248 non-null    object
     7   pc              335 non-null    object
     8   pcc             396 non-null    object
     9   ba              396 non-null    object
     10  bgr             356 non-null    float64
     11  bu              381 non-null    float64
     12  sc              383 non-null    float64
     13  sod             313 non-null    float64
     14  pot             312 non-null    float64
     15  hemo            348 non-null    float64
     16  pcv             330 non-null    object
     17  wc              295 non-null    object
     18  rc              270 non-null    object
     19  htn             398 non-null    object
     20  dm              398 non-null    object
     21  cad             398 non-null    object
     22  appet           399 non-null    object
     23  pe              399 non-null    object
     24  ane             399 non-null    object
     25  classification  400 non-null    object
    dtypes: float64(11), int64(1), object(14)
    memory usage: 81.4+ KB
    Number of rows : 400
    Number of rows with at least one NAN value: 242
    60% of our rows have at least one missing value
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.sample(5)
 ```
  
 %% Output
  
          id   age     bp     sg   al   su       rbc        pc         pcc  \
    367  367  68.0   60.0  1.025  0.0  0.0    normal    normal  notpresent
    100  100  34.0   70.0  1.015  4.0  0.0  abnormal  abnormal  notpresent
    67    67  45.0   80.0  1.020  3.0  0.0    normal  abnormal  notpresent
    76    76  48.0   80.0  1.005  4.0  0.0  abnormal  abnormal  notpresent
    133  133  70.0  100.0  1.015  4.0  0.0    normal    normal  notpresent
    
                 ba  ...  pcv      wc   rc  htn   dm  cad appet   pe ane  \
    367  notpresent  ...   50    6700  6.1   no   no   no  good   no  no
    100  notpresent  ...  NaN     NaN  NaN   no   no   no  good  yes  no
    67   notpresent  ...  NaN     NaN  NaN   no   no   no  poor   no  no
    76      present  ...   36  \t6200    4   no  yes   no  good  yes  no
    133  notpresent  ...   37  \t8400  8.0  yes   no   no  good   no  no
    
        classification
    367         notckd
    100            ckd
    67             ckd
    76             ckd
    133            ckd
    
    [5 rows x 26 columns]
  
 %% Cell type:code id: tags:
  
 ``` python
 numerical_columns = get_numerical_columns(df_kidney)
 nominal_columns = get_categorical_columns(df_kidney)
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 ##
 print(numerical_columns,
 nominal_columns)
 ```
  
 %% Output
  
    ['id', 'age', 'bp', 'sg', 'al', 'su', 'bgr', 'bu', 'sc', 'sod', 'pot', 'hemo'] ['rbc', 'pc', 'pcc', 'ba', 'pcv', 'wc', 'rc', 'htn', 'dm', 'cad', 'appet', 'pe', 'ane', 'classification']
  
 %% Cell type:code id: tags:
  
 ``` python
 # visualise_numerical_data(df_kidney)
 visualise_numerical_data(df_kidney,columns=numerical_columns)
 ```
  
 %% Output
  
  
 %% Cell type:code id: tags:
  
 ``` python
 fill_categorical_kidney(df_kidney,nominal_columns)
 df_kidney.info()
 nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
 print(f"Number of rows : {len(df_kidney)}")
 print(f"Number of rows with at least one NaN value: {nan_count}")
 print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
      f" missing value")
 ```
  
 %% Output
  
    Going through each categorical feature...: 100%|██████████| 14/14 [00:00<00:00, 353.70it/s]
  
    
    Processing column: rbc
    Possible categories and their frequencies:
    rbc
    normal      0.810484
    abnormal    0.189516
    Name: proportion, dtype: float64
    
    Processing column: pc
    Possible categories and their frequencies:
    pc
    normal      0.773134
    abnormal    0.226866
    Name: proportion, dtype: float64
    
    Processing column: pcc
    Possible categories and their frequencies:
    pcc
    notpresent    0.893939
    present       0.106061
    Name: proportion, dtype: float64
    
    Processing column: ba
    Possible categories and their frequencies:
    ba
    notpresent    0.944444
    present       0.055556
    Name: proportion, dtype: float64
    
    Processing column: pcv
    Possible categories and their frequencies:
    pcv
    41    0.063830
    52    0.063830
    44    0.057751
    48    0.057751
    40    0.048632
    43    0.045593
    42    0.039514
    45    0.039514
    32    0.036474
    50    0.036474
    36    0.036474
    33    0.036474
    28    0.036474
    34    0.033435
    37    0.033435
    30    0.027356
    29    0.027356
    35    0.027356
    46    0.027356
    31    0.024316
    24    0.021277
    39    0.021277
    26    0.018237
    38    0.015198
    53    0.012158
    51    0.012158
    49    0.012158
    47    0.012158
    54    0.012158
    25    0.009119
    27    0.009119
    22    0.009119
    19    0.006079
    23    0.006079
    15    0.003040
    21    0.003040
    20    0.003040
    17    0.003040
    9     0.003040
    18    0.003040
    14    0.003040
    16    0.003040
    Name: proportion, dtype: float64
    
    Processing column: wc
    Possible categories and their frequencies:
    wc
    9800     0.037415
    6700     0.034014
    9600     0.030612
    7200     0.030612
    9200     0.030612
               ...
    19100    0.003401
    12300    0.003401
    16700    0.003401
    14900    0.003401
    2600     0.003401
    Name: proportion, Length: 89, dtype: float64
    
    Processing column: rc
    Possible categories and their frequencies:
    rc
    5.2    0.066914
    4.5    0.059480
    4.9    0.052045
    4.7    0.040892
    4.8    0.037175
    3.9    0.037175
    4.6    0.033457
    3.4    0.033457
    5.9    0.029740
    5.5    0.029740
    6.1    0.029740
    5.0    0.029740
    3.7    0.029740
    5.3    0.026022
    5.8    0.026022
    5.4    0.026022
    3.8    0.026022
    5.6    0.022305
    4.3    0.022305
    4.2    0.022305
    3.2    0.018587
    4.4    0.018587
    5.7    0.018587
    6.4    0.018587
    5.1    0.018587
    6.2    0.018587
    6.5    0.018587
    4.1    0.018587
    3.6    0.014870
    6.0    0.014870
    6.3    0.014870
    4.0    0.011152
    3.5    0.011152
    3.3    0.011152
    4      0.011152
    5      0.007435
    3.1    0.007435
    2.6    0.007435
    2.1    0.007435
    2.9    0.007435
    2.5    0.007435
    3.0    0.007435
    2.7    0.007435
    2.8    0.007435
    2.3    0.003717
    2.4    0.003717
    3      0.003717
    8.0    0.003717
    Name: proportion, dtype: float64
    
    Processing column: htn
    Possible categories and their frequencies:
    htn
    no     0.630653
    yes    0.369347
    Name: proportion, dtype: float64
    
    Processing column: dm
    Possible categories and their frequencies:
    dm
    no     0.655779
    yes    0.344221
    Name: proportion, dtype: float64
    
    Processing column: cad
    Possible categories and their frequencies:
    cad
    no     0.914573
    yes    0.085427
    Name: proportion, dtype: float64
    
    Processing column: appet
    Possible categories and their frequencies:
    appet
    good    0.794486
    poor    0.205514
    Name: proportion, dtype: float64
    
    Processing column: pe
    Possible categories and their frequencies:
    pe
    no     0.809524
    yes    0.190476
    Name: proportion, dtype: float64
    
    Processing column: ane
    Possible categories and their frequencies:
    ane
    no     0.849624
    yes    0.150376
    Name: proportion, dtype: float64
    
    Processing column: classification
    Possible categories and their frequencies:
    classification
    ckd       0.625
    notckd    0.375
    Name: proportion, dtype: float64
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 400 entries, 0 to 399
    Data columns (total 26 columns):
     #   Column          Non-Null Count  Dtype
    ---  ------          --------------  -----
     0   id              400 non-null    int64
     1   age             391 non-null    float64
     2   bp              388 non-null    float64
     3   sg              353 non-null    float64
     4   al              354 non-null    float64
     5   su              351 non-null    float64
     6   rbc             400 non-null    object
     7   pc              400 non-null    object
     8   pcc             400 non-null    object
     9   ba              400 non-null    object
     10  bgr             356 non-null    float64
     11  bu              381 non-null    float64
     12  sc              383 non-null    float64
     13  sod             313 non-null    float64
     14  pot             312 non-null    float64
     15  hemo            348 non-null    float64
     16  pcv             400 non-null    object
     17  wc              400 non-null    object
     18  rc              400 non-null    object
     19  htn             400 non-null    object
     20  dm              400 non-null    object
     21  cad             400 non-null    object
     22  appet           400 non-null    object
     23  pe              400 non-null    object
     24  ane             400 non-null    object
     25  classification  400 non-null    object
    dtypes: float64(11), int64(1), object(14)
    memory usage: 81.4+ KB
    Number of rows : 400
    Number of rows with at least one NaN value: 172
    43% of our rows have at least one missing value
  
    
  
 %% Cell type:code id: tags:
  
 ``` python
 # Example usage
 scale_normalize(df_kidney,numerical_columns)
 ```
  
 %% Output
  
    #######BEFORE SCALING AND NORMALIZING########
                   id         age          bp          sg          al          su  \
    count  400.000000  391.000000  388.000000  353.000000  354.000000  351.000000
    mean   199.500000   51.483376   76.469072    1.017408    1.016949    0.450142
    std    115.614301   17.169714   13.683637    0.005717    1.352679    1.099191
    min      0.000000    2.000000   50.000000    1.005000    0.000000    0.000000
    25%     99.750000   42.000000   70.000000    1.010000    0.000000    0.000000
    50%    199.500000   55.000000   80.000000    1.020000    0.000000    0.000000
    75%    299.250000   64.500000   80.000000    1.020000    2.000000    0.000000
    max    399.000000   90.000000  180.000000    1.025000    5.000000    5.000000
    
                  bgr          bu          sc         sod         pot        hemo
    count  356.000000  381.000000  383.000000  313.000000  312.000000  348.000000
    mean   148.036517   57.425722    3.072454  137.528754    4.627244   12.526437
    std     79.281714   50.503006    5.741126   10.408752    3.193904    2.912587
    min     22.000000    1.500000    0.400000    4.500000    2.500000    3.100000
    25%     99.000000   27.000000    0.900000  135.000000    3.800000   10.300000
    50%    121.000000   42.000000    1.300000  138.000000    4.400000   12.650000
    75%    163.000000   66.000000    2.800000  142.000000    4.900000   15.000000
    max    490.000000  391.000000   76.000000  163.000000   47.000000   17.800000
    #######AFTER SCALING AND NORMALIZING########
                     id           age            bp            sg            al  \
    count  4.000000e+02  3.910000e+02  3.880000e+02  3.530000e+02  3.540000e+02
    mean  -1.421085e-16  1.272071e-16  2.197555e-16  3.220590e-16  8.028731e-17
    std    1.001252e+00  1.001281e+00  1.001291e+00  1.001419e+00  1.001415e+00
    min   -1.727726e+00 -2.885708e+00 -1.936857e+00 -2.173584e+00 -7.528679e-01
    25%   -8.638630e-01 -5.530393e-01 -4.733701e-01 -1.297699e+00 -7.528679e-01
    50%   -9.540979e-17  2.050779e-01  2.583733e-01  4.540705e-01 -7.528679e-01
    75%    8.638630e-01  7.590867e-01  2.583733e-01  4.540705e-01  7.277723e-01
    max    1.727726e+00  2.246163e+00  7.575807e+00  1.329955e+00  2.948733e+00
    
                     su           bgr            bu            sc           sod  \
    count  3.510000e+02  3.560000e+02  3.810000e+02  3.830000e+02  3.130000e+02
    mean   2.024338e-17  1.596725e-16  5.594825e-17  1.855203e-17 -1.021547e-15
    std    1.001428e+00  1.001407e+00  1.001315e+00  1.001308e+00  1.001601e+00
    min   -4.101061e-01 -1.591967e+00 -1.108830e+00 -4.661019e-01 -1.280094e+01
    25%   -4.101061e-01 -6.193803e-01 -6.032459e-01 -3.788971e-01 -2.433340e-01
    50%   -4.101061e-01 -3.414983e-01 -3.058433e-01 -3.091332e-01  4.534651e-02
    75%   -4.101061e-01  1.890038e-01  1.700008e-01 -4.751867e-02  4.302539e-01
    max    4.145186e+00  4.319341e+00  6.613723e+00  1.271927e+01  2.451017e+00
    
                    pot          hemo
    count  3.120000e+02  3.480000e+02
    mean  -4.554761e-17 -2.858505e-16
    std    1.001606e+00  1.001440e+00
    min   -6.671023e-01 -3.241109e+00
    25%   -2.594231e-01 -7.655198e-01
    50%   -7.126345e-02  4.248496e-02
    75%    8.553625e-02  8.504897e-01
    max    1.328807e+01  1.813219e+00
  
 %% Cell type:code id: tags:
  
 ``` python
 nominal_columns = get_categorical_columns(df_kidney)
 df_kidney = convert_categorical_feats(df_kidney, nominal_columns)
 ```
  
 %% Output
  
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
    /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/sklearn/preprocessing/_encoders.py:972: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
      warnings.warn(
  
 %% Cell type:code id: tags:
  
 ``` python
 fill_numerical_columns(df_kidney, skew_threshold=0.5)
 ```
  
 %% Output
  
    id done !
    age done !
    bp done !
    sg done !
    al done !
    su done !
    rbc done !
    pc done !
    pcc done !
    ba done !
    bgr done !
    bu done !
    sc done !
    sod done !
    pot done !
    hemo done !
    htn done !
    dm done !
    cad done !
    appet done !
    pe done !
    ane done !
    classification done !
    pcv_14 done !
    pcv_15 done !
    pcv_16 done !
    pcv_17 done !
    pcv_18 done !
    pcv_19 done !
    pcv_20 done !
    pcv_21 done !
    pcv_22 done !
    pcv_23 done !
    pcv_24 done !
    pcv_25 done !
    pcv_26 done !
    pcv_27 done !
    pcv_28 done !
    pcv_29 done !
    pcv_30 done !
    pcv_31 done !
    pcv_32 done !
    pcv_33 done !
    pcv_34 done !
    pcv_35 done !
    pcv_36 done !
    pcv_37 done !
    pcv_38 done !
    pcv_39 done !
    pcv_40 done !
    pcv_41 done !
    pcv_42 done !
    pcv_43 done !
    pcv_44 done !
    pcv_45 done !
    pcv_46 done !
    pcv_47 done !
    pcv_48 done !
    pcv_49 done !
    pcv_50 done !
    pcv_51 done !
    pcv_52 done !
    pcv_53 done !
    pcv_54 done !
    pcv_9 done !
    wc_10200 done !
    wc_10300 done !
    wc_10400 done !
    wc_10500 done !
    wc_10700 done !
    wc_10800 done !
    wc_10900 done !
    wc_11000 done !
    wc_11200 done !
    wc_11300 done !
    wc_11400 done !
    wc_11500 done !
    wc_11800 done !
    wc_11900 done !
    wc_12000 done !
    wc_12100 done !
    wc_12200 done !
    wc_12300 done !
    wc_12400 done !
    wc_12500 done !
    wc_12700 done !
    wc_12800 done !
    wc_13200 done !
    wc_13600 done !
    wc_14600 done !
    wc_14900 done !
    wc_15200 done !
    wc_15700 done !
    wc_16300 done !
    wc_16700 done !
    wc_18900 done !
    wc_19100 done !
    wc_21600 done !
    wc_2200 done !
    wc_2600 done !
    wc_26400 done !
    wc_3800 done !
    wc_4100 done !
    wc_4200 done !
    wc_4300 done !
    wc_4500 done !
    wc_4700 done !
    wc_4900 done !
    wc_5000 done !
    wc_5100 done !
    wc_5200 done !
    wc_5300 done !
    wc_5400 done !
    wc_5500 done !
    wc_5600 done !
    wc_5700 done !
    wc_5800 done !
    wc_5900 done !
    wc_6000 done !
    wc_6200 done !
    wc_6300 done !
    wc_6400 done !
    wc_6500 done !
    wc_6600 done !
    wc_6700 done !
    wc_6800 done !
    wc_6900 done !
    wc_7000 done !
    wc_7100 done !
    wc_7200 done !
    wc_7300 done !
    wc_7400 done !
    wc_7500 done !
    wc_7700 done !
    wc_7800 done !
    wc_7900 done !
    wc_8000 done !
    wc_8100 done !
    wc_8200 done !
    wc_8300 done !
    wc_8400 done !
    wc_8500 done !
    wc_8600 done !
    wc_8800 done !
    wc_9000 done !
    wc_9100 done !
    wc_9200 done !
    wc_9300 done !
    wc_9400 done !
    wc_9500 done !
    wc_9600 done !
    wc_9700 done !
    wc_9800 done !
    wc_9900 done !
    rc_2.1 done !
    rc_2.3 done !
    rc_2.4 done !
    rc_2.5 done !
    rc_2.6 done !
    rc_2.7 done !
    rc_2.8 done !
    rc_2.9 done !
    rc_3 done !
    rc_3.0 done !
    rc_3.1 done !
    rc_3.2 done !
    rc_3.3 done !
    rc_3.4 done !
    rc_3.5 done !
    rc_3.6 done !
    rc_3.7 done !
    rc_3.8 done !
    rc_3.9 done !
    rc_4 done !
    rc_4.0 done !
    rc_4.1 done !
    rc_4.2 done !
    rc_4.3 done !
    rc_4.4 done !
    rc_4.5 done !
    rc_4.6 done !
    rc_4.7 done !
    rc_4.8 done !
    rc_4.9 done !
    rc_5 done !
    rc_5.0 done !
    rc_5.1 done !
    rc_5.2 done !
    rc_5.3 done !
    rc_5.4 done !
    rc_5.5 done !
    rc_5.6 done !
    rc_5.7 done !
    rc_5.8 done !
    rc_5.9 done !
    rc_6.0 done !
    rc_6.1 done !
    rc_6.2 done !
    rc_6.3 done !
    rc_6.4 done !
    rc_6.5 done !
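  
 %% Cell type:markdown id: tags:
  
 The body of `fill_numerical_columns` is not shown in this diff. One common reading of a `skew_threshold` parameter, offered here only as an assumption, is to impute with the median when a column's absolute skewness exceeds the threshold and with the mean otherwise. The function name `fill_by_skewness` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd


def fill_by_skewness(df, columns, skew_threshold=0.5):
    """Fill NaNs in place: median for skewed columns, mean for the others."""
    for col in columns:
        if df[col].isna().sum() == 0:
            continue  # nothing to impute in this column
        skewed = abs(df[col].skew()) > skew_threshold
        fill_value = df[col].median() if skewed else df[col].mean()
        df[col] = df[col].fillna(fill_value)
    return df


toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 1.0, 2.0, 50.0, np.nan],
})
fill_by_skewness(toy, ["symmetric", "skewed"])
print(toy)
```

 The median is less sensitive to extreme values, which is why it is often preferred for skewed columns.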
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.sample()
 ```
  
 %% Output
  
              id       age       bp        sg        al        su  rbc  pc  pcc  \
    84 -1.000262  0.438345 -0.47337 -1.297699  1.468092 -0.410106    1   0    0
    
        ba  ...  rc_5.7  rc_5.8  rc_5.9  rc_6.0  rc_6.1  rc_6.2  rc_6.3  rc_6.4  \
    84   0  ...       0       0       0       0       0       0       0       0
    
        rc_6.5  rc_8.0
    84       0       0
    
    [1 rows x 202 columns]
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney.isna().sum()
 ```
  
 %% Output
  
    id        0
    age       0
    bp        0
    sg        0
    al        0
             ..
    rc_6.2    0
    rc_6.3    0
    rc_6.4    0
    rc_6.5    0
    rc_8.0    0
    Length: 202, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 ##
 df_kidney.sample()
 ```
  
 %% Output
  
               id       age        bp        sg        al        su  rbc  pc  pcc  \
    354  1.338013 -1.136206 -1.205114  1.329955 -0.752868 -0.410106    1   1    0
    
         ba  ...  rc_5.7  rc_5.8  rc_5.9  rc_6.0  rc_6.1  rc_6.2  rc_6.3  rc_6.4  \
    354   0  ...       0       0       0       0       0       0       0       0
    
         rc_6.5  rc_8.0
    354       0       0
    
    [1 rows x 202 columns]
  
 %% Cell type:code id: tags:
  
 ``` python
 df_kidney['classification'].value_counts()
 ```
  
 %% Output
  
    classification
    0    250
    1    150
    Name: count, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 result_df, explainable_ratios= feature_selection(df_kidney, 'classification', threshold_variance_ratio=0.90)
 ```
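  
 %% Cell type:markdown id: tags:
  
 Judging from the `PCA1`/`PCA2` columns and the cumulative-variance plot used later in the notebook, `feature_selection` appears to apply PCA and keep enough components to reach `threshold_variance_ratio`. Its code is not shown, so the following is only a sketch of that idea in NumPy. The name `components_for_variance` and the synthetic data are hypothetical:

```python
import numpy as np


def components_for_variance(X, threshold=0.90):
    """Return (k, cumulative) where k is the smallest number of principal
    components whose cumulative explained-variance ratio reaches threshold."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component (in descending order).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values ** 2
    cumulative = np.cumsum(explained) / explained.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative


rng = np.random.default_rng(0)
# Three features: one dominant direction plus two low-variance ones.
X = rng.normal(size=(200, 3)) * np.array([10.0, 1.0, 0.1])
k, cumulative = components_for_variance(X, threshold=0.90)
print(k, cumulative.round(3))
```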
  
 %% Cell type:code id: tags:
  
 ``` python
 result_df['classification'].value_counts()
 ```
  
 %% Output
  
    classification
    0    250
    1    150
    Name: count, dtype: int64
  
 %% Cell type:code id: tags:
  
 ``` python
 X_train, X_test, y_train, y_test, cv = split(result_df, 'classification',alpha=0.2,n=5)
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 fig, axes = plt.subplots(1, 2, figsize=(10, 4))
 # Class 0 is CKD (250 rows) and class 1 is NOT CKD (150 rows)
 is_ckd = result_df['classification'] == 0
 axes[0].scatter(result_df.loc[is_ckd, 'PCA1'], result_df.loc[is_ckd, 'PCA2'],
                 color='red', label="CKD")
 axes[0].scatter(result_df.loc[~is_ckd, 'PCA1'], result_df.loc[~is_ckd, 'PCA2'],
                 color='green', label="NOT CKD")
 axes[0].set_xlabel("PCA1: First component")
 axes[0].set_ylabel("PCA2: Second component")
 axes[0].legend()
  
 axes[1].plot(range(1,len(explainable_ratios)+1), explainable_ratios)
 axes[1].axhline(y=0.90, linestyle='--', color='red', label='Threshold Ratio')
 axes[1].set_xlabel('Number of components')
 axes[1].set_ylabel('Cumulative explained variance')
 axes[1].legend()
  
 plt.tight_layout()
 plt.subplots_adjust(wspace=0.4)
 plt.show()
  
 # The two classes are distinguishable when the feature space is projected
 # onto the first two principal components.
 ```
  
 %% Output
  

  
 %% Cell type:code id: tags:
  
 ``` python
 dict_models = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'param_grid': {
            'n_estimators': [50, 100, 150, 200],
            'criterion': ['gini', 'entropy'],
            'max_depth': [None, 10, 20],
            'bootstrap': [True, False]
        }
    },
    'Logistic Regression': {
        'model': LogisticRegression(),
        'param_grid': {
            'C': [0.001, 0.01, 0.1, 1, 10, 100],
            'fit_intercept': [True, False],
            'intercept_scaling': [1, 10, 100]
        }
    },
    'AdaBoostClassifier': {
        'model': AdaBoostClassifier(),
        'param_grid': {
            'n_estimators': [50, 100, 150, 200],
            'algorithm': ['SAMME', 'SAMME.R'],
            'learning_rate': [0.01, 0.1, 0.5, 1]
        }
    }
 }
 ```
  
 %% Cell type:code id: tags:
  
 ``` python
 display_results(dict_models, X_train, y_train, X_test, y_test, cv, 'f1 scoring on Kidney data(%)')
 ```
  
 %% Output
  
    Going through each model defined in the dictionnary...:   0%|          | 0/3 [00:00<?, ?it/s]
  
    Fitting 5 folds for each of 48 candidates, totalling 240 fits
  
    /Users/ilyaschahed/git/mini-projet-intro-ml/binary_classification_workflow.py:500: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
      df_results = pd.concat([df_results, pd.DataFrame([new_row])], ignore_index=True)
  

  
    Model: RandomForestClassifier
    Accuracy: 0.975
    Precision: 1.0
    Recall: 0.9333333333333333
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...:  33%|███▎      | 1/3 [00:46<01:32, 46.18s/it]
  
    Fitting 5 folds for each of 36 candidates, totalling 180 fits
  

  
    Model: Logistic Regression
    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...:  67%|██████▋   | 2/3 [00:48<00:20, 20.11s/it]
  
    Fitting 5 folds for each of 32 candidates, totalling 160 fits
  

  
    Model: AdaBoostClassifier
    Accuracy: 1.0
    Precision: 1.0
    Recall: 1.0
    ROC-AUC: 1.0
    
    
  

  
    Going through each model defined in the dictionnary...: 100%|██████████| 3/3 [00:49<00:00, 16.64s/it]
  
    <pandas.io.formats.style.Styler at 0x12e47c450>
  
-%% Cell type:code id: tags:
+%% Cell type:markdown id: tags:
  
-``` python
-```
+# Good coding practices
+
+%% Cell type:markdown id: tags:
+
+Programming is not just about writing code that works; it's also about writing code that is maintainable, readable, and efficient. Good programming practices contribute to the overall quality of code, making it easier to understand, modify, and collaborate on. Here are some essential good programming practices that we tried to follow in our work.
+
+%% Cell type:markdown id: tags:
+
+### Code readability
+
+%% Cell type:markdown id: tags:
+
+- Use meaningful variable and function names: Choose names that clearly convey the purpose of the variable or function.
+- Keep line length consistent: Avoid lines longer than 80 to 120 characters.
+
+%% Cell type:markdown id: tags:
+
+### Modularity
+
+%% Cell type:markdown id: tags:
+
+- Break code into functions or classes: Divide your code into smaller, reusable modules. This promotes code reuse and makes it easier to understand.
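+
+As an illustration, the notebook prints the share of rows with missing values in two separate cells. A sketch of how that repeated block could be factored into one reusable function; the name `report_missing_rows` and the toy data are hypothetical:
+

```python
import pandas as pd


def report_missing_rows(df):
    """Print and return the number of rows with at least one NaN value."""
    nan_count = int(df.isna().any(axis=1).sum())
    percentage = round(nan_count / len(df) * 100) if len(df) else 0
    print(f"Number of rows : {len(df)}")
    print(f"Number of rows with at least one NaN value: {nan_count}")
    print(f"{percentage}% of our rows have at least one missing value")
    return nan_count


toy = pd.DataFrame({
    "age": [48.0, None, 62.0, 51.0],
    "bp": [80.0, 70.0, None, 60.0],
})
missing = report_missing_rows(toy)
```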
+
+%% Cell type:markdown id: tags:
+
+### Comments and documentation
+
+%% Cell type:markdown id: tags:
+
+- Write clear comments: Use comments to explain complex logic, assumptions, or any non-obvious aspects of your code.
+- Provide documentation: Include docstrings to describe the purpose, parameters, and return values of functions or methods.
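+
+A short sketch of a documented function, using the same `Parameters:` docstring layout as `binary_classification_workflow.py`. The function `missing_percentage` is hypothetical and exists only to show the layout:
+

```python
def missing_percentage(missing_rows, total_rows):
    """
    Compute the percentage of rows that contain missing values.

    Parameters:
    ----------
    missing_rows : int
        Number of rows with at least one missing value.
    total_rows : int
        Total number of rows; must be positive.

    Returns:
    -------
    float
        Percentage of affected rows, between 0 and 100.
    """
    return missing_rows / total_rows * 100


# The docstring is available at runtime through help() or __doc__.
print(missing_percentage.__doc__)
print(missing_percentage(242, 400))
```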
+
+%% Cell type:markdown id: tags:
+
+### Other good practices
+
+%% Cell type:markdown id: tags:
+
+- Implement proper error handling: Anticipate and handle exceptions gracefully to prevent unexpected crashes.
+- Use version control systems (e.g., Git): Keep track of changes, collaborate with others, and easily revert to previous versions if needed.
+- Optimize when necessary: Identify bottlenecks and optimize critical sections of your code. However, prioritize readability over premature optimization.
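+
+A minimal sketch of the error-handling point: validating arguments early and raising an explicit message. It assumes that `alpha` in `split(df, target, alpha=0.2, n=5)` is the test-set fraction and `n` the number of cross-validation folds; the helper `check_split_arguments` is hypothetical:
+

```python
def check_split_arguments(alpha, n):
    """Validate a test-set fraction and a fold count before splitting.

    Raises ValueError with an explicit message instead of letting the
    splitting code fail later with a less helpful error.
    """
    if not 0 < alpha < 1:
        raise ValueError(f"alpha must be strictly between 0 and 1, got {alpha}")
    if not isinstance(n, int) or n < 2:
        raise ValueError(f"n must be an integer >= 2, got {n!r}")


check_split_arguments(alpha=0.2, n=5)  # valid arguments pass silently

try:
    check_split_arguments(alpha=1.5, n=5)
except ValueError as error:
    print(f"Invalid arguments: {error}")
```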
+
+%% Cell type:markdown id: tags:
+
+### Conclusion
+
+%% Cell type:markdown id: tags:
+
+Adhering to good programming practices is crucial for writing code that is not only functional but also maintainable, scalable, and collaborative. By following these practices, you contribute to the creation of high-quality software that stands the test of time. Remember, writing code is not just about solving a problem; it's about solving it in the best possible way.