Commit 89cda39c authored by AOUAD Mohamed, Jad's avatar AOUAD Mohamed, Jad

Corrected some comments in the .py and added a section about good practices for coding in the .ipynb
parent 31b9e72b
@@ -3,7 +3,7 @@
 Spyder Editor
 This file contains the preprocessing functions needed to clean
-and prepare the data. We first consider the data related to kidney diseases.
+and prepare the data.
 """
 import seaborn as sns
@@ -19,6 +19,7 @@ from sklearn.metrics import f1_score
"""
kideney data
data description : 25 features ( 11 numeric ,14 nominal)
Numerical Data (11):
1. age: Age in years
@@ -395,7 +396,8 @@ def split(df, target,alpha=0.2,n=5):
 def convert_categorical_feats(df, categorical_cols):
     """
-    Encode the categorical features of the dataset using OrdinalEncoder and OneHotEncoder.
+    Encode the categorical features of the dataset using OrdinalEncoder
+    and OneHotEncoder.
     Parameters:
     ----------
......
%% Cell type:code id: tags:
``` python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from binary_classification_workflow import *
```
%% Cell type:code id: tags:
``` python
df_kidney = pd.read_csv('./data/kidney_disease.csv')
df_kidney.info()
nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
print(f"Number of rows : {len(df_kidney)}")
print(f"Number of rows with at least one NAN value: {nan_count}")
print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
f" missing value")
```
%% Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 400 non-null int64
1 age 391 non-null float64
2 bp 388 non-null float64
3 sg 353 non-null float64
4 al 354 non-null float64
5 su 351 non-null float64
6 rbc 248 non-null object
7 pc 335 non-null object
8 pcc 396 non-null object
9 ba 396 non-null object
10 bgr 356 non-null float64
11 bu 381 non-null float64
12 sc 383 non-null float64
13 sod 313 non-null float64
14 pot 312 non-null float64
15 hemo 348 non-null float64
16 pcv 330 non-null object
17 wc 295 non-null object
18 rc 270 non-null object
19 htn 398 non-null object
20 dm 398 non-null object
21 cad 398 non-null object
22 appet 399 non-null object
23 pe 399 non-null object
24 ane 399 non-null object
25 classification 400 non-null object
dtypes: float64(11), int64(1), object(14)
memory usage: 81.4+ KB
Number of rows : 400
Number of rows with at least one NAN value: 242
60% of our rows have at least one missing value
%% Cell type:code id: tags:
``` python
numerical_columns = get_numerical_columns(df_kidney)
nominal_columns = get_categorical_columns(df_kidney)
```
%% Cell type:code id: tags:
``` python
# visualise_numerical_data(df_kidney)
visualise_numerical_data(df_kidney,columns=numerical_columns)
```
%% Output
%% Cell type:code id: tags:
``` python
fill_categorical_kidney(df_kidney,nominal_columns)
df_kidney.info()
nan_count = df_kidney[df_kidney.isna().any(axis=1)].shape[0]
print(f"Number of rows : {len(df_kidney)}")
print(f"Number of rows with at least one NaN value: {nan_count}")
print(f"{round(nan_count/len(df_kidney) * 100)}% of our rows have at least one"
f" missing value")
```
%% Output
Going through each categorical feature...: 100%|██████████| 14/14 [00:00<00:00, 693.31it/s]
Processing column: rbc
Possible categories and their frequencies:
rbc
normal 0.810484
abnormal 0.189516
Name: proportion, dtype: float64
Processing column: pc
Possible categories and their frequencies:
pc
normal 0.773134
abnormal 0.226866
Name: proportion, dtype: float64
Processing column: pcc
Possible categories and their frequencies:
pcc
notpresent 0.893939
present 0.106061
Name: proportion, dtype: float64
Processing column: ba
Possible categories and their frequencies:
ba
notpresent 0.944444
present 0.055556
Name: proportion, dtype: float64
Processing column: pcv
Possible categories and their frequencies:
pcv
41 0.063830
52 0.063830
44 0.057751
48 0.057751
40 0.048632
43 0.045593
42 0.039514
45 0.039514
32 0.036474
50 0.036474
36 0.036474
33 0.036474
28 0.036474
34 0.033435
37 0.033435
30 0.027356
29 0.027356
35 0.027356
46 0.027356
31 0.024316
24 0.021277
39 0.021277
26 0.018237
38 0.015198
53 0.012158
51 0.012158
49 0.012158
47 0.012158
54 0.012158
25 0.009119
27 0.009119
22 0.009119
19 0.006079
23 0.006079
15 0.003040
21 0.003040
20 0.003040
17 0.003040
9 0.003040
18 0.003040
14 0.003040
16 0.003040
Name: proportion, dtype: float64
Processing column: wc
Possible categories and their frequencies:
wc
9800 0.037415
6700 0.034014
9600 0.030612
7200 0.030612
9200 0.030612
...
19100 0.003401
12300 0.003401
16700 0.003401
14900 0.003401
2600 0.003401
Name: proportion, Length: 89, dtype: float64
Processing column: rc
Possible categories and their frequencies:
rc
5.2 0.066914
4.5 0.059480
4.9 0.052045
4.7 0.040892
4.8 0.037175
3.9 0.037175
4.6 0.033457
3.4 0.033457
5.9 0.029740
5.5 0.029740
6.1 0.029740
5.0 0.029740
3.7 0.029740
5.3 0.026022
5.8 0.026022
5.4 0.026022
3.8 0.026022
5.6 0.022305
4.3 0.022305
4.2 0.022305
3.2 0.018587
4.4 0.018587
5.7 0.018587
6.4 0.018587
5.1 0.018587
6.2 0.018587
6.5 0.018587
4.1 0.018587
3.6 0.014870
6.0 0.014870
6.3 0.014870
4.0 0.011152
3.5 0.011152
3.3 0.011152
4 0.011152
5 0.007435
3.1 0.007435
2.6 0.007435
2.1 0.007435
2.9 0.007435
2.5 0.007435
3.0 0.007435
2.7 0.007435
2.8 0.007435
2.3 0.003717
2.4 0.003717
3 0.003717
8.0 0.003717
Name: proportion, dtype: float64
Processing column: htn
Possible categories and their frequencies:
htn
no 0.630653
yes 0.369347
Name: proportion, dtype: float64
Processing column: dm
Possible categories and their frequencies:
dm
no 0.655779
yes 0.344221
Name: proportion, dtype: float64
Processing column: cad
Possible categories and their frequencies:
cad
no 0.914573
yes 0.085427
Name: proportion, dtype: float64
Processing column: appet
Possible categories and their frequencies:
appet
good 0.794486
poor 0.205514
Name: proportion, dtype: float64
Processing column: pe
Possible categories and their frequencies:
pe
no 0.809524
yes 0.190476
Name: proportion, dtype: float64
Processing column: ane
Possible categories and their frequencies:
ane
no 0.849624
yes 0.150376
Name: proportion, dtype: float64
Processing column: classification
Possible categories and their frequencies:
classification
ckd 0.625
notckd 0.375
Name: proportion, dtype: float64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 400 non-null int64
1 age 391 non-null float64
2 bp 388 non-null float64
3 sg 353 non-null float64
4 al 354 non-null float64
5 su 351 non-null float64
6 rbc 400 non-null object
7 pc 400 non-null object
8 pcc 400 non-null object
9 ba 400 non-null object
10 bgr 356 non-null float64
11 bu 381 non-null float64
12 sc 383 non-null float64
13 sod 313 non-null float64
14 pot 312 non-null float64
15 hemo 348 non-null float64
16 pcv 400 non-null object
17 wc 400 non-null object
18 rc 400 non-null object
19 htn 400 non-null object
20 dm 400 non-null object
21 cad 400 non-null object
22 appet 400 non-null object
23 pe 400 non-null object
24 ane 400 non-null object
25 classification 400 non-null object
dtypes: float64(11), int64(1), object(14)
memory usage: 81.4+ KB
Number of rows : 400
Number of rows with at least one NaN value: 172
43% of our rows have at least one missing value
%% Cell type:code id: tags:
``` python
# Example usage
scale_normalize(df_kidney,numerical_columns)
```
%% Output
#######BEFORE SCALING AND NORMALIZING########
id age bp sg al su \
count 400.000000 391.000000 388.000000 353.000000 354.000000 351.000000
mean 199.500000 51.483376 76.469072 1.017408 1.016949 0.450142
std 115.614301 17.169714 13.683637 0.005717 1.352679 1.099191
min 0.000000 2.000000 50.000000 1.005000 0.000000 0.000000
25% 99.750000 42.000000 70.000000 1.010000 0.000000 0.000000
50% 199.500000 55.000000 80.000000 1.020000 0.000000 0.000000
75% 299.250000 64.500000 80.000000 1.020000 2.000000 0.000000
max 399.000000 90.000000 180.000000 1.025000 5.000000 5.000000
bgr bu sc sod pot hemo
count 356.000000 381.000000 383.000000 313.000000 312.000000 348.000000
mean 148.036517 57.425722 3.072454 137.528754 4.627244 12.526437
std 79.281714 50.503006 5.741126 10.408752 3.193904 2.912587
min 22.000000 1.500000 0.400000 4.500000 2.500000 3.100000
25% 99.000000 27.000000 0.900000 135.000000 3.800000 10.300000
50% 121.000000 42.000000 1.300000 138.000000 4.400000 12.650000
75% 163.000000 66.000000 2.800000 142.000000 4.900000 15.000000
max 490.000000 391.000000 76.000000 163.000000 47.000000 17.800000
#######AFTER SCALING AND NORMALIZING########
id age bp sg al \
count 4.000000e+02 3.910000e+02 3.880000e+02 3.530000e+02 3.540000e+02
mean -1.421085e-16 1.272071e-16 2.197555e-16 3.220590e-16 8.028731e-17
std 1.001252e+00 1.001281e+00 1.001291e+00 1.001419e+00 1.001415e+00
min -1.727726e+00 -2.885708e+00 -1.936857e+00 -2.173584e+00 -7.528679e-01
25% -8.638630e-01 -5.530393e-01 -4.733701e-01 -1.297699e+00 -7.528679e-01
50% -9.540979e-17 2.050779e-01 2.583733e-01 4.540705e-01 -7.528679e-01
75% 8.638630e-01 7.590867e-01 2.583733e-01 4.540705e-01 7.277723e-01
max 1.727726e+00 2.246163e+00 7.575807e+00 1.329955e+00 2.948733e+00
su bgr bu sc sod \
count 3.510000e+02 3.560000e+02 3.810000e+02 3.830000e+02 3.130000e+02
mean 2.024338e-17 1.596725e-16 5.594825e-17 1.855203e-17 -1.021547e-15
std 1.001428e+00 1.001407e+00 1.001315e+00 1.001308e+00 1.001601e+00
min -4.101061e-01 -1.591967e+00 -1.108830e+00 -4.661019e-01 -1.280094e+01
25% -4.101061e-01 -6.193803e-01 -6.032459e-01 -3.788971e-01 -2.433340e-01
50% -4.101061e-01 -3.414983e-01 -3.058433e-01 -3.091332e-01 4.534651e-02
75% -4.101061e-01 1.890038e-01 1.700008e-01 -4.751867e-02 4.302539e-01
max 4.145186e+00 4.319341e+00 6.613723e+00 1.271927e+01 2.451017e+00
pot hemo
count 3.120000e+02 3.480000e+02
mean -4.554761e-17 -2.858505e-16
std 1.001606e+00 1.001440e+00
min -6.671023e-01 -3.241109e+00
25% -2.594231e-01 -7.655198e-01
50% -7.126345e-02 4.248496e-02
75% 8.553625e-02 8.504897e-01
max 1.328807e+01 1.813219e+00
%% Cell type:code id: tags:
``` python
nominal_columns = get_categorical_columns(df_kidney)
df_kidney = convert_categorical_feats(df_kidney, nominal_columns)
```
%% Cell type:code id: tags:
``` python
fill_numerical_columns(df_kidney, skew_threshold=0.5)
```
%% Output
id done !
age done !
bp done !
sg done !
al done !
su done !
rbc done !
pc done !
pcc done !
ba done !
bgr done !
bu done !
sc done !
sod done !
pot done !
hemo done !
htn done !
dm done !
cad done !
appet done !
pe done !
ane done !
classification done !
pcv_14 done !
pcv_15 done !
pcv_16 done !
pcv_17 done !
pcv_18 done !
pcv_19 done !
pcv_20 done !
pcv_21 done !
pcv_22 done !
pcv_23 done !
pcv_24 done !
pcv_25 done !
pcv_26 done !
pcv_27 done !
pcv_28 done !
pcv_29 done !
pcv_30 done !
pcv_31 done !
pcv_32 done !
pcv_33 done !
pcv_34 done !
pcv_35 done !
pcv_36 done !
pcv_37 done !
pcv_38 done !
pcv_39 done !
pcv_40 done !
pcv_41 done !
pcv_42 done !
pcv_43 done !
pcv_44 done !
pcv_45 done !
pcv_46 done !
pcv_47 done !
pcv_48 done !
pcv_49 done !
pcv_50 done !
pcv_51 done !
pcv_52 done !
pcv_53 done !
pcv_54 done !
pcv_9 done !
wc_10200 done !
wc_10300 done !
wc_10400 done !
wc_10500 done !
wc_10700 done !
wc_10800 done !
wc_10900 done !
wc_11000 done !
wc_11200 done !
wc_11300 done !
wc_11400 done !
wc_11500 done !
wc_11800 done !
wc_11900 done !
wc_12000 done !
wc_12100 done !
wc_12200 done !
wc_12300 done !
wc_12400 done !
wc_12500 done !
wc_12700 done !
wc_12800 done !
wc_13200 done !
wc_13600 done !
wc_14600 done !
wc_14900 done !
wc_15200 done !
wc_15700 done !
wc_16300 done !
wc_16700 done !
wc_18900 done !
wc_19100 done !
wc_21600 done !
wc_2200 done !
wc_2600 done !
wc_26400 done !
wc_3800 done !
wc_4100 done !
wc_4200 done !
wc_4300 done !
wc_4500 done !
wc_4700 done !
wc_4900 done !
wc_5000 done !
wc_5100 done !
wc_5200 done !
wc_5300 done !
wc_5400 done !
wc_5500 done !
wc_5600 done !
wc_5700 done !
wc_5800 done !
wc_5900 done !
wc_6000 done !
wc_6200 done !
wc_6300 done !
wc_6400 done !
wc_6500 done !
wc_6600 done !
wc_6700 done !
wc_6800 done !
wc_6900 done !
wc_7000 done !
wc_7100 done !
wc_7200 done !
wc_7300 done !
wc_7400 done !
wc_7500 done !
wc_7700 done !
wc_7800 done !
wc_7900 done !
wc_8000 done !
wc_8100 done !
wc_8200 done !
wc_8300 done !
wc_8400 done !
wc_8500 done !
wc_8600 done !
wc_8800 done !
wc_9000 done !
wc_9100 done !
wc_9200 done !
wc_9300 done !
wc_9400 done !
wc_9500 done !
wc_9600 done !
wc_9700 done !
wc_9800 done !
wc_9900 done !
rc_2.1 done !
rc_2.3 done !
rc_2.4 done !
rc_2.5 done !
rc_2.6 done !
rc_2.7 done !
rc_2.8 done !
rc_2.9 done !
rc_3 done !
rc_3.0 done !
rc_3.1 done !
rc_3.2 done !
rc_3.3 done !
rc_3.4 done !
rc_3.5 done !
rc_3.6 done !
rc_3.7 done !
rc_3.8 done !
rc_3.9 done !
rc_4 done !
rc_4.0 done !
rc_4.1 done !
rc_4.2 done !
rc_4.3 done !
rc_4.4 done !
rc_4.5 done !
rc_4.6 done !
rc_4.7 done !
rc_4.8 done !
rc_4.9 done !
rc_5 done !
rc_5.0 done !
rc_5.1 done !
rc_5.2 done !
rc_5.3 done !
rc_5.4 done !
rc_5.5 done !
rc_5.6 done !
rc_5.7 done !
rc_5.8 done !
rc_5.9 done !
rc_6.0 done !
rc_6.1 done !
rc_6.2 done !
rc_6.3 done !
rc_6.4 done !
rc_6.5 done !
%% Cell type:code id: tags:
``` python
df_kidney.isna().sum()
```
%% Output
id 0
age 0
bp 0
sg 0
al 0
..
rc_6.2 0
rc_6.3 0
rc_6.4 0
rc_6.5 0
rc_8.0 0
Length: 202, dtype: int64
%% Cell type:code id: tags:
``` python
df_kidney['classification'].value_counts()
```
%% Output
classification
0 250
1 150
Name: count, dtype: int64
%% Cell type:code id: tags:
``` python
result_df, explainable_ratios = feature_selection(df_kidney, 'classification', threshold_variance_ratio=0.90)
```
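%% Cell type:markdown id: tags:
`feature_selection` is imported from `binary_classification_workflow`, so its internals are not shown in this notebook. As a hedged sketch (an assumption based on the `threshold_variance_ratio` parameter, not the project's actual implementation), it plausibly runs a PCA and keeps the smallest number of components whose cumulative explained variance reaches 90%:
%% Cell type:code id: tags:
``` python
# Hypothetical sketch of variance-ratio thresholding with scikit-learn's PCA.
# The synthetic data below stands in for the preprocessed kidney features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, _ = make_classification(n_samples=100, n_features=10, random_state=0)
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
# First index where the cumulative ratio reaches the 0.90 threshold.
n_components = int(np.searchsorted(cumulative, 0.90)) + 1
print(n_components, cumulative[n_components - 1])
```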
%% Cell type:code id: tags:
``` python
result_df['classification'].value_counts()
```
%% Output
classification
0 250
1 150
Name: count, dtype: int64
%% Cell type:code id: tags:
``` python
X_train, X_test, y_train, y_test, cv = split(result_df, 'classification',alpha=0.2,n=5)
```
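%% Cell type:markdown id: tags:
`split` also comes from the workflow module. Judging by its signature (`alpha=0.2, n=5`) and return values, it plausibly wraps a stratified hold-out split plus a 5-fold cross-validation iterator; a minimal sketch under that assumption:
%% Cell type:code id: tags:
``` python
# Hypothetical sketch: stratified hold-out split plus a CV iterator.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # alpha = 0.2
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # n = 5
print(len(X_train), len(X_test), cv.get_n_splits())
```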
%% Cell type:code id: tags:
``` python
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
mask_ckd = result_df['classification'] == 0
axes[0].scatter(result_df.loc[mask_ckd, 'PCA1'],
                result_df.loc[mask_ckd, 'PCA2'],
                color='red', label="CKD")
axes[0].scatter(result_df.loc[~mask_ckd, 'PCA1'],
                result_df.loc[~mask_ckd, 'PCA2'],
                color='green', label="NOT CKD")
axes[0].set_xlabel("PCA1: First component")
axes[0].set_ylabel("PCA2: Second component")
axes[0].legend()
axes[1].plot(range(1, len(explainable_ratios) + 1), explainable_ratios)
axes[1].axhline(y=0.90, linestyle='--', color='red', label='Threshold ratio')
axes[1].set_xlabel('Number of components')
axes[1].set_ylabel('Cumulative explained variance')
axes[1].legend()  # without this call the threshold label is never displayed
plt.tight_layout()
plt.subplots_adjust(wspace=0.4)
plt.show()
# The two classes are distinguishable when the feature space is projected
# onto the first two PCA eigenvectors.
```
%% Output
%% Cell type:code id: tags:
``` python
dict_models = {
'RandomForestClassifier': {
'model': RandomForestClassifier(),
'param_grid': {
'n_estimators': [50, 100, 150, 200],
'criterion': ['gini', 'entropy'],
'max_depth': [None, 10, 20],
'bootstrap': [True, False]
}
},
'Logistic Regression': {
'model': LogisticRegression(),
'param_grid': {
'C': [0.001, 0.01, 0.1, 1, 10, 100],
'fit_intercept': [True, False],
'intercept_scaling': [1, 10, 100]
}
},
'AdaBoostClassifier': {
'model': AdaBoostClassifier(),
'param_grid': {
'n_estimators': [50, 100, 150, 200],
'algorithm': ['SAMME', 'SAMME.R'],
'learning_rate': [0.01, 0.1, 0.5, 1]
}
}
}
```
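%% Cell type:markdown id: tags:
`display_results` (also from the workflow module) presumably iterates over this dictionary, grid-searches each entry, and reports F1 scores. A hedged sketch of that loop using plain `GridSearchCV` on synthetic data (the actual call below uses the kidney splits and the full dictionary):
%% Cell type:code id: tags:
``` python
# Hypothetical sketch of the model-dictionary loop behind display_results.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=120, n_features=8, random_state=0)
models = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000),
        'param_grid': {'C': [0.1, 1, 10]},
    },
}
scores = {}
for name, spec in models.items():
    search = GridSearchCV(spec['model'], spec['param_grid'],
                          cv=5, scoring='f1')
    search.fit(X, y)
    scores[name] = f1_score(y, search.best_estimator_.predict(X))
print(scores)
```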
%% Cell type:code id: tags:
``` python
display_results(dict_models, X_train, y_train, X_test, y_test, cv, 'f1 scoring on Kidney data(%)')
```
%% Output
Going through each model defined in the dictionnary...: 0%| | 0/3 [00:00<?, ?it/s]
Fitting 5 folds for each of 48 candidates, totalling 240 fits
Going through each model defined in the dictionnary...: 33%|███▎ | 1/3 [00:07<00:15, 7.74s/it]
Fitting 5 folds for each of 36 candidates, totalling 180 fits
Going through each model defined in the dictionnary...: 67%|██████▋ | 2/3 [00:09<00:04, 4.20s/it]
Fitting 5 folds for each of 32 candidates, totalling 160 fits
Going through each model defined in the dictionnary...: 100%|██████████| 3/3 [00:13<00:00, 4.45s/it]
<pandas.io.formats.style.Styler at 0x7feba1c65430>
%% Cell type:markdown id: tags:
# Code good practices
%% Cell type:markdown id: tags:
Programming is not just about writing code that works; it's also about writing code that is maintainable, readable, and efficient. Good programming practices contribute to the overall quality of code, making it easier to understand, modify, and collaborate on. Here are some essential good programming practices that we tried to follow in our work.
%% Cell type:markdown id: tags:
### Code readability
%% Cell type:markdown id: tags:
- Use meaningful variable and function names: choose names that clearly convey the purpose of the variable or function.
- Keep lines to a consistent length: avoid lines longer than 80-120 characters.
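%% Cell type:markdown id: tags:
For instance (a hypothetical snippet, not code from the workflow module), a descriptive function name and keyword arguments make the missing-value computation from earlier self-explanatory:
%% Cell type:code id: tags:
``` python
# Hypothetical example: names that say what they mean, lines kept short.
def missing_value_ratio(n_rows, n_rows_with_nan):
    """Return the fraction of rows with at least one missing value."""
    return n_rows_with_nan / n_rows

ratio = missing_value_ratio(n_rows=400, n_rows_with_nan=242)
print(f"{round(ratio * 100)}% of rows have at least one missing value")
```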
%% Cell type:markdown id: tags:
### Modularity
%% Cell type:markdown id: tags:
- Break code into functions or classes: Divide your code into smaller, reusable modules. This promotes code reuse and makes it easier to understand.
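%% Cell type:markdown id: tags:
A hypothetical illustration: when each preprocessing step lives in its own small function, the steps can be tested, composed, and reused independently:
%% Cell type:code id: tags:
``` python
# Hypothetical example: one small, reusable function per preprocessing step.
def drop_id_column(rows):
    """Remove the uninformative 'id' key from every row."""
    return [{k: v for k, v in row.items() if k != 'id'} for row in rows]

def count_rows_with_missing(rows):
    """Count rows that contain at least one None value."""
    return sum(any(v is None for v in row.values()) for row in rows)

rows = [{'id': 0, 'age': 48.0}, {'id': 1, 'age': None}]
print(count_rows_with_missing(drop_id_column(rows)))
```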
%% Cell type:markdown id: tags:
### Comments and documentation
%% Cell type:markdown id: tags:
- Write clear comments: Use comments to explain complex logic, assumptions, or any non-obvious aspects of your code.
- Provide documentation: Include docstrings to describe the purpose, parameters, and return values of functions or methods.
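%% Cell type:markdown id: tags:
As a hypothetical example, a function documented in the NumPy docstring style used throughout the `.py` file:
%% Cell type:code id: tags:
``` python
# Hypothetical example of a NumPy-style docstring.
def f1_from_counts(tp, fp, fn):
    """Compute the F1 score from raw counts.

    Parameters
    ----------
    tp, fp, fn : int
        True positives, false positives and false negatives.

    Returns
    -------
    float
        Harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_from_counts(tp=8, fp=2, fn=2))
```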
%% Cell type:markdown id: tags:
### Other good practices
%% Cell type:markdown id: tags:
- Implement proper error handling: Anticipate and handle exceptions gracefully to prevent unexpected crashes.
- Use version control systems (e.g., Git): Keep track of changes, collaborate with others, and easily revert to previous versions if needed.
- Optimize when necessary: Identify bottlenecks and optimize critical sections of your code. However, prioritize readability over premature optimization.
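%% Cell type:markdown id: tags:
For example (a hypothetical helper, not part of the workflow module), loading a dataset can fail gracefully instead of crashing the whole notebook:
%% Cell type:code id: tags:
``` python
# Hypothetical example: handle a missing data file gracefully.
import pandas as pd

def load_dataset(path):
    """Read a CSV file, returning an empty DataFrame if it is missing."""
    try:
        return pd.read_csv(path)
    except FileNotFoundError:
        print(f"File not found: {path}; returning an empty DataFrame")
        return pd.DataFrame()

df_missing = load_dataset('./data/nonexistent.csv')
print(df_missing.empty)
```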
%% Cell type:markdown id: tags:
### Conclusion
%% Cell type:markdown id: tags:
Adhering to good programming practices is crucial for writing code that is not only functional but also maintainable, scalable, and collaborative. By following these practices, you contribute to the creation of high-quality software that stands the test of time. Remember, writing code is not just about solving a problem; it's about solving it in the best possible way.
......