
Update README: Add code examples and improve project documentation

Merged DOLATI Mohammad requested to merge m24dolat-main-patch-98093 into main
## Introduction
This project applies a binary classification workflow to two datasets, showcasing a comprehensive machine learning pipeline:
1. **Banknote Authentication Dataset**: Utilized to distinguish between authentic and forged banknotes, demonstrating robust feature processing and classification techniques.
https://archive.ics.uci.edu/ml/datasets/banknote+authentication
2. **Chronic Kidney Disease Dataset**: Focused on predicting the presence or absence of chronic kidney disease, emphasizing data cleaning, feature engineering, and reliable model evaluation.
https://www.kaggle.com/mansoordaku/ckdisease
## Description
This project implements a structured machine learning workflow focusing on quality and reproducibility. It includes:
- Data Preprocessing: Cleaning data, encoding categorical variables, scaling features, and removing irrelevant or imbalanced classes.
- Model Training & Tuning: Splitting data, training models (e.g., Random Forest, Neural Networks), and optimizing hyperparameters with GridSearchCV.
- Result Visualization: Plotting performance metrics and analyzing feature distributions.
- Best Practices: Modular code, unit testing, and Git-based collaboration.
This pipeline ensures reliability and adheres to industry best practices.
## Installation
To run this project, you need Python installed on your machine along with the following essential libraries that are widely used in machine learning and data analysis workflows:
- **Pandas**: Used for data manipulation and analysis.
- **NumPy**: For numerical operations and working with arrays.
- **Matplotlib**: Used for plotting and visualizing data.
- **Seaborn**: A library built on top of Matplotlib, used for advanced statistical data visualization.
- **Scikit-learn**: A machine learning library providing tools for preprocessing, model training, and evaluation.
- **TQDM**: For creating progress bars when iterating through data.
- **SciPy**: Provides statistical functions like Z-score calculation.
- **Unittest**: Python's built-in framework for writing and executing unit tests that validate individual components of the project (part of the standard library, so it needs no separate installation).
You can install these packages using the following command:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn tqdm scipy
```
## File Description
`code.py`:
- Contains all functions for data preprocessing, model training, validation, and result visualization.
`notebook.ipynb`:
- Demonstrates the application of the functions from `code.py` to the datasets.
`test_preprocessing.py`, `test_model_training.py`, `test_visualization.py`:
- Unit tests for individual components of the project.
`kidney_disease.csv`, `data_banknote_authentication.csv`:
- Datasets.
## Usage
1. **Clone the Repository**:
```bash
git clone https://gitlab.imt-atlantique.fr/a23elhas/projet-ue-a-intro-ml.git
```
2. **Download the Datasets**:
- Place the datasets in the project directory.
3. **Run the Jupyter Notebook**:
- Open `notebook.ipynb` to explore the workflow and results:
```bash
jupyter notebook notebook.ipynb
```
4. **Execute Unit Tests**:
- Run the unit tests to verify the correctness of the implemented functions:
```bash
python -m unittest discover -s . -p "test_*.py"
```
## Functions Overview
1. **Data Preprocessing**
Functions for:
- Handling missing values
- Scaling and normalizing numerical features
- Encoding categorical features
- Removing irrelevant or skewed data
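
The preprocessing steps above can be sketched as follows. This is a minimal illustration: the column names (`age`, `htn`) and the label mapping are invented for the example and are not the project's actual schema; scaling uses SciPy's z-score, mentioned under Installation.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame with a missing value and a categorical column
# (illustrative columns, not the project's actual schema).
df = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "htn": ["yes", "no", "yes", "no"],  # hypothetical categorical feature
})

# Handle missing values: impute the numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# Encode the categorical feature as integers.
df["htn"] = df["htn"].map({"no": 0, "yes": 1})

# Scale the numeric feature with SciPy's z-score (zero mean, unit variance).
df["age"] = zscore(df["age"])

print(df)
```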
2. **Model Training and Validation**
Functions for:
- Splitting the dataset into training and testing sets
- Training classification models
- Hyperparameter tuning using GridSearchCV
- Comparing model performance
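
A minimal sketch of these training steps, using synthetic data in place of the real datasets (the hyperparameter grid values are illustrative, not the project's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Tune a Random Forest over a small hyperparameter grid with GridSearchCV.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=3,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```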
3. **Result Visualization**
Functions for:
- Plotting data distributions
- Comparing model performance metrics
- Displaying results for analysis
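
A minimal sketch of a model-comparison plot with Matplotlib (the scores shown are placeholder values, not actual project results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder accuracy scores for two models (illustrative values only).
scores = {"Random Forest": 0.98, "Neural Network": 0.95}

fig, ax = plt.subplots()
ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylabel("Accuracy")
ax.set_title("Model comparison")
fig.savefig("model_comparison.png")
```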
4. **Unit Testing**
- Ensures the reliability of preprocessing, model training, and visualization functions
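
The testing pattern looks like this; `encode_label` below is a hypothetical helper invented for the example, not a function from `code.py`:

```python
import unittest

def encode_label(value):
    # Hypothetical helper: map a class label to 0/1.
    return {"no": 0, "yes": 1}[value]

class TestEncodeLabel(unittest.TestCase):
    def test_known_labels(self):
        self.assertEqual(encode_label("yes"), 1)
        self.assertEqual(encode_label("no"), 0)

    def test_unknown_label_raises(self):
        with self.assertRaises(KeyError):
            encode_label("maybe")

# Run the tests programmatically; the project itself uses
# `python -m unittest discover` as shown under Usage.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEncodeLabel)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```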
## Acknowledgments
This project is part of the course "Intro to ML" at IMT Atlantique.