
Update README: Add code examples and improve project documentation

Merged DOLATI Mohammad requested to merge m24dolat-main-patch-98093 into main
## Introduction
This project applies a binary classification workflow to two datasets, showcasing a comprehensive machine learning pipeline:
1. **Banknote Authentication Dataset**: Utilized to distinguish between authentic and forged banknotes, demonstrating robust feature processing and classification techniques.
https://archive.ics.uci.edu/ml/datasets/banknote+authentication
2. **Chronic Kidney Disease Dataset**: Focused on predicting the presence or absence of chronic kidney disease, emphasizing data cleaning, feature engineering, and reliable model evaluation.
https://www.kaggle.com/mansoordaku/ckdisease
## Description
This project implements a structured machine learning workflow focusing on quality and reproducibility. It includes:
- Data Preprocessing: Cleaning data, encoding categorical variables, scaling features, and removing irrelevant or imbalanced classes.
- Model Training & Tuning: Splitting data, training models (e.g., Random Forest, Neural Networks), and optimizing hyperparameters with GridSearchCV.
- Result Visualization: Plotting performance metrics and analyzing feature distributions.
- Best Practices: Modular code, unit testing, and Git-based collaboration.
This pipeline ensures reliability and adheres to industry best practices.
## Installation
To run this project, you need Python installed on your machine along with the following essential libraries that are widely used in machine learning and data analysis workflows:
- **Pandas**: Used for data manipulation and analysis.
- **NumPy**: For numerical operations and working with arrays.
- **Matplotlib**: Used for plotting and visualizing data.
- **Seaborn**: A library built on top of Matplotlib, used for advanced statistical data visualization.
- **Scikit-learn**: A machine learning library providing tools for preprocessing, model training, and evaluation.
- **TQDM**: For creating progress bars when iterating through data.
- **SciPy**: Provides statistical functions like Z-score calculation.
- **Unittest**: Python's built-in framework for writing and executing unit tests that validate individual components of the project (part of the standard library, so it needs no separate installation).
You can install these packages using the following command:
```bash
pip install pandas numpy matplotlib seaborn scikit-learn tqdm scipy
```
## File Description
`code.py`:
- Contains all functions for data preprocessing, model training, validation, and result visualization.
`notebook.ipynb`:
- Demonstrates the application of the functions from `code.py` to the datasets.
`test_preprocessing.py`, `test_model_training.py`, `test_visualization.py`:
- Unit tests for individual components of the project.
`kidney_disease.csv`, `data_banknote_authentication.csv`:
- Datasets.
## Usage
1. **Clone the Repository**:
```bash
git clone https://gitlab.imt-atlantique.fr/a23elhas/projet-ue-a-intro-ml.git
```
2. **Download the Datasets**:
- Place the datasets in the project directory.
3. **Run the Jupyter Notebook**:
- Open `notebook.ipynb` to explore the workflow and results:
```bash
jupyter notebook notebook.ipynb
```
4. **Execute Unit Tests**:
- Run the unit tests to verify the correctness of the implemented functions:
```bash
python -m unittest discover -s . -p "test_*.py"
```
## Functions Overview
1. **Data Preprocessing**
Functions for:
- Handling missing values
- Scaling and normalizing numerical features
- Encoding categorical features
- Removing irrelevant or skewed data
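
The preprocessing steps above can be sketched as follows. This is a minimal illustration: the column names (`age`, `htn`) and the label mapping are invented for the example and are not the project's actual schema; scaling uses SciPy's z-score, mentioned under Installation.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy frame with a missing value and a categorical column
# (illustrative columns, not the project's actual schema).
df = pd.DataFrame({
    "age": [48.0, np.nan, 62.0, 51.0],
    "htn": ["yes", "no", "yes", "no"],  # hypothetical categorical feature
})

# Handle missing values: impute the numeric column with its median.
df["age"] = df["age"].fillna(df["age"].median())

# Encode the categorical feature as integers.
df["htn"] = df["htn"].map({"no": 0, "yes": 1})

# Scale the numeric feature with SciPy's z-score (zero mean, unit variance).
df["age"] = zscore(df["age"])

print(df)
```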
2. **Model Training and Validation**
Functions for:
- Splitting the dataset into training and testing sets
- Training classification models
- Hyperparameter tuning using GridSearchCV
- Comparing model performance
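
A minimal sketch of these training steps, using synthetic data in place of the real datasets (the hyperparameter grid values are illustrative, not the project's actual configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary-classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Tune a Random Forest over a small hyperparameter grid with GridSearchCV.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, None]},
    cv=3,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```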
3. **Result Visualization**
Functions for:
- Plotting data distributions
- Comparing model performance metrics
- Displaying results for analysis
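
A minimal sketch of a model-comparison plot with Matplotlib (the scores shown are placeholder values, not actual project results):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder accuracy scores for two models (illustrative values only).
scores = {"Random Forest": 0.98, "Neural Network": 0.95}

fig, ax = plt.subplots()
ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylabel("Accuracy")
ax.set_title("Model comparison")
fig.savefig("model_comparison.png")
```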
4. **Unit Testing**
- Ensures the reliability of preprocessing, model training, and visualization functions
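
The testing pattern looks like this; `encode_label` below is a hypothetical helper invented for the example, not a function from `code.py`:

```python
import unittest

def encode_label(value):
    # Hypothetical helper: map a class label to 0/1.
    return {"no": 0, "yes": 1}[value]

class TestEncodeLabel(unittest.TestCase):
    def test_known_labels(self):
        self.assertEqual(encode_label("yes"), 1)
        self.assertEqual(encode_label("no"), 0)

    def test_unknown_label_raises(self):
        with self.assertRaises(KeyError):
            encode_label("maybe")

# Run the tests programmatically; the project itself uses
# `python -m unittest discover` as shown under Usage.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEncodeLabel)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```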
## Acknowledgments
This project is part of the course "Intro to ML" at IMT Atlantique.