ML Project - Operator Decision

This repository contains an implementation of a Multi-Layer Perceptron (MLP) for classifying the alarms raised by the statistical model implemented by the PokaPok association for monitoring the state of the ocean.

To set up the environment for training and running the model, install the requirements from environment.yaml.

The repository contains the following folders:

  • dataset_pandas: Datasets resulting from the feature engineering of the profiler data
  • logs: A series of CSV files that log the metrics after each run of the model (accuracy, recall, precision, F1-score, F2-score)
  • results: A series of plots for the different setups used to train the model, namely the loss and accuracy evolution plots and the resulting confusion matrix, as well as the distribution of each cell in the confusion matrix
  • checkpoints: Checkpoints for the trained MLPs
  • additional_ressources: Contains the papers about the base statistical model and the report on the initial MLP implementation and the feature engineering process

The helper classes are the following:

  • dataloader.py: Responsible for preprocessing the data to prepare it for training. This includes setting the train and test data splits and the undersampling methods.
  • mlp.py: The MLP class, where the hyperparameters can be set and the training is defined. It also contains a helper method to load models from a checkpoint and others for evaluation.
  • plot_results.py: A helper class to plot different metrics after the model has been trained. This includes accuracy and loss plots, and logger methods to append metrics to the CSV files. A minimal usage sketch is shown after this list.
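The following is a minimal, hypothetical sketch of how these helpers might be combined. The class and method names (DataLoader, MLP, train_test_split, train, evaluate, plot_all) and the dataset path are assumptions for illustration only; refer to the modules themselves for the actual interfaces.

```python
# Hypothetical usage sketch: class/method names and the dataset path are
# assumptions, not the actual interfaces of dataloader.py / mlp.py / plot_results.py.
from dataloader import DataLoader      # assumed class name
from mlp import MLP                    # assumed class name
from plot_results import PlotResults   # assumed class name

# Preprocess the data: train/test split and undersampling (see dataloader.py).
data = DataLoader("dataset_pandas/profiles.csv")            # hypothetical path
X_train, X_test, y_train, y_test = data.train_test_split(undersample=True)

# Set the hyperparameters and train the MLP (see mlp.py).
model = MLP(hidden_layers=(64, 32), learning_rate=1e-3)
model.train(X_train, y_train)

# Evaluate and plot the metrics and confusion matrix (see plot_results.py).
metrics = model.evaluate(X_test, y_test)
PlotResults(metrics).plot_all()
```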

The standalone scripts are the following:

To run the model, use the standalone script main.py with one of the following arguments:

  1. --run single: runs a single training of the model. This is useful to investigate the accuracy and loss evolution plots, as well as the confusion matrix, for different setups.
  2. --run multiple: runs the model with a fixed model seed and different data splits. This is used to investigate the robustness of the model. A sketch of the corresponding argument handling is shown below.
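Only the --run flag is documented here; the following is a minimal sketch of how the argument handling in main.py might look, assuming argparse. The helpers run_single and run_multiple are hypothetical stand-ins for the actual training entry points.

```python
# Minimal sketch of main.py's argument handling, assuming argparse.
# Only the --run flag is documented in this README; run_single / run_multiple
# are hypothetical helpers standing in for the actual training entry points.
import argparse

def run_single():
    """Single training run: loss/accuracy evolution and confusion matrix."""
    ...

def run_multiple():
    """Fixed model seed, different data splits: robustness study."""
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ML Project - Operator Decision")
    parser.add_argument("--run", choices=["single", "multiple"], required=True,
                        help="'single' for one training run, 'multiple' for a robustness study")
    args = parser.parse_args()

    if args.run == "single":
        run_single()
    else:
        run_multiple()
```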

Two notebooks are provided to help gain insight into the datasets:

  1. dataset_exploration.ipynb: This is used to visualize the initial raw data, if the raw dataset is available.

  2. measurements_exploration.ipynb: This is used to gather statistics about the datasets, e.g. the total number of instances, the number of instances with temperature/salinity alerts, etc. A sketch of this kind of computation is shown below.
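As an illustration, the statistics mentioned above could be gathered with pandas as sketched below. The file name and the column names (temperature_alert, salinity_alert) are assumptions; the actual dataset schema may differ.

```python
# Hypothetical sketch of the statistics gathered in measurements_exploration.ipynb.
# The file name and the column names (temperature_alert, salinity_alert) are
# assumptions for illustration; the actual dataset schema may differ.
import pandas as pd

df = pd.read_csv("dataset_pandas/measurements.csv")      # hypothetical file name

total_instances = len(df)
temperature_alerts = int(df["temperature_alert"].sum())  # assumed 0/1 column
salinity_alerts = int(df["salinity_alert"].sum())        # assumed 0/1 column

print(f"Total instances: {total_instances}")
print(f"Instances with temperature alerts: {temperature_alerts}")
print(f"Instances with salinity alerts: {salinity_alerts}")
```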