ML Project - Operator Decision

This repository contains an implementation of a Multi-Layer Perceptron (MLP) for classifying the alarms raised by the statistical model implemented by the PokaPok association for monitoring the state of the ocean.

To set up the environment for training and running the model, install the requirements from environment.yaml.

The repository contains the following folders:

  • dataset_pandas: Datasets resulting from the feature engineering of the profiler data
  • logs: A series of CSV files that log the metrics after each run of the model (accuracy, recall, precision, F1-score, F2-score)
  • results: A series of plots for the different setups used to train the model, namely the loss and accuracy evolution plots and the resulting confusion matrix, as well as the distribution of each cell in the confusion matrix
  • checkpoints: Checkpoints for the trained MLPs
  • additional_ressources: Contains the papers about the base statistical model and the report on the initial MLP implementation and the feature engineering process

The helper classes are the following:

  • dataloader.py: Responsible for preprocessing the data to prepare it for training. This includes setting the train and test data splits and the undersampling methods.
  • mlp.py: The MLP class, where the hyperparameters can be set and the training is defined. It also contains a helper method to load models from a checkpoint and others for evaluation.
  • plot_results.py: A helper class to plot different metrics after the model has been trained. This includes accuracy and loss plots, and logger methods to append metrics to the CSV files. A minimal usage sketch is shown after this list.
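The following is a minimal, hypothetical sketch of how these helpers might be combined. The class and method names (DataLoader, MLP, train_test_split, train, evaluate, plot_all) and the dataset path are assumptions for illustration only; refer to the modules themselves for the actual interfaces.

```python
# Hypothetical usage sketch: class/method names and the dataset path are
# assumptions, not the actual interfaces of dataloader.py / mlp.py / plot_results.py.
from dataloader import DataLoader      # assumed class name
from mlp import MLP                    # assumed class name
from plot_results import PlotResults   # assumed class name

# Preprocess the data: train/test split and undersampling (see dataloader.py).
data = DataLoader("dataset_pandas/profiles.csv")            # hypothetical path
X_train, X_test, y_train, y_test = data.train_test_split(undersample=True)

# Set the hyperparameters and train the MLP (see mlp.py).
model = MLP(hidden_layers=(64, 32), learning_rate=1e-3)
model.train(X_train, y_train)

# Evaluate and plot the metrics and confusion matrix (see plot_results.py).
metrics = model.evaluate(X_test, y_test)
PlotResults(metrics).plot_all()
```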

The standalone scripts are the following:

To run the model, use the standalone script main.py with one of the following arguments:

  1. --run single: runs a single training of the model. This is useful to investigate the accuracy and loss evolution plots, as well as the confusion matrix, for different setups.
  2. --run multiple: runs the model with a fixed model seed and different data splits. This is used to investigate the robustness of the model. A sketch of the corresponding argument handling is shown below.
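Only the --run flag is documented here; the following is a minimal sketch of how the argument handling in main.py might look, assuming argparse. The helpers run_single and run_multiple are hypothetical stand-ins for the actual training entry points.

```python
# Minimal sketch of main.py's argument handling, assuming argparse.
# Only the --run flag is documented in this README; run_single / run_multiple
# are hypothetical helpers standing in for the actual training entry points.
import argparse

def run_single():
    """Single training run: loss/accuracy evolution and confusion matrix."""
    ...

def run_multiple():
    """Fixed model seed, different data splits: robustness study."""
    ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="ML Project - Operator Decision")
    parser.add_argument("--run", choices=["single", "multiple"], required=True,
                        help="'single' for one training run, 'multiple' for a robustness study")
    args = parser.parse_args()

    if args.run == "single":
        run_single()
    else:
        run_multiple()
```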

Two notebooks are provided to help gain insight into the datasets:

  1. dataset_exploration.ipynb: This is used to visualize the initial raw data, if the raw dataset is available.

  2. measurements_exploration.ipynb: This is used to gather statistics about the datasets, e.g. the total number of instances, the number of instances with temperature/salinity alerts, etc. A sketch of this kind of computation is shown below.
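As an illustration, the statistics mentioned above could be gathered with pandas as sketched below. The file name and the column names (temperature_alert, salinity_alert) are assumptions; the actual dataset schema may differ.

```python
# Hypothetical sketch of the statistics gathered in measurements_exploration.ipynb.
# The file name and the column names (temperature_alert, salinity_alert) are
# assumptions for illustration; the actual dataset schema may differ.
import pandas as pd

df = pd.read_csv("dataset_pandas/measurements.csv")      # hypothetical file name

total_instances = len(df)
temperature_alerts = int(df["temperature_alert"].sum())  # assumed 0/1 column
salinity_alerts = int(df["salinity_alert"].sum())        # assumed 0/1 column

print(f"Total instances: {total_instances}")
print(f"Instances with temperature alerts: {temperature_alerts}")
print(f"Instances with salinity alerts: {salinity_alerts}")
```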