<h1 style="text-align: center;">Machine Learning Operator Decision: Report</h1>
*Authors:* Boshra Ariguib, Konstantin Eyhorn, Yanice Moreau
# Background and Motivation
Within the framework of the [Copernicus project](HTTPS://WWW.COPERNICUS.EU/FR/PROPOS-DE-COPERNICUS), a forecasting system has been set up to monitor the state of the ocean at the European level. Since the beginning of this project, different means of collecting In Situ observations have been put into place and have successfully collected observations. One such means are the profilers of the [ARGO network](https://argo.ucsd.edu), which offer relatively homogeneous spatial coverage and are responsible for collecting, among other things, measurements of the temperature and salinity levels of the ocean.
Since 2014, this collected data has been processed by POKAPOK scientists to improve its quality control. For this purpose, statistical methods have been set up to detect extreme values, which can be interpreted as alarms and unusual values. In a second step, a human operator adds a label "good" or "bad" to the data, where "bad" means the alarm was rightfully raised and "good" means it is a case of a false alarm.
The statistical models currently in use successfully detect most alarms; however, the proportion of false alarms is still relatively high, which leads to a large amount of data that has to be processed manually.
# Introduction
## Project objective
The purpose of this project is to reduce the amount of data that has to be processed manually. For this, we aim to develop a model that receives all the alarms raised by the statistical method as input and classifies them into true alerts and false alerts. Throughout this project, our primary objective will be to reduce the workload of the human operator.
## Current Work
A model to solve the same problem has been developed by Romaric MOYEUVRE in the framework of an internship project ([See Project Report](./additional_ressources/Projet_ODM_Rapport_de_Stage_V2.pdf)). The scope of that project covered the processing of the data and the implementation of four different machine learning models, namely a random forest, a decision tree classifier, a Convolutional Neural Network and a Multi Layer Perceptron. While the first methods showed high performance, an advantage of the Multi Layer Perceptron was that the classification was done with a binary cross-entropy loss, meaning that the output of the model is a value between 0 and 1, where 0 represents a false alert and 1 represents a good one. This provides a measure of confidence of the model, which is useful for our objective.
For this reason, we will use this model as our starting point. The main issue with the current implementation is that the model is highly unstable. Moreover, the model uses too many parameters for the scale of our dataset. Our project therefore aims to reduce the complexity of the model while making it more consistent across different setups. We will not focus on achieving the best accuracy, but we will use it as guidance to evaluate how well the model performs with different setups.
# Data
### Data Format
The dataset consists of several profiles of temperature and salinity measurements. We processed the data such that each profile is labeled with a binary value that indicates whether the profile is a true alarm (1) or not (0), as well as with the type of alarm (temperature or salinity).
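As an illustration, a labeled profile could be represented as follows. Note that the field names (`profile_id`, `temperature`, `salinity`, `alarm_type`, `label`) and the example values are hypothetical and do not necessarily match the actual files used in the project.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LabeledProfile:
    """One alarm-raising profile with its measurements and operator label (hypothetical layout)."""
    profile_id: str            # unique identifier of the profile
    temperature: List[float]   # temperature measurements along the vertical profile
    salinity: List[float]      # salinity measurements along the vertical profile
    alarm_type: str            # "temperature" or "salinity"
    label: int                 # 1 = true alarm, 0 = false alarm

# Example instance (values are made up)
example = LabeledProfile("profile_0001", [12.3, 11.8, 10.2], [35.1, 35.0, 34.8], "temperature", 1)
```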
### Feature Engineering
We apply feature engineering on the raw measurements and use the resulting features as the input to the model. The same features were used as in the previous work; a full overview of the features, as well as their mathematical definitions, can be found in the report of the previous work.
### Train and Test for the MLP
The data was split into a train set and a test set: the train set, containing 80% of the data, was used to train the model, while the test set, containing the remaining 20%, was used to evaluate it. Because we have more True Alarms than False Alarms, we balanced the training set by randomly undersampling the majority class, i.e. we discard some True Alarm samples so that both classes contribute the same number of samples, achieving a 50%-50% balance and minimizing bias in the model.
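A minimal sketch of this split-and-balance step, assuming the engineered features are in an array `X` and the binary labels in `y` (both names illustrative, not the project's actual code):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_and_balance(X, y, test_size=0.2, seed=0):
    """80/20 split, then undersample the majority class of the train set to a 50/50 balance."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    rng = np.random.default_rng(seed)
    idx_true = np.flatnonzero(y_train == 1)    # true alarms (majority class)
    idx_false = np.flatnonzero(y_train == 0)   # false alarms (minority class)
    n = min(len(idx_true), len(idx_false))     # size of the smaller class

    keep = np.concatenate([
        rng.choice(idx_true, size=n, replace=False),
        rng.choice(idx_false, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return X_train[keep], y_train[keep], X_test, y_test
```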
# MLP Performance Investigations
We built a Multi Layer Perceptron (MLP) model with the same architecture as in the previous work.
We started with the same hyperparameters as in the previous work (an illustrative sketch of this training setup follows the list below):
- Batch-Size = 32
- Learning Rate = 5e-3
- Growth Rate = 32
- Epochs = 350
- Optimizer = SGD
- Scheduler = ReduceLROnPlateau
- Loss = Binary Cross-Entropy
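Below is a minimal PyTorch sketch of this starting setup. The exact layer layout of the previous work is not reproduced here, so the hidden sizes derived from the growth rate, the dropout value and the input dimension (`n_features`) are assumptions.

```python
import torch
from torch import nn, optim

class AlarmMLP(nn.Module):
    """Generic MLP with BatchNorm and dropout; the real layer layout of the previous work may differ."""
    def __init__(self, n_features, growth_rate=32, dropout=0.2):
        super().__init__()
        h1, h2 = 2 * growth_rate, growth_rate   # hidden widths tied to the growth rate (assumption)
        self.net = nn.Sequential(
            nn.Linear(n_features, h1), nn.BatchNorm1d(h1), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h1, h2), nn.BatchNorm1d(h2), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(h2, 1), nn.Sigmoid(),      # output in [0, 1]: 0 = false alarm, 1 = good alarm
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = AlarmMLP(n_features=20)                  # 20 input features is a placeholder
criterion = nn.BCELoss()                          # binary cross-entropy on the sigmoid output
optimizer = optim.SGD(model.parameters(), lr=5e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=10)
# training loop (not shown): 350 epochs, batch size 32, scheduler stepped on the validation loss
```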
During our experiments, we realized that the model was performing equally well with a lower growth rate of 16. Because we think that the model should be as simple as possible, we decided to keep the growth rate at 16. We also decreased the learning rate to 1e-4 to keep the exploration minimal, removed the dropout method, and reduced the number of epochs to 200, since we noticed little change in the final epochs. This further decreased the complexity of our model.
## Stabilizing the Model
We observe that the model performance is highly variable depending on how the data is split between the train and test sets. Our next goal was therefore to investigate what causes this high variability and how to reduce it.
**Purely linear model:**
We tried to reduce the complexity of the model by using a purely linear model. This model consists of only one fully connected layer with a sigmoid activation function. The model was trained with the same hyperparameters as the MLP model. The model performed poorly, which was expected, because the data is not linearly separable. This shows that the model needs to have a certain complexity to be able to classify the data, but the complexity should not be too high.
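For reference, this baseline reduces to a single fully connected layer followed by a sigmoid (the input size below is a placeholder):

```python
from torch import nn

# Purely linear baseline: one fully connected layer with a sigmoid activation.
# 20 stands in for the size of the engineered feature vector.
linear_baseline = nn.Sequential(nn.Linear(20, 1), nn.Sigmoid())
```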
**Confidence Model:**
As the goal of this project is to reduce the amount of data that has to be manually processed, we wanted to have a measure of confidence of the model. Our model outputs a value between 0 and 1, where 0 represents a false alarm and 1 represents a good alarm. We wanted to check whether we can use this output as a measure of confidence, using a two-threshold system to classify the data into three classes: False Alarm, Good Alarm and Uncertain:
$$\text{False Alarm} \quad \text{if} \quad \text{output} \leq 0.3$$

$$\text{Good Alarm} \quad \text{if} \quad \text{output} \geq 0.7$$

$$\text{Uncertain} \quad \text{if} \quad 0.3 < \text{output} < 0.7$$
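A minimal sketch of this two-threshold decision rule (the function name is illustrative):

```python
def classify_with_confidence(output: float, low: float = 0.3, high: float = 0.7) -> str:
    """Map the sigmoid output of the model to one of three decisions."""
    if output <= low:
        return "false_alarm"      # confidently predicted false alarm
    if output >= high:
        return "good_alarm"       # confidently predicted true alarm
    return "uncertain"            # left for the human operator to review
```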
We could observe that indeed the model overall seems to be less confident when it is wrong. However, the model is still highly variable depending on the training and testing data.
To further understand this result, we decided to plot the distribution of the predicted values for the different elements of the Confusion Matrix. The aim was to understand why the model's confidence is low, and indeed, as we can see from the plot below, even the predicted True Positives and True Negatives are not close to the edges, unlike what we would expect. Instead, their values are spread almost uniformly from 0.5 to the respective edge. This was confirmed over multiple runs.
**Histogram of the predicted values distribution by Confusion Matrix Element:**
<div style="text-align: center;">
<img src="./results/final_results/histogram_confusion_matrix.png" alt="Confusion Matrix Histogram" width="80%"/>
</div>
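For completeness, a minimal sketch of how such a histogram can be produced from the model outputs and the true labels (array names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_output_distribution(outputs, labels, threshold=0.5):
    """Histogram of the predicted values, grouped by confusion-matrix element."""
    outputs, labels = np.asarray(outputs), np.asarray(labels)
    preds = (outputs >= threshold).astype(int)
    groups = {
        "TP": outputs[(preds == 1) & (labels == 1)],
        "TN": outputs[(preds == 0) & (labels == 0)],
        "FP": outputs[(preds == 1) & (labels == 0)],
        "FN": outputs[(preds == 0) & (labels == 1)],
    }
    for name, values in groups.items():
        plt.hist(values, bins=20, range=(0.0, 1.0), alpha=0.5, label=name)
    plt.xlabel("model output")
    plt.ylabel("count")
    plt.legend()
    plt.show()
```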
**Different loss functions (WeightedBCE, Focal Loss):**
To mitigate the problem of the unbalanced dataset, we wanted to avoid undersampling, so as not to lose too many data samples. We tried using the Weighted Binary Cross-Entropy Loss and the Focal Loss. The Weighted Binary Cross-Entropy Loss assigns different weights to the classes, so that the model is penalized more for misclassifying the minority class. The Focal Loss is a modification of the Cross-Entropy Loss that puts more focus on hard-to-classify examples. Unfortunately, we were not able to match the performance of the model trained with the previous approach. Moreover, these loss functions introduce new hyperparameters, which turned out to be difficult to tune.
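As a rough sketch of the two alternatives (the class counts, the weight and the focal-loss parameters below are placeholder values, not the ones we tuned):

```python
import torch
from torch import nn

# Weighted binary cross-entropy: scale the positive class by the ratio of negative
# to positive samples. Here this ratio is < 1 because true alarms (label 1) are the
# majority, which effectively gives relatively more weight to the minority class.
n_true, n_false = 800, 200                       # placeholder counts from the training labels
pos_weight = torch.tensor([n_false / n_true])
weighted_bce = nn.BCEWithLogitsLoss(pos_weight=pos_weight)   # expects raw logits, not sigmoid outputs

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy examples and focuses on hard ones (targets are float 0/1)."""
    bce = nn.functional.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                         # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```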
**Shapley Values:**
To gain more insight into which features the model attends to, we used Shapley values, a method that explains the output of a model by attributing it to the input features. We found that the model attends to different features depending on the training and testing data, which further confirms our observation that the model is highly variable depending on the data split.

**Analysis of Wrong Predictions:**
Another way we used to understand our model better was to track, for each sample (identified by its ID), how often it was classified wrongly over multiple runs. In the plots below we show the number of times the same sample was wrongly classified, given that it was a true alarm (False Negatives, left) or a false alarm (False Positives, right). Interestingly, the model seems to misclassify true alarms more often than false alarms, which can hint that the model is not capturing enough aspects of true alarms.
<div style="display: flex; justify-content: space-around;">
<img src="./results/bad_profiles/dataset1/frequencies/dataset_1_FN_distribution.png" alt="Bad profiles FN" style="max-width: 40%; height: auto;">
<img src="./results/bad_profiles/dataset1/frequencies/dataset_1_FP_distribution.png" alt="Bad profiles FP" style="max-width: 40%; height: auto;">
</div>
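A minimal sketch of how these per-sample error counts can be accumulated over several runs; the input structure is a hypothetical placeholder for whatever each training run returns:

```python
from collections import Counter

def count_misclassifications(runs):
    """runs: list of (predictions, labels, profile_ids) tuples, one per train/test split."""
    fn_counts, fp_counts = Counter(), Counter()   # per-profile error counts
    for preds, labels, ids in runs:
        for pred, label, pid in zip(preds, labels, ids):
            if label == 1 and pred == 0:
                fn_counts[pid] += 1               # true alarm predicted as false alarm
            elif label == 0 and pred == 1:
                fp_counts[pid] += 1               # false alarm predicted as true alarm
    return fn_counts, fp_counts
```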
## BatchNorm Layer Removal
We realized that the high variability of the model with different data splits could be resolved by removing the BatchNorm layers from the model. We still do not have a clear understanding of why this is the case, but we suspect that the BatchNorm layers are not able to normalize the data properly, probably because the distributions of the training and testing data are too different.
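As a sketch, the change amounts to dropping the `nn.BatchNorm1d` layers from the MLP while keeping everything else identical; a flag makes the two variants easy to compare (the layer sizes remain illustrative, as before):

```python
from torch import nn

def make_mlp(n_features, growth_rate=16, use_batchnorm=False):
    """Same MLP sketch as before, with the BatchNorm layers made optional."""
    h1, h2 = 2 * growth_rate, growth_rate
    def block(n_in, n_out):
        layers = [nn.Linear(n_in, n_out)]
        if use_batchnorm:
            layers.append(nn.BatchNorm1d(n_out))   # only inserted in the BatchNorm variant
        layers.append(nn.ReLU())
        return layers
    return nn.Sequential(*block(n_features, h1), *block(h1, h2), nn.Linear(h2, 1), nn.Sigmoid())
```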
**Accuracy Histogram with BatchNorm:**
<div style="text-align: center;">
<img src="./results/accuracy_histograms/v1_bn.png" alt="Accuracy with BatchNorm" width="60%"/>
</div>
**Accuracy Histogram without BatchNorm:**
<div style="text-align: center;">
<img src="./results/accuracy_histograms/v1_no_bn.png" alt="Accuracy without BatchNorm" width="60%"/>
</div>
# Results
## Accuracy Histograms
For a better overview of how well the model performs on the different datasets, we created histograms of the accuracy of the model for different setups. The histograms show the distribution of the accuracy of the model over 100 different training and testing splits. These can be found below.
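A minimal sketch of how such a histogram is produced, assuming a hypothetical helper `train_and_evaluate` that trains the model on one random split and returns its test accuracy:

```python
import matplotlib.pyplot as plt

def plot_accuracy_histogram(accuracies, title="Accuracy over 100 train/test splits"):
    """accuracies: one test accuracy per independent training run."""
    plt.hist(accuracies, bins=20)
    plt.xlabel("test accuracy")
    plt.ylabel("number of runs")
    plt.title(title)
    plt.show()

# accuracies = [train_and_evaluate(seed=s) for s in range(100)]  # hypothetical helper, one run per seed
# plot_accuracy_histogram(accuracies)
```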
This shows that the model performs best, in terms of accuracy, on the "v1" version of the dataset, using both temperature and salinity alarms with the corresponding features. However, we can also observe that the model performs best in terms of stability, i.e. lower variance, with the "v4" version of the dataset. This is most likely explained by the fact that this dataset contains the largest amount of data, which mitigates the effects of overfitting.

<div style="text-align: center;">
<img src="./results/accuracy_histograms/4_3.png" alt="Accuracy Histograms" width="70%"/>
</div>
## Performance Results
While the BatchNorm removal did not impact the overall accuracy and did not change the proportions of the confusion matrix much, it improved the evolution of the losses throughout training and the distribution of the output values per Confusion Matrix element. In the plot below, we sum up all the metrics we used to evaluate our model. These values correspond to a run of our model on the "v4" dataset with both measures; runs with different train-test splits did not show more variance.

<div style="text-align: center;">
<img src="./results/final_results/results_v1_data42_model42_batched.png" alt="Performance results" width="100%"/>
</div>
# Conclusion and future work
In this project, we aimed to reduce the manual workload of human operators by developing a machine learning model to classify ocean monitoring alarms into true and false alerts. Using a simplified Multi-Layer Perceptron (MLP) model, we focused on enhancing consistency and reducing model complexity. The removal of BatchNorm layers significantly improved model stability across different data splits. Introducing a confidence threshold system further helped to minimize incorrect predictions.
Future work will focus on the following areas to further improve the model:
1. Model Optimization: Further hyperparameter fine-tuning and exploring more advanced architectures.
2. Handling Imbalanced Data: Implementing data augmentation and revisiting advanced loss functions.
3. Explainability: Enhancing feature importance analysis and using model explainability tools.
4. Integration: Developing a real-time processing system and a user-friendly interface for operators.