PBAD: Pattern-Based Anomaly Detection
Implementation of Pattern-Based Anomaly Detection in Mixed-Type Time Series, by Vincent Vercruyssen and Len Feremans.
Paper published at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2019 (ECML-PKDD).
The present-day accessibility of technology enables easy logging of both sensor values and event logs over extended periods. In this context, detecting abnormal segments in time series data has become an important data mining task. Existing work on anomaly detection focuses either on continuous time series or discrete event logs and not on the combination. However, in many practical applications, the patterns extracted from the event log can reveal contextual and operational conditions of a device that must be taken into account when predicting anomalies in the continuous time series. This paper proposes an anomaly detection method that can handle mixed-type time series. The method leverages frequent pattern mining techniques to construct an embedding of mixed-type time series on which an isolation forest is trained. Experiments on several real-world univariate and multivariate time series, as well as a synthetic mixed-type time series, show that our anomaly detection algorithm outperforms state-of-the-art anomaly detection techniques such as MatrixProfile, Pav, Mifpod and Fpof.
An interactive tool for Time series Pattern Mining (TiPM), which includes PBAD, is available here.
Summary
PBAD takes a mixed-type time series as input. A mixed-type time series consists of multiple continuous time series together with one or more discrete event logs. PBAD computes an anomaly score for each window without the need for labels.
PBAD consists of four major steps:
- Preprocessing univariate, multivariate, and mixed-type time series.
- Mining a (non-redundant) set of itemsets and sequential patterns from each time series.
- Constructing an embedding of all time series based on distance-weighted pattern occurrences.
- Detecting anomalies using an isolation forest.
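The final step can be illustrated with scikit-learn's IsolationForest. The sketch below is not the PBAD implementation: the embedding is a random placeholder matrix (one row per window, one column per pattern), and the planted outlier row is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)

# Placeholder embedding: 200 windows x 5 patterns.
# In PBAD this matrix would come from the pattern-based embedding step.
embedding = rng.uniform(0.0, 1.0, size=(200, 5))

# Plant one window that lies far away from all the others (illustrative only).
outlier = np.full((1, 5), 10.0)
embedding = np.vstack([embedding, outlier])

forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(embedding)

# score_samples returns higher values for more normal points,
# so negate it to obtain an anomaly score (higher = more anomalous).
anomaly_scores = -forest.score_samples(embedding)

most_anomalous = int(np.argmax(anomaly_scores))
print('most anomalous window index:', most_anomalous)
```

Because the forest needs no labels, the same call works on any embedding matrix with one row per window.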
Installation
- Clone the repository.
- The code is implemented in Python, but some performance-critical code is implemented in C using Cython. Build the Cython code by running the setup.py file:
cd src/utils/cython_utils/
python setup.py build_ext --inplace
Usage
PBAD consists of methods.PreProcessor for pre-processing and methods.PBAD for predicting contextual anomalies.
Parameters for methods.PreProcessor are:
- window_size and window_incr(ement) control the creation of fixed-size sliding windows in discrete and continuous time series.
- bin_size can be used for downsampling a continuous time series using a moving average.
- alphabet_size controls the number of bins for equal-width discretisation.
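To make these three parameters concrete, here is a minimal NumPy sketch of what each one controls. The helper functions are hypothetical and are not part of methods.PreProcessor; in particular, averaging over non-overlapping bins is an assumption about how bin_size is applied.

```python
import numpy as np

def downsample(series, bin_size):
    """Average consecutive, non-overlapping bins of bin_size samples (assumed behaviour)."""
    n_bins = len(series) // bin_size
    trimmed = np.asarray(series[:n_bins * bin_size], dtype=float)
    return trimmed.reshape(n_bins, bin_size).mean(axis=1)

def discretise_equal_width(series, alphabet_size):
    """Map each value to an integer symbol in [0, alphabet_size - 1]."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    if hi == lo:
        return np.zeros(len(series), dtype=int)
    # Interior edges of alphabet_size equally wide bins between lo and hi.
    edges = np.linspace(lo, hi, alphabet_size + 1)[1:-1]
    return np.digitize(series, edges)

def sliding_windows(series, window_size, window_incr):
    """Cut a series into fixed-size windows that start every window_incr samples."""
    series = np.asarray(series)
    starts = range(0, len(series) - window_size + 1, window_incr)
    return np.array([series[s:s + window_size] for s in starts])

raw = np.arange(24, dtype=float)
smoothed = downsample(raw, bin_size=2)                       # 12 values
symbols = discretise_equal_width(smoothed, alphabet_size=4)  # symbols 0..3
windows = sliding_windows(symbols, window_size=6, window_incr=3)
print(windows)
```

With window_incr smaller than window_size, consecutive windows overlap, so each sample can contribute to more than one window.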
Parameters for methods.PBAD are:
- relative_minsup controls the number of itemsets and sequential patterns generated (default is 0.01).
- jaccard_threshold controls the filtering of redundant patterns (between 0.0 and 1.0).
- pattern_pruning is either maximal or closed (default is maximal).
- pattern_type is either itemset, sequential or all (meaning both types of patterns; default is all).
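The roles of relative_minsup and jaccard_threshold can be sketched on a toy set of windows. This is an illustration under assumptions, not the code used by methods.PBAD (which relies on SPMF for mining): the brute-force miner, the helper names, and the choice to keep the higher-support pattern when two covers are too similar are all hypothetical.

```python
from itertools import combinations

def cover(itemset, windows):
    """Indices of the windows that contain every item of the itemset."""
    return {i for i, w in enumerate(windows) if itemset <= w}

def jaccard(a, b):
    """Jaccard similarity between two sets of window indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def frequent_itemsets(windows, relative_minsup, max_size=2):
    """Brute-force miner for tiny examples: itemset -> cover."""
    items = sorted(set().union(*windows))
    result = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            c = cover(frozenset(combo), windows)
            # Keep the itemset if it occurs in a large enough fraction of windows.
            if len(c) / len(windows) >= relative_minsup:
                result[frozenset(combo)] = c
    return result

def filter_redundant(patterns, jaccard_threshold):
    """Visit patterns by decreasing support; drop those whose cover is too
    similar to the cover of an already kept pattern (assumed tie-breaking)."""
    kept = {}
    ordered = sorted(patterns.items(), key=lambda kv: (-len(kv[1]), sorted(kv[0])))
    for pattern, c in ordered:
        if all(jaccard(c, kc) < jaccard_threshold for kc in kept.values()):
            kept[pattern] = c
    return kept

windows = [frozenset(w) for w in ({1, 2, 3}, {1, 2}, {1, 2, 4}, {3, 4}, {1, 2, 3})]
frequent = frequent_itemsets(windows, relative_minsup=0.6)
kept = filter_redundant(frequent, jaccard_threshold=0.9)
print(sorted(sorted(p) for p in frequent))
print(sorted(sorted(p) for p in kept))
```

Lowering relative_minsup yields more patterns, while lowering jaccard_threshold removes more of them as redundant.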
We illustrate both classes in the following example:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from methods.PreProcessor import PreProcessor
from methods.PBAD import PBAD
# Univariate input file has three columns: timestamp, value and label.
# Label is either 0=unknown, 1=normal or -1=abnormal
# timestamp,value,label
# 2013-07-04 00:00:00,0.43,0
# 2013-07-04 01:00:00,0.48,0
input_file = './univariate/ambient_temperature/train_data.csv'
# 1. preprocess the data
univariate_data = pd.read_csv(input_file, header=0, index_col=0) #index on timestamp column
ts = {0: univariate_data.iloc[:, 0].values} #value column
labels = univariate_data.iloc[:, 1].values #label column
preprocesser = PreProcessor(window_size=12, window_incr=6, alphabet_size=30)
ts_windows_discretized, ts_windows, _, window_labels = preprocesser.preprocess(continuous_series=ts, labels=labels,
return_undiscretized=True)
# 2. run PBAD on the data
pbad = PBAD(relative_minsup=0.01, jaccard_threshold=0.9, pattern_type='all', pattern_pruning='maximal')
scores = pbad.fit_predict(ts_windows, ts_windows_discretized)
# 3. evaluation on labeled segments
filter_labels = np.where(window_labels != 0)[0]
print('AUROC =', roc_auc_score(y_true=window_labels[filter_labels], y_score=scores[filter_labels])) #AUROC = 0.997
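Once scores are available, it is often useful to map the highest-scoring windows back to positions in the original series. The helper below is a hypothetical sketch, not part of the repository. It assumes that a higher score means a more anomalous window and that window i starts at sample i * window_incr; the scores are made-up values for illustration.

```python
import numpy as np

def top_anomalous_windows(scores, k, window_size, window_incr):
    """Return (window index, start sample, end sample, score) for the k highest scores."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind='stable')[:k]
    return [(int(i), int(i) * window_incr, int(i) * window_incr + window_size, float(scores[i]))
            for i in order]

# Made-up scores for five windows, using the window settings from the example above.
example_scores = np.array([0.10, 0.15, 0.90, 0.20, 0.70])
top = top_anomalous_windows(example_scores, k=2, window_size=12, window_incr=6)
for index, start, end, score in top:
    print('window', index, 'covers samples', start, 'to', end - 1, 'score', score)
```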
More information for researchers and contributors
The current version is 0.9 and was last updated in September 2019. The main implementation is written in Python. Performance-critical code, mainly for computing the embedding based on weighted pattern occurrences, is implemented in C using Cython. For mining closed, maximal and minimal infrequent itemsets and sequential patterns we depend on the Java-based SPMF library. Python dependencies are numpy==1.16.3, pandas==0.24.2, scikit-learn==0.20.3, Cython==0.29.7 and scipy==1.2.1.
We compare performance with the following state-of-the-art methods, which we implemented:
- FPOF: Fp-outlier: Frequent pattern based outlier detection.
- PAV: Multi-scale anomaly detection algorithm based on infrequent pattern of time series.
- MIFPOD: Minimal infrequent pattern based approach for mining outliers in data streams.
- MP: Matrix profile, all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets for anomaly detection.
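As an illustration of the simplest of these baselines, the FPOF score of a transaction is the summed support of the frequent itemsets it contains, divided by the total number of frequent itemsets; low values indicate outliers. The sketch below follows that definition with a brute-force miner on toy data. It is not the implementation shipped in this repository, and the function names are hypothetical.

```python
from itertools import combinations

def mine_frequent_itemsets(transactions, minsup, max_size=3):
    """Brute-force miner for tiny examples: itemset -> relative support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            support = sum(itemset <= t for t in transactions) / n
            if support >= minsup:
                frequent[itemset] = support
    return frequent

def fpof(transaction, frequent):
    """Frequent pattern outlier factor: low values indicate outliers."""
    if not frequent:
        return 0.0
    matched = sum(s for itemset, s in frequent.items() if itemset <= transaction)
    return matched / len(frequent)

transactions = [frozenset(t) for t in
                ({'a', 'b'}, {'a', 'b'}, {'a', 'b'}, {'a', 'b'}, {'c'})]
frequent = mine_frequent_itemsets(transactions, minsup=0.5)
scores = [fpof(t, frequent) for t in transactions]
print(scores)
```

The last transaction contains none of the frequent itemsets, so it receives the lowest score and would be flagged first.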
Datasets are provided in /data:
- univariate: New York taxi, ambient temperature, and request latency. Origin is the Numenta repository.
- multivariate: Indoor physical exercises dataset captured using a Microsoft Kinect camera. Origin is AMIE: Automatic Monitoring of Indoor Exercises.
- mixed-type: Synthetic power grid dataset. See experiments.synthetic_mixed_type_data_generator for details.
To run the experiments that compare PBAD with state-of-the-art methods, run experiments.reproduce_experiments.
Contributors
- Vincent Vercruyssen, DTAI research group, University of Leuven, Belgium.
- Len Feremans, Adrem Data Lab research group, University of Antwerp, Belgium.
Licence
Copyright (c) [2019] [Len Feremans and Vincent Vercruyssen]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.