Data Outlier Detection using the Chebyshev Theorem - Paper review and online adaptation

This is the first paper review, in a series of paper reviews and implementations I would like to do, on online (or streaming) outlier detection algorithms. I believe with the advent of the Internet of Things it will be an important subject to research and study with all the sensory data to be produced and consumed. All the knowledge from this series will be gathered in the awesome-streaming-outlier-detection repository.

In this first post in the series I am going to present the 2005 paper: Data Outlier Detection using the Chebyshev Theorem by Brett G. Amidan, Thomas A. Ferryman, and Scott K. Cooley.

The paper uses the Chebyshev inequality in order to calculate upper and lower outlier detection limits. These thresholds give a bound to the percentage of data that fall ouside k standard deviations from the mean, while on the same time, the calculations make no assumptions about the distribution of the data. This is important as often the distribution is not known and we don't want to make any assumption over it. The only assumptions the method makes is that the data are independent measurements and that only a small percentage of the outliers are contained in the data.

With an unknown distribution, the Chebysev inequality is:

$P(|X - \mu| \leq k \sigma) \geq (1 - \frac{1}{k^2})$ (1)

and indicates that if k=2 at least 75% of the data would fall within 2 standard deviations from the mean (lower bound). The equation above can also be changed as:

$P(|X - \mu| \geq k \sigma) \leq \frac{1}{k^2}$ (2)

to indicate that at most 25% of the data is outside 2 standard deviations from the mean (upper bound).

There is also a special case when, if we assume that the data is unimodal, (data with only one peak, which can be examined for example by plotting the data) we can use the unimodal Chebyshev's inequality. But since we are talking about streaming data that arrive one after the other, I will assume that nothing is known in advance and continue with the standard case.

From the Chebyshev's inequality an Outlier Detection Value (ODV) is calculated. Any data value that is more extreme than the ODV is considered to be an outlier.

The algorithm

The algorithm follows a two stage process.

Stage 1

The first stage is responsible for trimming the data from values that are possibly outliers.

We decide on a value of $p_1$ , which can be considered as the probability of seeing an expected outlier. We can use values like 0.1, 0.05 or even 0.01.
Solving equation (2) for k we have equation (3). Anything more extreme than k standard deviations is considered a stage-1 outlier. So if $p_1=0.05$ then $k=4.472$ and thus everything more extreme than 4.472 standard deviations will be considered a stage-1 outlier.

$k=\frac{1}{\sqrt{p_1}}$ (3)

Then we calculate ODVs (upper and lower bounds) for stage-1, where $\mu$ and $\sigma$ are the sample mean and the sample standard deviation derived from the data:

$ODV_{1U} = \mu + k \sigma$ (4)

$ODV_{1L} = \mu - k \sigma$ (5)

Data that are more extreme than the ODVs of stage-1 are removed from the data for the second phase of the algorithm. The truncated dataset (i.e. without the outliers) is used to calculate the mean and standard deviation needed for the Chebyshev's inequality.

Stage 2

The second stage derives the final ODVs.

Select a value for $p_2$ , the expected probability of seeing an outlier. Usually smaller than $p_1$ , used to actually determine the outliers. Reasonable values are 0.01, 0.001, 0.0001.
Solve equation (2) for k and get equation (6).

$k=\frac{1}{\sqrt{p_2}}$ (6)

Calculate stage-2 ODVs using equations (4) and (5), where $\mu$ and $\sigma$ are the sample mean and the sample standard deviation derived from the truncated data.
All data (from the complete dataset) that are more extreme than the stage-2 ODVs are considered to be outliers.

Streaming version

Even though the algorithm is not made for streaming data, I will convert it into a streaming algorithm by calculating the running average and variance of the data using Welford's online algorithm and benchmark it using the Numenta Anomaly Benchmark (NAB).

The code for the algorithm can be found in my personal fork of the NAB GitHub repository. It achieves:

18.44 in the Standard Profile
13.18 Reward Low FP
23.21 Reward Low FN

with 100.00 being the perfect score in all three categories. Even though the score maybe low, every record is handled in constant time O(1) and there are no assumptions made whatsoever regarding the distribution of the incoming data.

Below is an example of outliers detected from a time series of AWS EC2 CPU utilization:

nab-chebyshev-ec2

With red circles we have the points labeled as outliers, while with blue crosses are the predicted ones. As one can see the algorithm is capable of identifying unique outliers rather than periods of "outlierish" behavior needed by NAB.

The code for the NAB implementation can be found below:

from nab.detectors.base import AnomalyDetector

import math

class ChebyshevDetector(AnomalyDetector):
    """ An streaming version of the algorithm found in the paper: 
    "Data Outlier Detection using the Chebyshev Theorem"
    using Welford's online algorithm to calculate mean and standard deviation
    """

    def __init__(self, *args, **kwargs):
        super(ChebyshevDetector, self).__init__(*args, **kwargs)
        self.p1 = 0.1 # Stage 1 probability 
        self.p2 = 0.001 # Stage 2 probability 
        self.k1 = 1/math.sqrt(self.p1)
        self.k2 = 1/math.sqrt(self.p2)
        self.n1 = 0
        self.m1 = 0
        self.m1_2 = 0
        self.std1 = 1
        self.n2 = 0
        self.m2 = 0
        self.m2_2 = 0
        self.std2 = 1

    def handleRecord(self, inputData):
        """Returns a tuple (anomalyScore).
        The input value is considered an outlier if it resides outside the Outlier Detection Values (upper or lower).
        The anomalyScore is calculated based on the normalized distance the input value has from the upper or lower 
        ODVs, if the input value is considered and outlier, otherwise it is 0.0.
        The probabilities p1 and p2 have been tuned a bit to give good performance on NAB.
        """
        anomalyScore = 0.0
        inputValue = inputData["value"]
        # stage 1 statistics
        self.n1 += 1
        delta = inputValue - self.m1
        self.m1 += delta/self.n1
        self.m1_2 += delta * (inputValue - self.m1)
        self.std1 = math.sqrt(self.m1_2/(self.n1-1)) if self.n1-1 > 0 else 0.000001
        odv1_high = self.m1 + self.k1 * self.std1
        odv1_low = self.m1 - self.k1 * self.std1
        if inputValue <= odv1_high and inputValue >= odv1_low:
            # Passed the first test, let's calculate the second stage statistics
            self.n2 += 1
            delta = inputValue - self.m2
            self.m2 += delta/self.n2
            self.m2_2 +=  delta * (inputValue - self.m2)
            self.std2 = math.sqrt(self.m2_2/(self.n2-1)) if self.n2-1 > 0 else 0.000001
        odv2_high = self.m2 + self.k2 * self.std2
        odv2_low = self.m2 - self.k2 * self.std2
        if inputValue > odv2_high:
            ratio = (inputValue - odv2_high)/inputValue
            anomalyScore = ratio
        elif inputValue < odv2_low:
            ratio = abs((odv2_low - inputValue)/odv2_low)
            anomalyScore = ratio
        return (anomalyScore, )

The algorithm

Stage 1

Stage 2

Streaming version

Comments

About me

AMA

Practical Machine Learning in R

Follow

kyrcha.info subdomains

Subscribe to the blog

Blog Categories

Let's work together