Anomaly Detection Using K-Means Clustering


Monitored metrics very often exhibit regular patterns. When the values correlate with the time of day, it’s easier to spot anomalies, but it’s harder when they don’t. In some cases, it is possible to use machine learning to differentiate the usual patterns from the unusual ones.

Regular Patterns

Many metrics follow regular patterns correlated to time. Websites, for example, commonly experience high activity during the day and low activity at night. This correlation with time makes it easy to spot anomalies, but it’s more difficult when that correlation is not present.

1. With Time Correlation

When the metric is correlated to time, the key point is to find its seasonality. When today’s pattern is the same as yesterday’s, the seasonality is daily. It is also common to find a seasonality of one week, because Saturday’s patterns often don’t follow Friday’s, but rather those of the previous Saturday. Because many metrics are correlated to human activity, seasonality is often weekly. But it could also be measured in seconds, minutes, hours, months, or an arbitrary period.

Subtracting the usual pattern (the seasonality) from the overall signal creates a new “difference metric”. If the remaining variation is random, this difference metric should be pure noise with a mean of zero. Under such conditions, we can model it with a normal distribution and treat any values that are too high or too low as anomalous.
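To make this concrete, here is a minimal Python/NumPy sketch (not part of the original article) of subtracting a daily seasonality and flagging values beyond three standard deviations; the synthetic data and the 3-sigma threshold are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two weeks of hourly data: a repeating daily pattern plus noise,
# with one anomalous spike injected at index 100.
day = 10 + 5 * np.sin(2 * np.pi * np.arange(24) / 24)
signal = np.tile(day, 14) + rng.normal(0, 0.5, 24 * 14)
signal[100] += 6

# Estimate the seasonality as the "average day", then subtract it.
seasonality = signal.reshape(14, 24).mean(axis=0)
diff = signal - np.tile(seasonality, 14)

# The difference metric is roughly N(0, sigma): flag values beyond 3 sigma.
anomalies = np.where(np.abs(diff) > 3 * diff.std())[0]
print(anomalies)  # the injected spike at index 100 is flagged
```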


2. Without Time Correlation

A heartbeat has many recurring patterns. The heart beats on average every 0.8 s, but this is only an average, not a fixed period. Moreover, the period and the value of the signal might change a lot due to physical activity, stress or other effects. So it isn’t possible to just use a period of 0.8 seconds.

So what can we do? Machine learning to the rescue!

Machine Learning Detection

We can group similar patterns into categories using machine learning. Then, we subtract each new beat from its closest category. If the original beat and the category beat are very similar, the result should be pure noise with a mean of zero. As above, it’s then possible to use the normal distribution to detect anomalies. For an anomalous beat, even the closest category will still be very different, so the anomaly will be easy to detect: it will create a peak in the “difference metric”.

This requires 4 steps:

  1. Sliding Window
  2. Clustering
  3. Noise Transform
  4. Detect Anomalies

1. Sliding Window

The first step is to “cut” the normal heartbeat signal into smaller chunks. To get better results later, we are going to use a sliding window with an overlap of 50%. For each window, we will apply a Hamming function to clean the signal.
In the following heartbeat implementation, we choose a sliding window of 60 values, represented by a red rectangle, so we move the window forward by 30 values at a time. For each window, we multiply the heartbeat signal (in green) by the Hamming function (in red). The result (in green) is saved and used to create categories in the next step.
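As an illustration (a Python/NumPy sketch, not the article’s Octave code), the windowing step could look like this, with a synthetic periodic signal standing in for the heartbeat:

```python
import numpy as np

def sliding_windows(signal, width=60, overlap=0.5):
    """Cut the signal into Hamming-weighted windows with 50% overlap."""
    hop = int(width * (1 - overlap))   # move the window by 30 values
    window = np.hamming(width)         # tapers both edges toward zero
    starts = range(0, len(signal) - width + 1, hop)
    return np.array([signal[s:s + width] * window for s in starts])

# Example: a synthetic periodic signal in place of the real heartbeat.
t = np.arange(600)
signal = np.sin(2 * np.pi * t / 60)
windows = sliding_windows(signal)
print(windows.shape)  # (19, 60): 19 overlapping windows of 60 values
```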


2. Clustering

The next step is to group together similar patterns produced by the sliding window. We will use a machine learning technique known as k-means clustering, with either Matlab/Octave or Mahout. This will cluster our signal into a catalogue of 1000 categories. Some of these categories are plotted in the following schema.


3. Noise Transform

For each sliding window, we select the closest category from the catalogue of patterns created previously. Subtracting the sliding window from its category should produce a simple noisy signal. If the signal repeats and enough clusters were chosen, the whole signal can be turned into normally distributed noise with a mean of zero. Because this noise is normally distributed, we can use the normal distribution to detect anomalies: plot its histogram, or use the 99.999th percentile, to set the threshold limits to apply to the noisy signal.


4. Detect Anomalies

For each new sliding window, we repeat the above process to turn it into a noise signal. Previously we defined minimum and maximum thresholds. The noisy sliding window should stay within those limits; if not, it’s an anomaly! But that’s not all. We also defined the noise signal as being normally distributed with a mean of zero. If it suddenly changes to another distribution, something unusual or anomalous is happening.


Implementation

1. Requirements

First, you need to download and install Octave (recommended) or Matlab. A few packages will be used in this tutorial. Open an Octave or Matlab shell, then install and import these packages:

2. Sliding Window

You could use the WFDB Toolbox from physionet.org to download a heartbeat signal and convert it into Octave/Matlab format, but it’s easier to simply download the converted file “16773m.mat”.

Load the signal in Octave/Matlab, convert it into sliding windows, and apply the Hamming function:
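The original Octave snippet is not reproduced above; as a rough NumPy equivalent (with a synthetic signal standing in for the 16773m.mat download, which would normally be read with a .mat loader — an assumption for illustration):

```python
import numpy as np

# In the tutorial the ECG comes from "16773m.mat"; here a synthetic
# periodic signal with light noise stands in for it.
t = np.arange(6000)
ecg = np.sin(2 * np.pi * t / 48) + 0.1 * np.random.default_rng(1).normal(size=t.size)

width, hop = 60, 30                 # 60-value windows, 50% overlap
ham = np.hamming(width)
cleanPoints = np.array([ecg[s:s + width] * ham
                        for s in range(0, len(ecg) - width + 1, hop)])

# Keep the windows in memory for the Octave-style clustering, and also
# dump them to "heatbeat.txt" (one window per line) for Mahout.
np.savetxt("heatbeat.txt", cleanPoints)
print(cleanPoints.shape)  # (199, 60)
```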

All sliding windows are kept in memory in the “cleanPoints” variable and also written to the “heatbeat.txt” text file. The variable will be used for clustering in Octave/Matlab; the text file will be used for clustering with Mahout.

3. Clustering

Let’s now find common patterns from the signal. We will use the k-means clustering technique, which is part of the machine learning field. It works by grouping together points based on their nearest mean.


3.1. Clustering with Octave or Matlab

When the data fits into RAM, Octave or Matlab is a good choice. For a small quantity of data, such as this heartbeat signal, use Octave or Matlab.

Octave and Matlab come with a k-means implementation in the statistics package. As it’s quite slow, we will use a faster implementation called Fast K-means. Download fkmeans.m and save the file into your Octave/Matlab workspace directory. Use the “pwd” command to find your workspace path.

If you save fkmeans.m in your current workspace, you should be able to create 1000 clusters:
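The fkmeans call itself is Octave code and is not shown above; as a stand-in, this minimal hand-rolled k-means in Python/NumPy (with a toy dataset and k = 3 instead of the tutorial’s 1000 — all assumptions, so it runs in seconds) illustrates the same step:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: assign each point to its nearest mean, update means."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)].copy()
    for _ in range(iters):
        # Distance of every point to every centroid, then nearest assignment.
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Toy windows built around three recurring shapes plus a little noise.
rng = np.random.default_rng(1)
shapes = np.stack([np.sin(np.linspace(0, n * np.pi, 60)) for n in (1, 2, 3)])
windows = shapes[rng.integers(0, 3, 300)] + rng.normal(0, 0.05, (300, 60))

centroids, labels = kmeans(windows, k=3)
print(centroids.shape)  # (3, 60): one catalogue pattern per cluster
```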

It should take around 30 minutes, so let’s save the result to a file, just in case.

If you have any trouble creating this file, you can download 1000_clusters_fkmeans.mat and simply load it within Octave or Matlab:

3.2. Clustering with Mahout

When the data you want to cluster is huge and won’t fit into memory, you should choose a tool designed for this use case, such as Mahout.

First, download and install Mahout. Previously we created a file called “heatbeat.txt” containing all the sliding windows. Copy this file into a directory called “testdata”, then run the Mahout clustering from your shell:

A few hours later, the clustering result will be written into the “out_1000” directory. Now export the Mahout result into Octave or Matlab: let’s use cluster2csv.jar to convert the Mahout file into a CSV file.

Finally, let’s import 1000_clusters_mahout.csv into Octave or Matlab:
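In Python terms (a sketch only; no real Mahout export is at hand, so a small stand-in CSV is written first), importing such a CSV of centroids would look like:

```python
import numpy as np

# Stand-in for the Mahout export: 3 centroids of 60 values, one per row.
demo = np.random.default_rng(2).normal(size=(3, 60))
np.savetxt("clusters_demo.csv", demo, delimiter=",")

# Load it back the same way the tutorial loads "1000_clusters_mahout.csv".
centroids = np.loadtxt("clusters_demo.csv", delimiter=",")
print(centroids.shape)  # (3, 60)
```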

4. Noise Transform

To transform the original signal into a noise signal, we need a few steps. First, we need a function that can find the closest cluster for a given signal window:

We need to subtract the window from the closest cluster:

And finally, a function to reconstruct the “difference signal”, also called the “error signal”:

Let’s apply the above functions to a sample and draw the error signal:
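Those helper snippets are not reproduced above; a Python/NumPy sketch of the same three functions (closest cluster, per-window difference, overlap-add reconstruction), applied to a toy periodic signal, could be:

```python
import numpy as np

def closest_cluster(window, centroids):
    """Index of the catalogue pattern nearest (in Euclidean distance)."""
    return np.linalg.norm(centroids - window, axis=1).argmin()

def window_error(window, centroids):
    """Difference between a window and its closest catalogue pattern."""
    return window - centroids[closest_cluster(window, centroids)]

def error_signal(windows, centroids, hop=30):
    """Overlap-add the per-window differences into one long 'error signal'."""
    width = windows.shape[1]
    out = np.zeros(hop * (len(windows) - 1) + width)
    for i, w in enumerate(windows):
        out[i * hop:i * hop + width] += window_error(w, centroids)
    return out

# Toy setup: Hamming windows of a perfectly periodic signal; the catalogue
# is just the two shapes the overlapping windows alternate between.
t = np.arange(1200)
ham = np.hamming(60)
sig = np.sin(2 * np.pi * t / 60)
windows = np.array([sig[s:s + 60] * ham for s in range(0, len(sig) - 59, 30)])
centroids = windows[:2]

err = error_signal(windows, centroids)
print(round(float(np.abs(err).max()), 6))  # ~0: the catalogue explains the signal
```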


5. Detect Anomalies

Let’s simulate the beginning of a heart attack and detect the anomaly.


Clearly, there is a huge difference between the heart attack and its closest pattern. This plot highlights the anomaly.
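A sketch of that min/max check in Python/NumPy, with a synthetic “normal beat” and a flat-line window standing in for the heart attack (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
ham = np.hamming(60)

# "Normal" beats: one recurring shape plus light noise, Hamming-weighted.
shape = np.sin(np.linspace(0, 2 * np.pi, 60))
normal = np.array([(shape + rng.normal(0, 0.02, 60)) * ham for _ in range(200)])
centroid = normal.mean(axis=0)  # a one-pattern catalogue

# Min/max thresholds learned from the error distribution on normal data.
errors = normal - centroid
lo, hi = np.percentile(errors, [0.001, 99.999])

# Simulated "heart attack": a flat window the catalogue has never seen.
attack = np.ones(60) * ham
attack_error = attack - centroid
print(attack_error.max() > hi)  # True: the error breaks the max threshold
```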


Previously we established more rules than a simple minimum and maximum: the noisy “error signal” should be normally distributed with a mean of zero and a specific variance. Let’s simulate this kind of dysfunction.
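One simple way to test for that (a sketch; the tolerance and the variance-collapse scenario are illustrative assumptions) is to compare the mean and spread of recent errors against the learned N(0, sigma):

```python
import numpy as np

rng = np.random.default_rng(4)
sigma = 0.05  # standard deviation learned on normal data

def distribution_changed(errors, sigma, tol=0.5):
    """Flag when errors stop looking like N(0, sigma): the mean drifts
    away from zero, or the spread changes by more than the tolerance."""
    return (abs(errors.mean()) > tol * sigma
            or not (1 - tol) * sigma < errors.std() < (1 + tol) * sigma)

healthy = rng.normal(0, sigma, 500)
# Dysfunction: values stay within the min/max limits but are no longer
# normal noise, e.g. the variance suddenly collapses.
sick = rng.normal(0, sigma / 10, 500)

print(distribution_changed(healthy, sigma), distribution_changed(sick, sigma))
# → False True
```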


It doesn’t look normally distributed anymore; it’s clearly not the distribution the signal usually follows. It’s an anomaly.


Conclusion

This technique is useful only under certain conditions. You should apply it when the signal doesn’t repeat itself within a fixed period of time. If it does repeat with a fixed period, such as daily or weekly, it is preferable to simply model the seasonality to detect anomalies.

We have only scratched the surface here by treating the error signal as normally distributed. In reality, there are many more ways of detecting anomalies in a normally distributed signal.

Monitor & detect anomalies with Anomaly.io
