Delta Time (Δt) for Anomaly Detection

June 17, 2015


It is common to monitor the number of events that occur in a period of time. Unfortunately, this technique reacts slowly and can fail to detect some anomalies. The alternative is to reframe the problem as studying the period of time between events.

Poisson Distribution Doesn’t Work

Traditional monitoring records the number of events for a period of time. There are two major problems with this approach:

  • Due to randomness, it is difficult to spot an anomaly
  • When the event rate is low, detection becomes very slow

Previously, we went into detail and solved both problems by studying the period of time between events, using the Poisson distribution to detect anomalies (if you haven’t read that article yet, please do so before continuing). To apply the Poisson distribution, we need to check that we are monitoring a Poisson process. For this to be the case, two conditions must hold.

1. Events need to occur independently

In server monitoring, events don’t always occur independently, due to the underlying architecture. For example, when a user writes to a database table, other users can’t access the same table at the same time; they must wait in a queue. When the writer finishes, the queue is processed. This is clearly a dependency between events, so the Poisson distribution can’t be applied.

2. Average rate does not change

It is very common to see seasonal patterns in monitoring. Many servers are very busy during the day and much less busy at night. Such a changing rate violates the Poisson assumption. A way to deal with this problem is to use the average rate at the same time one week ago to predict today’s rate.

Using Percentile of Δt

Poisson anomaly detection was a perfect fit for analyzing the time between events. Unfortunately, it doesn’t work in our case because our data aren’t Poisson distributed. Should we stop studying event rates and go back to event counts? Wait! Don’t give up yet, because there is a simple trick we can use. We are really only interested in extremely rare event rates, so it doesn’t matter whether the data is Poisson distributed or normally distributed. What really matters is that the distribution has a lower and/or upper tail containing only very rare values.

The main idea is to monitor the interval of time between the event at time t and the previous event at time t-1. The time difference between both events is known as Δt = t − t-1. We can also monitor the time difference over the n previous events, which is Δt = t − t-n. Using n=1 provides the fastest detection. Using n>1 allows detection of smaller rate variations, but adds some delay, since it must wait for n events before detecting an anomaly.
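As a minimal sketch of this bookkeeping (the class and method names are illustrative, not from the original code; timestamps are assumed to be in milliseconds), a queue of size n always holds the last n event times, so its oldest element is exactly t-n:

```java
import org.apache.commons.collections4.queue.CircularFifoQueue;

public class DeltaT {
    private final CircularFifoQueue<Long> lastEvents;

    public DeltaT(int n) {
        // Retains only the n most recent event timestamps.
        this.lastEvents = new CircularFifoQueue<>(n);
    }

    /**
     * Record an event and return Δt = t − t-n in milliseconds,
     * or -1 while fewer than n previous events are known.
     */
    public long onEvent(long timestampMs) {
        long deltaT = lastEvents.isAtFullCapacity()
                ? timestampMs - lastEvents.peek()  // peek() is the oldest element: t-n
                : -1;
        lastEvents.add(timestampMs);               // evicts the oldest element when full
        return deltaT;
    }
}
```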

[Figure: delta time arrival]

Download the data set, which records event timestamps in milliseconds along with the computed Δt for n=1, n=5, and n=10.

  • The n=1 histogram doesn’t show any tail. Based on the distribution, values beyond the observed min and max look impossible. But as we don’t know the true nature of the data, we will still apply min and max thresholds as a precaution.
  • Interestingly, the n=5 and n=10 histograms do show a tail. The data is neither Poisson distributed nor normally distributed, but extremely low or high values can be considered anomalies.

[Figure: Δt histogram, n=1]

[Figure: Δt histogram, n=5]

[Figure: Δt histogram, n=10]

Using the 0.001 and 0.999 percentiles (a sketch of the computation follows the list):

  • for n=5, the percentiles are min=332 and max=1496
  • for n=10, the percentiles are min=354 and max=1304
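These thresholds come from the empirical distribution of Δt over the normal part of the data. Here is a minimal nearest-rank sketch (the class is illustrative; Apache Commons Math’s Percentile class would work just as well):

```java
import java.util.Arrays;

public class Thresholds {
    /** Empirical nearest-rank percentile (0 <= p <= 1) of the Δt samples. */
    static long percentile(long[] deltas, double p) {
        long[] sorted = deltas.clone();
        Arrays.sort(sorted);
        int index = (int) Math.round(p * (sorted.length - 1));
        return sorted[index];
    }

    public static void main(String[] args) {
        // Toy values; in practice, use the Δt samples of the normal events.
        long[] deltas = {340, 500, 620, 750, 900, 1100, 1250, 1400};
        System.out.println("min=" + percentile(deltas, 0.001)
                         + " max=" + percentile(deltas, 0.999));
    }
}
```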

Please note that in this example data set, the first 1500 events are normal before an anomaly happens.

Plotting the Δt data and the thresholds for different n:

[Figure: delta time graph]

Note: the aggregated values were divided by n so that all series appear on the same graph. The logic stays unchanged.

A small anomalous rate change occurs at the end of the graph:

  • As discussed before, n=1 is very reactive, but it fails to detect the anomaly, as every value stays above its minimum limit.
  • At the end of the graph, n=5 goes under its minimum threshold. It requires more delay than n=1, but at least it detects the anomalous change in the event rate.
  • n=10 requires even more time to detect the anomaly. It isn’t useful in this example, because it is far too slow. However, it could be used to detect a change in rate that n=5 wouldn’t find.

As already explained, n=1 provides the fastest detection but fails to detect smaller unusual rate variations. The bigger n is, the more anomalies it can detect by smoothing out variations, unfortunately at the price of speed.

Including Event Rate λ

Usually, some change in rate will be noticed during monitoring. The event rate might even change a lot between day and night. Previously, we ignored such seasonality when calculating the thresholds, but as it happens, it is easy to integrate the event rate λ and normalize the result. It is as simple as multiplying the event rate λ by the events’ time difference Δt. The result is σ = λ * Δt. The value σ is just a normalized version of Δt, so we can apply the percentile technique exactly as we did before.
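For instance, with a predicted rate of λ = 0.01 events/ms and Δt = 500 ms, σ = 0.01 × 500 = 5. A minimal sketch (expressing λ in events per millisecond is an assumption, made so the units cancel against Δt):

```java
public class Normalize {
    /**
     * σ = λ * Δt. With λ in events per millisecond and Δt in milliseconds,
     * σ is dimensionless, so one set of percentile thresholds keeps working
     * even when the seasonal rate changes.
     */
    static double sigma(double lambdaPerMs, long deltaTMs) {
        return lambdaPerMs * deltaTMs;
    }
}
```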

The main difficulty is to have a correct estimate of λ. Based on the metric history, it might be possible to pick the rate from exactly one week ago; this can work perfectly in some cases. In complex cases, picking λ might be hard, but that is beyond the scope of this article.

As explained in the chart, for each incoming event t:

  1. Select the event rate λ as it was one week ago
  2. Calculate the time difference Δt between event t and event t-n
  3. Multiply Δt and λ to find σ
  4. If σ is lower than the 0.001 percentile or higher than the 0.999 percentile, the event is an anomaly!

[Figure: delta time flow detection]

Implementation

Previously, we learned how Δt anomaly detection works. Let’s now build a simple prototype in Java 8 using Joda-Time and CircularFifoQueue from Apache Commons Collections 4. CircularFifoQueue is a first-in, first-out queue that retains only the last x elements inserted: very useful for storing our n events.

As explained above, when an event occurs, we check whether the rate stays between the min and max percentiles. But how do we detect a total cessation, when no events occur at all? The trick is to check the queue every 10 ms by introducing a virtual event carrying the current date. This virtual event doesn’t get stored in the queue; it is only used to check, every 10 ms, whether the rate has gone too low. In the following implementation, this is the autoCheck.
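A minimal sketch of such a periodic probe (the class and method names are illustrative): the supplied check evaluates Δt against the current time without storing the virtual event in the queue.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AutoCheck {
    /** Runs the given virtual-event check every 10 ms until shut down. */
    static ScheduledExecutorService start(Runnable virtualEventCheck) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(virtualEventCheck, 0, 10, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```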

The Detection class takes three arguments:

  • List<Rule>: the min and max percentiles for each n
  • RateHistory: predicts the current rate λ
  • autoCheck: checks every 10 ms whether the min rate is reached

The heart of the code is the small detect function that calculates σ from Δt and λ and checks it against the thresholds.
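The original listing isn’t reproduced here, but a minimal sketch of what such a detect method could look like follows. The Rule and RateHistory shapes are assumptions, λ is expressed in events per millisecond, and a virtual event is checked but never stored:

```java
import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.joda.time.Instant;

public class Detection {
    /** Hypothetical rule: min/max σ thresholds for a given n. */
    static class Rule {
        final int n;
        final double minSigma, maxSigma;
        Rule(int n, double minSigma, double maxSigma) {
            this.n = n;
            this.minSigma = minSigma;
            this.maxSigma = maxSigma;
        }
    }

    /** Hypothetical rate predictor, e.g. the rate observed one week ago. */
    interface RateHistory {
        double lambdaPerMs(Instant at);
    }

    private final Rule rule;
    private final RateHistory rateHistory;
    private final CircularFifoQueue<Instant> lastEvents;

    Detection(Rule rule, RateHistory rateHistory) {
        this.rule = rule;
        this.rateHistory = rateHistory;
        this.lastEvents = new CircularFifoQueue<>(rule.n);
    }

    /** Checks the event at time t; virtual events are checked but not stored. */
    synchronized boolean detect(Instant t, boolean virtual) {
        boolean anomaly = false;
        if (lastEvents.isAtFullCapacity()) {
            long deltaT = t.getMillis() - lastEvents.peek().getMillis(); // Δt = t − t-n
            double sigma = rateHistory.lambdaPerMs(t) * deltaT;          // σ = λ * Δt
            anomaly = virtual
                    ? sigma > rule.maxSigma // a quiet period can only mean a too-low rate
                    : sigma < rule.minSigma || sigma > rule.maxSigma;
        }
        if (!virtual) {
            lastEvents.add(t);
        }
        return anomaly;
    }
}
```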

Let’s try this code with two pre-calculated configurations:

  1. n=5, with min and max percentile thresholds of 3650 and 4750
  2. n=10, with min and max percentile thresholds of 8100 and 9800

For this example, we created a FakeRate class. It predicts a rate of λ=10 events during daytime and λ=1 event during nighttime.
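A minimal sketch of such a predictor, building on the Detection sketch above (the day window and the events-per-second unit are assumptions; the original class may differ):

```java
import org.joda.time.Instant;

/** Hypothetical rate predictor: λ = 10 events/s by day, 1 event/s by night. */
public class FakeRate implements Detection.RateHistory {
    @Override
    public double lambdaPerMs(Instant at) {
        int hour = at.toDateTime().getHourOfDay();
        double perSecond = (hour >= 8 && hour < 20) ? 10.0 : 1.0; // assumed day window
        return perSecond / 1000.0;                                // convert to events/ms
    }
}
```

Wired together, detection for n=5 would then read: new Detection(new Detection.Rule(5, 3650, 4750), new FakeRate()).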

First anomaly:

  • n=5 detects the anomaly quite fast
  • n=10 requires 4 extra events to detect it

Second anomaly:

  • The change in rate is too small for n=5 to detect
  • n=10 detects the problem after 12 (small) anomalous events

When transitioning to “daytime,” many anomalies are detected. This is because our rate suddenly becomes 10x faster. In reality, this would be a smoother transition, so it shouldn’t trigger any alert.

Third anomaly:

  • n=5 detects the anomaly quickly
  • n=10 requires 2 extra events to detect it

In the last part, we run the anomaly detection without sending any events. Since the autoCheck is enabled (as it always should be), it detects the low rate within 10 ms.
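Tying the illustrative sketches together (all names are assumptions from the sketches above): after a short burst of real events fills the queue, the stream goes silent, and the periodic probe alone flags the silence once σ = λ * Δt climbs past the max threshold.

```java
import org.joda.time.Instant;

public class Demo {
    public static void main(String[] args) throws InterruptedException {
        Detection detection =
                new Detection(new Detection.Rule(5, 3650, 4750), new FakeRate());
        // A short burst of real events fills the queue of the last n timestamps...
        for (int i = 0; i < 5; i++) {
            detection.detect(Instant.now(), false);
            Thread.sleep(100);
        }
        // ...then the stream goes silent: only the 10 ms probe keeps checking,
        // and it flags an anomaly once σ exceeds the max threshold.
        AutoCheck.start(() -> {
            if (detection.detect(Instant.now(), true)) {
                System.out.println("anomaly: rate too low");
            }
        });
    }
}
```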

Conclusion

As discussed in detail when building anomaly detection using the Poisson distribution, focusing on the event rate detects anomalies very quickly. Under high rates, a few milliseconds can be enough to spot an anomaly.

Anomaly detection studies outliers: it doesn’t matter what the distribution is, as long as outliers can be detected. When plotting the distribution, a tail or two might become visible. The upper tail (large Δt) can be used to detect a “too low” rate change, the lower tail (small Δt) a “too high” rate change. If no tail is present, choose higher values of n. Still no tail in sight? Maybe you should consider other anomaly detection techniques…

The key to the success of this anomaly detection technique is to choose the rate predictor λ very carefully. As every calculation depends on the λ prediction, picking the wrong value will result in many false positives.

Comments

  • Xin Zhang

The dataset cannot be downloaded

  • Ted Dunning

    If you would like to know where these techniques came from, you can check out my book:

    https://mapr.com/practical-
    https://www.slideshare.net/

    Fair credit would be nice.

  • Tauvicr

Very useful article, but I cannot find the dataset mentioned, and that’s essential if we want to re-run these examples. Can you please add it?
