Anomaly Detection with the Poisson Distribution

June 12, 2015 · 5 Comments · detection, math


Instinctively, we look for anomalies by counting the number of events during a fixed period of time. However, this method detects anomalies slowly and misses some of them entirely. The Poisson distribution turns the problem upside down by looking at the period of time needed for a fixed number of events.

Focus on Time Per Event

Traditional Anomaly Detection

A natural way of thinking about anomaly detection is to look at the number of events during a fixed period of time. Once the aggregation period is over, the accumulator reveals the final count of events. Comparing this number to our model (or prediction) tells us whether the value is anomalous.

Some examples of events per period of time:

  • Number of pageviews per second on a website
  • Number of sales per minute
  • Number of doors opening per hour

The Problem

There are two major problems with the “number of events per unit time” approach to anomaly detection.

  1. The first problem is the aggregation process. Until the period ends, the count of events isn’t available, so anomaly detection can’t be performed. For some metrics, this period can be very long, and any anomaly will only be detected after this long delay. Small e-commerce websites usually monitor their sales per hour. In the worst-case scenario, a breakdown in the payment system would be detected after a delay of two hours: a failure partway through an hour may not reduce that hour’s count enough to be flagged, so detection can wait until the end of the following hour. This often just isn’t acceptable.
  2. The second problem is due to the normal fluctuation of counts. Due to randomness, a count might just happen to be lower than usual. Many other counts will be required to tell whether this low value is due to randomness or to an actual anomaly. In fact, to detect small changes, many periods need to be accumulated. This might occur on an e-commerce website when only one payment method stops working. To detect the small change in sales, hours or even days of data will be required. This is almost never acceptable.

The Solution: Poisson Distribution

The Poisson distribution looks at the period of time for a fixed number of events. Its formula gives the probability of observing a given number of events when the average rate λ is known.

Example 1: Poisson is Faster

Between 4 pm and 5 pm, a website generates 70 pageviews per minute on average (λ = 70). What is the probability of having exactly 60 pageviews in one minute (x = 60)? Using R and the Poisson distribution formula, the probability of exactly 60 pageviews in one minute is about 2.4%.
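In R this is a one-liner (a quick sketch; output rounded):

    # probability of exactly 60 pageviews in one minute, given 70 on average
    dpois(60, lambda = 70)
    # ≈ 0.0243 (about 2.4%)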

What is the probability of exactly 0 pageviews (x = 0) in 15 seconds (λ = 70/4)?
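Again in R, scaling the rate from one minute down to 15 seconds (output rounded):

    # probability of exactly 0 pageviews in 15 seconds (rate 70/4 = 17.5)
    dpois(0, lambda = 70/4)
    # ≈ 2.511e-08 (0.00000251%)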

There is only a 0.00000251% chance of this occurring. It can happen due to randomness, but because it’s so unlikely, we can consider it unusual. An anomaly!

Probability of 45 or more pageviews (x >= 45) in 15 seconds (λ = 70/4):

The ppois function sums the probabilities of 45, 46, 47 … ∞
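In R this upper tail is obtained with lower.tail = FALSE (a quick sketch; output rounded):

    # probability of 45 or more pageviews in 15 seconds
    # ppois(44, ..., lower.tail = FALSE) gives P(X > 44) = P(X >= 45)
    ppois(44, lambda = 70/4, lower.tail = FALSE)
    # ≈ 2.9e-08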

Probability of 10 or more pageviews (x >= 10) in 1 second (λ = 70/60):
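And for the one-second rule (output rounded):

    # probability of 10 or more pageviews in one second (rate 70/60)
    ppois(9, lambda = 70/60, lower.tail = FALSE)
    # ≈ 4.5e-07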

Those almost impossible probabilities define rules for anomaly detection:

  • At least 1 event every 15 seconds
  • No more than 45 events in 15 seconds
  • No more than 10 events per second

Previously, the monitoring was done per minute, so at least one full minute was needed before the anomaly detector could run. Now anomalies can be detected much faster, within 1 to 15 seconds. This works because, for a high rate λ, the Poisson distribution closely follows the normal distribution.
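This closeness is easy to check in R (a quick sketch; the two values come out very similar):

    # for large lambda, Poisson(lambda) ~ Normal(mean = lambda, sd = sqrt(lambda))
    lambda <- 70
    ppois(60, lambda)                              # exact Poisson P(X <= 60)
    pnorm(60.5, mean = lambda, sd = sqrt(lambda))  # continuity-corrected normal approximation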

Example 2: Poisson is Sensitive to Small Changes

A small e-commerce website sells 3 products per day on average (λ = 3 per day). Doing the calculations, it is unlikely to sell 0 products within 1.5 days: the rate for 1.5 days is λ = 3 × 1.5 = 4.5, so the probability of zero sales is e^(−4.5) ≈ 0.011109 (about 1.1%).
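In R (matching the 0.011109 used below):

    # probability of selling 0 products in 1.5 days, at 3 sales per day
    dpois(0, lambda = 3 * 1.5)
    # ≈ 0.011109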

Still, it is possible. Due to randomness, after 1 year of checking non-overlapping 1.5-day windows (365 / 1.5 ≈ 243 checks × 0.011109 ≈ 2.7), we might have detected about 3 anomalies that aren’t real, which we also call “false positives.” But hey, these are our sales! It is worth the extra care of checking to make sure that the payment system is still working.

The Poisson distribution is sensitive to small changes. Under low rates, it would have been hard to detect an anomaly using the normal distribution without accumulating many days of sales records. This is because, for a low rate λ, the Poisson distribution doesn’t follow the normal distribution.
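This, too, can be checked in R (a quick sketch; at λ = 3 the normal approximation is noticeably off):

    # at low lambda the normal approximation breaks down
    lambda <- 3
    ppois(0, lambda)                              # exact: ~0.0498
    pnorm(0.5, mean = lambda, sd = sqrt(lambda))  # approximation: ~0.0745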

The Poisson probability formula:

P(x) = (λ^x · e^(−λ)) / x!

  • x: the exact number of events (successes)
  • e: the constant equal to approximately 2.71828
  • λ: the mean number of events in the given time interval
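A two-line sanity check in R that the formula and the built-in dpois agree:

    lambda <- 70; x <- 60
    lambda^x * exp(-lambda) / factorial(x)  # direct formula, ≈ 0.0243
    dpois(x, lambda)                        # built-in, same value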

Implementation

In a previous example, we studied the Poisson distribution of a web server serving 70 pageviews per minute. Every 100 ms, we check the 3 rules we established for anomaly detection:

  • At least 1 event occurs every 15 seconds
  • No more than 10 events occur in 1 second
  • No more than 45 events occur in 15 seconds

Here is a simple prototype written in JavaScript:
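(A minimal sketch of such a prototype; the PoissonDetector object and its method names are illustrative assumptions, not the original code.)

    // A sliding window of event timestamps, checked against the three rules.
    function PoissonDetector() {
      this.events = [];  // timestamps (in milliseconds) of recent events
    }

    // record one event (e.g. a pageview)
    PoissonDetector.prototype.addEvent = function () {
      this.events.push(Date.now());
      // only the last 15 seconds matter for our rules, so prune older entries
      var cutoff = Date.now() - 15000;
      this.events = this.events.filter(function (t) { return t >= cutoff; });
    };

    // count the events seen during the last windowMs milliseconds
    PoissonDetector.prototype.countLast = function (windowMs) {
      var cutoff = Date.now() - windowMs;
      return this.events.filter(function (t) { return t >= cutoff; }).length;
    };

    // return the list of violated rules (empty when everything looks normal)
    // note: the first rule can fire spuriously during the first 15 seconds
    PoissonDetector.prototype.check = function () {
      var anomalies = [];
      if (this.countLast(15000) < 1)  anomalies.push('no event in 15s');
      if (this.countLast(1000) > 10)  anomalies.push('more than 10 events in 1s');
      if (this.countLast(15000) > 45) anomalies.push('more than 45 events in 15s');
      return anomalies;
    };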

Using the anomaly detection:
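(Again a sketch, wiring up the same illustrative names:)

    var detector = new PoissonDetector();

    // feed the detector from the web server: call this on every pageview
    detector.addEvent();

    // check the three rules every 100 ms
    setInterval(function () {
      var anomalies = detector.check();
      if (anomalies.length > 0) {
        console.log('ANOMALY: ' + anomalies.join(', '));
      }
    }, 100);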

This is just a proof of concept; many optimizations are needed and additional work remains to be done. For example, the model should include many more rules. With only three rules, this implementation is too permissive, so some anomalies will not be detected.

Warning!

Usually λ Changes Over Time

Server monitoring usually exhibits seasonal patterns. Very often the average rate λ isn’t the same during the day and at night, and it often differs between weekdays and weekends. With this kind of seasonality, λ can be chosen with a time shift of one week: the rate λ at “day 1, hour 1” should be about the same as at “day 8, hour 1”. Shifting by one week also tracks the natural growth of the metric, as well as slow annual patterns.
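A hedged sketch of this lookup in R, where `counts` is an assumed minute-by-minute vector of historical event counts (not a variable from the original post):

    # estimate lambda for the current minute from the same minute one week ago
    minutes_per_week <- 7 * 24 * 60
    now    <- length(counts)                  # index of the current minute
    lambda <- counts[now - minutes_per_week]  # rate observed one week earlier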

Finding λ isn’t always as easy as shifting by one week. Under extreme seasonality, a machine learning model can be trained to follow more complex patterns.

Check the Poisson Distribution

The Poisson distribution is reliable under low event rates and fast at detecting anomalies. However, it is important to make sure the metric actually follows the Poisson distribution.

To have a Poisson distribution:

  • Events need to occur independently
  • The probability that an event occurs in a given length of time must not change over time (fixed λ)

In server monitoring, this isn’t always the case, due to the underlying nature of computer programs. For example, if one process locks a database record, any other accesses to that record must wait in a queue. This makes these events dependent, and so these events don’t follow the Poisson distribution.
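One quick, informal way to check is to compare the mean and the variance of the per-interval counts; for a Poisson process they should be roughly equal. A sketch (the sample counts below are made up for illustration):

    # dispersion index: variance / mean should be close to 1 for Poisson data
    counts <- c(72, 68, 71, 69, 75, 66, 70, 73)  # example per-minute counts
    var(counts) / mean(counts)
    # values far above 1 (overdispersion) suggest the metric is not Poisson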

Count Events, Not Values

A very common mistake is to treat values as events when fitting the Poisson distribution. For example, while it is possible to apply the Poisson distribution to the number of sales, you cannot apply it to the monetary value of those sales.

  • jason (http://undefinedblog.com/)

    Thanks for the brilliant article!

    There is a small typo in the section “Example 1: Poisson is Faster”: the lambda should be 65 to get a probability of around 4.1%.

  • ravi

    Hi,

    I am currently working with a client who pays their employees every week. At that time, a huge volume of transaction files passes through the bank’s app, which creates a backlog and delays delivery, and I need to predict that. I have data at 15-minute intervals for CPU and memory utilization, page scans per second, and so on. How do I do the prediction using either Poisson or time series? I also need to do predictive maintenance so that the usual downtime is reduced. Please help me with how to address this, and let me know if you require more details.

    • martin magakian

      I’m not sure I understand your need. Do you want to contact me by email instead?

      • ravi

        Hi Martin,

        I’ll elaborate on the use case here.
        Predict CPU, memory & disk space failures during transaction processing.
        – Scenario: clients process salary payments to their customers through the bank app (this is my client) every week, and the volume spikes for the main server component and the messaging layer.
        – i.e., the client sends payment files for processing for its customers every Tuesday. The large spike in transaction volume causes backlogs and processing delays across the channels. How can this be rectified?

        Data: I have one month of data, timestamped every fifteen minutes, for CPU and memory utilization, page scans per second, virtual memory utilization, and so on.

        We need to find anomalies and also predict the volume of transactions for future periods, and our prediction should support predictive maintenance to reduce the downtime.

        In case you need any additional clarification, please get back to me.

        Please suggest some approaches/models. Thanks,
        Ravi

      • ravi

        Please share your email.
