# Detecting Anomalies in Correlated Time Series

January 25, 2017

Monitoring key performance indicators (KPIs), sales or any other product data means working within an ecosystem where very often you will see metrics correlating with each other. When a normal correlation between two metrics is broken, we have reason to suspect something strange is happening.

As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, the new users are shown in green and the returning users in red. Clearly, something strange happens in the middle of November. Let’s use some techniques to find out more!

## Step 1: Collect the Data

With Google Analytics it’s fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer to stream our Google Analytics data. But for the sake of simplicity, we will use a plain CSV file for this exercise.

You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.

Now you have two variables in your R environment:

• new_df: new visitors
• return_df: returning visitors
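The article works in R, but the same loading step can be sketched in plain Python. This is only an illustration: the column names (`Day Index`, `Segment`, `Sessions`), the segment labels, and the sample rows are assumptions, not the actual contents of new-vs-returning-visitor.csv.

```python
import csv
import io

# Hypothetical sample shaped like a Google Analytics export.
# Column names, segment labels and values are assumptions for illustration.
SAMPLE_CSV = """Day Index,Segment,Sessions
2016-11-02,New Visitor,130
2016-11-01,New Visitor,120
2016-11-01,Returning Visitor,70
2016-11-02,Returning Visitor,75
"""


def load_segments(csv_file):
    """Split the rows into two date-ordered lists of session counts."""
    new_rows, returning_rows = [], []
    for row in csv.DictReader(csv_file):
        record = (row["Day Index"], int(row["Sessions"]))
        if row["Segment"] == "New Visitor":
            new_rows.append(record)
        elif row["Segment"] == "Returning Visitor":
            returning_rows.append(record)
    # ISO-formatted dates sort chronologically when compared as strings.
    new_rows.sort()
    returning_rows.sort()
    return [s for _, s in new_rows], [s for _, s in returning_rows]


# Names mirror the two variables described above.
new_df, return_df = load_segments(io.StringIO(SAMPLE_CSV))
```

To read a real export, pass an open file handle instead of the `io.StringIO` wrapper and adjust the column names to match your file.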

## Step 2: Verify the Cross-Correlation

Our analysis is based on the time series being correlated, so before going any further, let’s ensure that this is the case. To do so, we check the cross-correlation. See this article if you want to understand the cross-correlation algorithm.

Because a moving average is robust to anomalies, we use it to smooth out potential outliers before computing the correlation. The cross-correlation function (CCF) returns a very high value of 0.876. Clearly, the time series are correlated.
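The smoothing and correlation check can be sketched in Python as follows. The window size of 3, the helper names, and the sample values are assumptions chosen for illustration. The correlation is computed at lag 0 only, where the cross-correlation equals the Pearson correlation coefficient.

```python
from math import sqrt


def moving_average(values, window=3):
    """Centered moving average for an odd window size.

    Points at the edges that lack a full window are dropped.
    """
    half = window // 2
    return [
        sum(values[i - half:i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


def correlation_at_lag_zero(x, y):
    """Pearson correlation of two equal-length series (cross-correlation at lag 0)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up daily counts, for illustration only.
new_visitors = [100, 120, 90, 140, 110, 130, 105]
returning_visitors = [60, 70, 55, 80, 62, 75, 61]

ma_new = moving_average(new_visitors)
ma_ret = moving_average(returning_visitors)
print(round(correlation_at_lag_zero(ma_new, ma_ret), 3))
```

A value close to 1 means the two smoothed series move together, which is the precondition for the rest of the analysis.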

## Step 3: Subtract the Time Series

We can compute that there are about 1.736 times as many new visitors as returning visitors. Let’s align the metrics at the same level by rescaling, then subtract one series from the other.
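One way to obtain such a multiplier is to average the day-by-day ratio of the two series, then rescale and subtract. The sketch below does that in Python. Whether the article's figure was computed exactly this way is an assumption, and the function name and sample values are hypothetical.

```python
def scale_and_subtract(new_visitors, returning_visitors):
    """Rescale the returning-visitor series to the level of the
    new-visitor series, then return the multiplier and the difference."""
    # Average of the per-day ratios, skipping days with zero returning visitors.
    ratios = [n / r for n, r in zip(new_visitors, returning_visitors) if r != 0]
    multiplier = sum(ratios) / len(ratios)
    # Difference between the new visitors and the rescaled returning visitors.
    residual = [
        n - r * multiplier for n, r in zip(new_visitors, returning_visitors)
    ]
    return multiplier, residual


# Made-up values: the third day breaks the usual 2:1 relationship.
multiplier, residual = scale_and_subtract([20, 40, 90, 30], [10, 20, 30, 15])
print(multiplier, residual)
```

When the two series stay in their usual proportion, the residual hovers around zero; a day on which the relationship breaks shows up as a large positive or negative value.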

## Step 4: Find Outliers in Correlated Time Series

The resulting signal looks like “noise”. Plotting its histogram shows a normally distributed signal. We already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.

The 3-sigma rule states that nearly all values (about 99.7% for a normal distribution) lie within three standard deviations of the mean. Everything outside this range can be treated as anomalous.
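The rule translates directly into code. Here is a minimal Python sketch, assuming the subtracted signal is a plain list of numbers; the function name and the sample data are made up for illustration.

```python
from statistics import mean, stdev


def three_sigma_outliers(values, n_sigmas=3.0):
    """Return the indices of values lying more than n_sigmas
    sample standard deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    lower = mu - n_sigmas * sigma
    upper = mu + n_sigmas * sigma
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Made-up residual signal: flat noise-free baseline with one large spike.
signal = [0] * 20 + [100]
print(three_sigma_outliers(signal))
```

Keep in mind that the outliers themselves inflate the mean and standard deviation, so on a short series a single spike may not cross the threshold.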

## Conclusion

Looking at the initial retention graph, I was able to spot 1 or 2 anomalies when in reality I had 7. I actually remember back in mid-November when I received the automatic anomaly report. I started digging into the problem, and found out that someone had shared my Twitter anomaly review on Hacker News.


• Bob Ziti

Wouldn’t you get the same result if you just looked into the ‘new users’ time series by itself?
What additional perspective are you getting by looking at the aggregate of both time series?

• http://www.rozumim.cz/ ytus

I rewrote this in Python as an exercise and the answer to your question is: you will *not* get the same results.
My implementation is here: https://notebooks.azure.com/anon-te6iza/libraries/anomaly-io-in-python/html/anomaly.io.ipynb (see the last cell [35] and the chart below it, with only four anomalies).

• Bob Ziti

This is very interesting, thank you!

Have you ported any of the other blog posts to Python? Also, is there any book on anomaly detection you’d personally recommend?
