Monitoring key performance indicators (KPIs), sales, or any other product data means working within an ecosystem where metrics very often correlate with one another. When a normally stable correlation between two metrics breaks, we have reason to suspect that something strange is happening.
As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, new users are shown in green and returning users in red. Clearly, something strange happened in the middle of November. Let's use some techniques to find out more!
Step 1: Collect the Data
With Google Analytics it's fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer to stream our Google Analytics data in real time. But for the sake of simplicity, we will use a plain CSV file for this exercise.
You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.
# R language
install.packages("dplyr") # install once
library(dplyr)

# load data
csv = read.csv("/path/to/new-vs-returning-visitor.csv")

# load new visitors
new = filter(csv, Segment == "New Users")
new_data = new$Sessions
new_date = as.Date(as.character(new$Day_Index), "%m/%d/%y")
new_df = data.frame(date = new_date, value = new_data)

# load returning visitors
returning = filter(csv, Segment == "Returning Users")
return_data = returning$Sessions
return_date = as.Date(as.character(returning$Day_Index), "%m/%d/%y")
return_df = data.frame(date = return_date, value = return_data)

# plot one or both series, with roughly one axis tick per month
plot_df = function(new_df, return_df = NULL, ylim = c()) {
  plot(value ~ date, new_df, xaxt = "n", type = "l", col = "#00c7b0", ylim = ylim, lwd = 2)
  date_axis = new_df$date[1:(length(new_df$date) / 30.5) * 30.5] + 2
  axis(1, date_axis, format(date_axis, "%b"), cex.axis = .7)
  if (!is.null(return_df)) {
    lines(value ~ date, return_df, xaxt = "n", type = "l", col = "#ff0816", lwd = 2)
  }
}

plot_df(new_df, return_df)
legend("topleft", c("new visitor", "returning visitor"),
       col = c("#00c7b0", "#ff0816"), lty = c(1, 1))
Now you have two data frames in your R environment:
- new_df: new visitors
- return_df: returning visitors
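As a quick sanity check before moving on, you can inspect the loaded data frames with base R's str and head (an optional addition, not part of the original listing):

# each data frame should have one row per day: a date and a session count
str(new_df)
head(return_df)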
Step 2: Verify the Cross-Correlation
Our analysis relies on the two time series being correlated, so before going any further, let's make sure that this is the case using cross-correlation. Check this article if you want to understand the cross-correlation algorithm.
library(stats)
library(forecast)

# smooth both series with a 3-point moving average to damp outliers
ma_new_data = ma(new_data, 3)
ma_new_data = ma_new_data[!is.na(ma_new_data)]
ma_return_data = ma(return_data, 3)
ma_return_data = ma_return_data[!is.na(ma_return_data)]

# cross-correlation at lag 0
corr = ccf(ma_new_data, ma_return_data, lag.max = 0)
corr # displays 0.876
Because the moving average is robust to anomalies, we use it to smooth out potential outliers before computing the correlation. The cross-correlation function (CCF) at lag 0 returns a very high value of 0.876: the time series are clearly correlated.
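Out of curiosity, you can also compare this with the correlation of the raw, unsmoothed series using base R's cor; the smoothed value will typically be higher because the moving average damps one-off spikes (a small aside, not in the original article):

# Pearson correlation on the raw series, before smoothing
cor(new_data, return_data)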
Step 3: Subtract the Time Series
We can compute that there are, on average, about 1.736 times as many new visitors as returning visitors. Let's use this ratio to bring the two metrics to the same level.
# average ratio of new visitors to returning visitors
multi = sum(ma_new_data / ma_return_data) / length(ma_return_data)
multi # displays 1.736529

# scale the returning-visitor series up to the level of the new-visitor series
align_return_df = return_df
align_return_df$value = align_return_df$value * multi
plot_df(new_df, align_return_df)
legend("topleft", c("new visitor", "returning visitor x1.736"),
       col = c("#00c7b0", "#ff0816"), lty = c(1, 1))
Next, subtract the scaled returning-visitor series from the new-visitor series; what remains is the part of the signal the two metrics do not share:

# difference between the new visitors and the scaled returning visitors
diff_df = return_df
diff_df$value = new_df$value - return_df$value * multi
plot_df(diff_df)
Step 4: Find Outliers in Correlated Time Series
The resulting signal looks like “noise”. Plotting the histogram shows a roughly normally distributed signal. We already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.
hist(diff_df$value, breaks = 50)
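To back up the visual impression with a number, you can run a Shapiro-Wilk normality test from base R's stats package (an optional aside, not in the original article; note that the very outliers we are hunting for can drag the p-value down, so treat this as a rough check):

# Shapiro-Wilk test: a high p-value means we cannot reject normality
shapiro.test(diff_df$value)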
The 3-sigma rule states that, for a normally distributed variable, nearly all values lie within three standard deviations of the mean. Everything outside this range can be treated as anomalous.
# compute the 3-sigma lower & upper limits
lower = mean(diff_df$value, na.rm = T) - 3 * sd(diff_df$value, na.rm = T)
upper = mean(diff_df$value, na.rm = T) + 3 * sd(diff_df$value, na.rm = T)

# plot the metric & the limits
plot_df(diff_df, NULL, c(-120, 320))
abline(h = upper, col = "#e15f3f", lwd = 2)
abline(h = lower, col = "#e15f3f", lwd = 2)

# circle the points lying outside the limits
position = data.frame(id = seq(1, length(diff_df$value)), value = diff_df$value, date = diff_df$date)
anomalyH = position[position$value > upper, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < lower, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id = c(anomalyH$id, anomalyL$id),
                     value = c(anomalyH$value, anomalyL$value),
                     date = c(anomalyH$date, anomalyL$date))
anomaly = anomaly[!is.na(anomaly$value), ]
points(x = anomaly$date, y = anomaly$value, col = "#e15f3f")
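To see exactly which days were flagged, you can print the anomalies in chronological order (a small convenience addition, not in the original article):

# list the flagged days sorted by date
anomaly[order(anomaly$date), c("date", "value")]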
Conclusion
Looking at the initial retention graph, I was able to spot one or two anomalies, when in reality there were seven. I actually remember back in mid-November when I received the automatic anomaly report. I started digging into the problem and found out that someone had shared my Twitter anomaly review on Hacker News.