To detect the correlation of time series we often use auto-correlation, cross-correlation or normalized cross-correlation. Let’s study these techniques to understand them better.
Definition:
- Cross-correlation is the comparison of two different time series to detect if there is a correlation between metrics with the same maximum and minimum values. For example: “Are two audio signals in phase?”
- Normalized cross-correlation is also the comparison of two time series, but using a different scoring result. Unlike simple cross-correlation, it can compare metrics with different value ranges. For example: “Is there a correlation between the number of customers in the shop and the number of sales per day?”
- Auto-correlation is the comparison of a time series with itself at a different time. It aims, for example, to detect repeating patterns or seasonality. For example: “Is there weekly seasonality on a server website?” “Does the current week’s data highly correlate with that of the previous week?”
- Normalized auto-correlation applies the normalized cross-correlation score to auto-correlation: one metric is compared with itself at a different time.
- Time Shift can be applied to all of the above algorithms. The idea is to compare a metric to another one with various “shifts in time”. Applying a time shift to the normalized cross-correlation function will result in a “normalized cross-correlation with a time shift of X”. This can be used to answer questions such as: “When many customers come in my shop, do my sales increase 20 minutes later?”
Cross-Correlation
To measure the level of correlation between two signals we use cross-correlation. It is calculated simply by multiplying the two time series element by element and summing the products.
In the following example, graphs A and B are cross-correlated but graph C is not correlated to either.
```r
# plot the graph in R
a = c(1,2,-2,4,2,3,1,0)
b = c(2,3,-2,3,2,4,1,-1)
c = c(-2,0,4,0,1,1,0,-2)

plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
lines(c, col="#273ecc", lwd=2)
legend("topright", c("a","b","c"), col=c("#f44e2e","#27ccc0","#273ecc"), lty=c(1), lwd = 2)
```
Using the cross-correlation formula below, where N is the number of points in each series, we can calculate the level of correlation between series.
$$corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n]$$
$$\begin{align}
corr(a, b) & = 1*2+2*3+(-2)*(-2)+4*3+2*2+3*4+1*1+0*(-1) \\
& = 41
\end{align}$$
$$\begin{align}
corr(a, c) & = 1*(-2)+2*0+(-2)*4+4*0+2*1+3*1+1*0+0*(-2) \\
& =-5
\end{align}$$
Graphs A and B correlate, with a high value of 41.
Graphs A and C don’t correlate, having a low value of -5.
```r
# compute using the R language
corr_ab = sum(a*b) # equals 41
corr_ac = sum(a*c) # equals -5
```
Normalized Cross-Correlation
There are three problems with cross-correlation:
- It is difficult to understand the scoring value.
- Both metrics must have the same amplitude. If Graph B has the same shape as Graph A but values two times smaller, the correlation will not be detected.
corr(a, a/2) = 19.5
- Due to the formula, a zero value will not be taken into account, since 0*0=0 and 0*200=0.
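The amplitude and zero-value limitations can be checked numerically. The following is a minimal sketch that reuses the series `a` from the example above; the helper name `corr` and the vectors `z1` and `z2` are made up for this illustration.

```r
# Series from the example above
a <- c(1, 2, -2, 4, 2, 3, 1, 0)

# Plain cross-correlation: element-wise product, then sum
corr <- function(x, y) sum(x * y)

corr(a, a)      # a signal compared with itself
corr(a, a / 2)  # same shape, half the amplitude: the score is halved

# Hypothetical pair: wherever one series is 0, the other's value is ignored
z1 <- c(0, 0, 1, 2)
z2 <- c(0, 200, 1, 2)
corr(z1, z1) == corr(z1, z2)  # the 200 in z2 contributes nothing to the sum
```

The raw score depends on the scale of the inputs, so a value such as 19.5 cannot be judged without knowing what an identical signal would score.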
To solve these problems we use normalized cross-correlation:
$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$
Using this formula let’s compute the normalized cross-correlation of AB and AC.
$$\begin{align}
norm\_corr(a,b) &= \dfrac{1*2+2*3+(-2)*(-2)+4*3+2*2+3*4+1*1+0*(-1)}{\sqrt{(1+4+4+16+4+9+1+0)*(4+9+4+9+4+16+1+1)}} \\
& = \dfrac{41}{\sqrt{(39)*(48)}} \\
& = 0.947
\end{align}$$
$$\begin{align}
norm\_corr(a,c) & =\dfrac{1*(-2)+2*0+(-2)*4+4*0+2*1+3*1+1*0+0*(-2)}{\sqrt{(1+4+4+16+4+9+1+0)*(4+0+16+0+1+1+0+4)}} \\
& =\dfrac{-5}{\sqrt{(39)*(26)}} \\
& =-0.157
\end{align}$$
Graphs A and B correlate, with a high value of 0.947.
Graphs A and C don’t correlate, showing a low value of -0.157.
- Normalized cross-correlation scoring is easy to understand:
– The higher the value, the higher the correlation is.
– The maximum value is 1 when two signals are exactly the same:
norm_corr(a,a)=1
– The minimum value is -1 when two signals are exactly opposite:
norm_corr(a, -a) = -1
- Normalized cross-correlation can detect the correlation of two signals with different amplitudes:
norm_corr(a, a/2) = 1
Notice we have perfect correlation between signal A and the same signal with half the amplitude!
```r
# compute using the R language
norm_corr_ab = sum(a*b) / sqrt(sum(a^2)*sum(b^2)) # equals 0.947
norm_corr_ac = sum(a*c) / sqrt(sum(a^2)*sum(c^2)) # equals -0.157
```
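The scoring properties listed above (1 for identical signals, -1 for opposite signals, 1 for a rescaled copy) can be verified with a small helper. This is only a sketch: the function name `norm_corr` follows the notation used in the formulas, and the length check is an added assumption, not part of the original formula.

```r
# Series from the example above
a <- c(1, 2, -2, 4, 2, 3, 1, 0)

# Normalized cross-correlation, written as a reusable function
norm_corr <- function(x, y) {
  stopifnot(length(x) == length(y))  # assumed guard: series must align point by point
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

norm_corr(a, a)      # identical signals
norm_corr(a, -a)     # exactly opposite signals
norm_corr(a, a / 2)  # same shape, half the amplitude
```

Note that a series made only of zeros makes the denominator zero, so the score is undefined in that case.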
Auto-Correlation
Auto-correlation is very useful in many applications; a common one is detecting repeating patterns due to seasonality.
The following graph clearly shows repeating patterns every 8 data points. Indeed, looking at the R code, it’s a repeating sequence of the numbers 1 through 8 with some random noise in the mix.
```r
# compute using the R language
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)
plot(ts(ar), col="#f44e2e", lwd=2)
```
Let’s compute the auto-correlation between the signal and itself at a time shift of 4 and time shift of 8. The following graphs clearly show a high auto-correlation at time shift 8, but not at time shift 4.
```r
# plot using the R language
ar4 = ar[1:(length(ar)-4)]
ar4_shift = ar[5:length(ar)]
plot(ts(ar4), col="#f44e2e", lwd=2, xlim=c(0,78))
lines(ar4_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 4"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

ar8 = ar[1:(length(ar)-8)]
ar8_shift = ar[9:length(ar)]
plot(ts(ar8), col="#f44e2e", lwd=2, xlim=c(0,72))
lines(ar8_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 8"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))
```
To compute, we can use the same formula as cross-correlation (see above).
```r
# compute using the R language
corr_arar4 = sum(ar4*ar4_shift) # equals 1130.705
corr_arar8 = sum(ar8*ar8_shift) # equals 1456.428
```
The auto-correlation is higher for a time shift of 8 than for a time shift of 4. We have detected seasonality with a period of 8.
Normalized Auto-Correlation
We discussed earlier the advantages of normalized cross-correlation. In the same way, we can compute the normalized auto-correlation with time shifts of 4 and 8:
```r
# compute using the R language
norm_auto_arar4 = sum(ar4*ar4_shift) / sqrt(sum(ar4^2)*sum(ar4_shift^2)) # equals 0.726
norm_auto_arar8 = sum(ar8*ar8_shift) / sqrt(sum(ar8^2)*sum(ar8_shift^2)) # equals 0.981
```
Normalized auto-correlation makes it very obvious that the signal repeats in a similar manner every 8 data points.
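Rather than testing only shifts 4 and 8, the same computation can be repeated over a range of shifts. The sketch below assumes the same synthetic series and seed as above; the helper name `norm_auto` and the choice of scanning shifts 1 to 10 are illustrative assumptions.

```r
# Same synthetic series as above
set.seed(5)
ar <- rep(c(1, 2, 3, 4, 5, 6, 7, 8), 8) + rnorm(8 * 8, sd = 0.7)

# Normalized auto-correlation of x with itself shifted by k points (k >= 1)
norm_auto <- function(x, k) {
  n <- length(x)
  head_part <- x[1:(n - k)]  # original series, trimmed at the end
  tail_part <- x[(1 + k):n]  # shifted copy, trimmed at the start
  sum(head_part * tail_part) / sqrt(sum(head_part^2) * sum(tail_part^2))
}

shifts <- 1:10
scores <- sapply(shifts, function(k) norm_auto(ar, k))
round(setNames(scores, shifts), 3)  # one score per shift

shifts[which.max(scores)]  # shift with the strongest self-similarity
```

Because every value in this series is positive, all scores come out positive; it is the relative height of the peak that points to the period.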
Correlation with Time Shift
All correlation techniques can be modified by applying a time shift. For example, it is very common to perform a normalized cross-correlation with time shift to detect if a signal “lags” or “leads” another.
To process a time shift, we correlate the original signal with another one moved by x elements to the right or left, just as we did for auto-correlation.
To detect if two metrics are correlated with a time shift we need to compute all the possible time shifts. Fortunately, the R language can compute all the correlations with time shift very quickly.
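Before turning to the built-in functions, the shifting step itself can be sketched by hand. This is an illustration under stated assumptions: the function name `norm_corr_shift`, the sign convention (a positive shift k pairs x[t + k] with y[t]), and the test signals `x` and `y` are all made up here, and the score uses the uncentered formula from the normalized cross-correlation section.

```r
# Normalized cross-correlation of x[t + k] against y[t].
# Assumed convention: a positive k pairs later values of x with earlier values of y.
norm_corr_shift <- function(x, y, k) {
  stopifnot(length(x) == length(y), abs(k) < length(x))
  n <- length(x)
  if (k >= 0) {
    xs <- x[(1 + k):n]
    ys <- y[1:(n - k)]
  } else {
    xs <- x[1:(n + k)]
    ys <- y[(1 - k):n]
  }
  # Only the overlapping parts of the two series are compared
  sum(xs * ys) / sqrt(sum(xs^2) * sum(ys^2))
}

# Hypothetical signals: y is x moved one step to the left
x <- rep(0:4, 4)
y <- c(x[-1], 0)

shifts <- -4:4
scores <- sapply(shifts, function(k) norm_corr_shift(x, y, k))
round(setNames(scores, shifts), 3)  # one score per shift
```

On these noise-free signals the score reaches 1 at a shift of 1 and, because the pattern has a period of 5, also at a shift of -4.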
Normalized Cross-Correlation with Time Shift
```r
# using R language
library(stats)

# Normalized Cross-Correlation for lags from -4 to 4
a = c(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)
b = c(1,2,3,3,0,1,2,3,4,0,1,1,4,4,0,1,2,3,4,0)

# show graph
plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
legend("topright", c("a","b"), col=c("#f44e2e","#27ccc0"), lty=c(1), lwd = 2)

r = ccf(a,b, lag.max = 4)
r # show correlation values
```
| Lag | -4 | -3 | -2 | -1 | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|---|---|---|
| Correlation | 0.862 | 0.021 | -0.547 | -0.423 | 0.000 | 0.867 | 0.127 | -0.466 | -0.393 |
As we expected from the graph above, the metrics highly correlate with a time shift of 1.
Normalized Auto-Correlation with Time Shift
```r
# using R language
library(stats)

# Normalized Auto-Correlation for lags from 0 to 10
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)

# display
plot(ts(ar), col="#f44e2e", lwd=2)

r = acf(ar, lag.max = 10)
r # show correlation values
```
| Lag | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Correlation | 1.000 | 0.335 | -0.122 | -0.304 | -0.369 | -0.374 | -0.226 | 0.187 | 0.789 | 0.306 | -0.120 |
The series above repeats every 8 data points. As expected, the auto-correlation detects a high correlation when the series is compared to itself at a time shift of 8.
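The values returned by `acf()` differ from the hand-computed normalized auto-correlation (for example -0.369 versus 0.726 at a shift of 4). The reason is that `acf()` by default subtracts the mean of the series first and divides by the sum of squares of the whole centered series. The sketch below reproduces that calculation by hand; the names `acf_by_hand`, `manual` and `builtin` are illustrative only.

```r
# Same synthetic series as above
set.seed(5)
ar <- rep(c(1, 2, 3, 4, 5, 6, 7, 8), 8) + rnorm(8 * 8, sd = 0.7)

# Auto-correlation at lag k with the default acf() definition:
# center the series, then divide the lagged sum of products
# by the sum of squares of the whole centered series.
acf_by_hand <- function(x, k) {
  n <- length(x)
  d <- x - mean(x)
  sum(d[1:(n - k)] * d[(1 + k):n]) / sum(d^2)
}

lags <- 0:10
manual <- sapply(lags, function(k) acf_by_hand(ar, k))
builtin <- as.numeric(acf(ar, lag.max = 10, plot = FALSE)$acf)

all.equal(manual, builtin)  # compares the two results up to floating-point tolerance
```

Centering is what allows negative scores at shifts where the series is out of phase with itself, which makes the peak at the seasonal period stand out more clearly.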
Conclusion
Here at anomaly.io, we commonly use both cross-correlation and auto-correlation, which are building blocks for detecting unusual patterns in your data. Because auto-correlation can reveal the seasonality of a metric, we can then apply a range of anomaly detection algorithms, such as seasonal decomposition of time series or seasonally adjusting a time series. When a cross-correlation is found between two series, we can flag an anomaly when that correlation breaks.
Be sure to also read “Detecting Correlation Among Time Series”.