Understanding Cross-Correlation, Auto-Correlation, Normalization and Time Shift

To detect the correlation of time series we often use auto-correlation, cross-correlation or normalized cross-correlation. Let’s study these techniques to understand them better.

Definition:

Cross-correlation is the comparison of two different time series to detect whether they are correlated. It works for metrics with the same maximum and minimum values. For example: “Are two audio signals in phase?”
Normalized cross-correlation is also the comparison of two time series, but with a normalized score. Unlike simple cross-correlation, it can compare metrics with different value ranges. For example: “Is there a correlation between the number of customers in the shop and the number of sales per day?”
Auto-correlation is the comparison of a time series with itself at a different time. It aims, for example, to detect repeating patterns or seasonality. For example: “Is there weekly seasonality in a website’s traffic?” “Does the current week’s data highly correlate with that of the previous week?”
Normalized auto-correlation is to auto-correlation what normalized cross-correlation is to cross-correlation: it compares one metric with itself at a different time, using the normalized score.
Time Shift can be applied to all of the above algorithms. The idea is to compare a metric to another one with various “shifts in time”. Applying a time shift to the normalized cross-correlation function will result in a “normalized cross-correlation with a time shift of X”. This can be used to answer questions such as: “When many customers come into my shop, do my sales increase 20 minutes later?”

Cross-Correlation

To detect a level of correlation between two signals we use cross-correlation. It is calculated by multiplying the two time series point by point and summing the products.

In the following example, graphs A and B are cross-correlated but graph C is not correlated to either.

# plot the graph in R
a = c(1,2,-2,4,2,3,1,0)
b = c(2,3,-2,3,2,4,1,-1)
c = c(-2,0,4,0,1,1,0,-2)

plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
lines(c, col="#273ecc", lwd=2)
legend("topright", c("a","b","c"), 
       col=c("#f44e2e","#27ccc0","#273ecc"), lty=c(1), lwd = 2)

Using the cross-correlation formula below, where N is the number of data points in each series, we can calculate the level of correlation between series.

$$corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n]$$

$$\begin{align}
corr(a, b) & = 1*2+2*3+(-2)*(-2)+4*3+2*2+3*4+1*1+0*(-1) \\
& = 41
\end{align}$$

$$\begin{align}
corr(a, c) & = 1*(-2)+2*0+(-2)*4+4*0+2*1+3*1+1*0+0*(-2) \\
& = -5
\end{align}$$

Graphs A and B correlate, with a high value of 41.
Graphs A and C don’t correlate, having a low value of -5.

# compute using the R language
corr_ab = sum(a*b) # equal 41
corr_ac = sum(a*c) # equal -5

Normalized Cross-Correlation

There are three problems with cross-correlation:

– The score is hard to interpret: it has no fixed scale, so a value such as 41 means little on its own.
– Both metrics must have the same amplitude. If Graph B has the same shape as Graph A but values half as large, the score drops even though the shapes match: corr(a, a/2) = 19.5, whereas corr(a, a) = 39.
– Zero values are not taken into account: any product involving a zero is zero, so 0*0 = 0 and 0*200 = 0 contribute the same amount to the sum.
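The amplitude and zero-value problems can be checked directly in R. This is a small sketch; `raw_corr` is a hypothetical helper name introduced here for illustration, wrapping the multiply-and-sum calculation used in this article.

```r
# Hypothetical helper: raw cross-correlation (multiply point by point, then sum)
raw_corr <- function(x, y) sum(x * y)

a <- c(1, 2, -2, 4, 2, 3, 1, 0)

# Amplitude problem: same shape, half the amplitude, half the score
same_shape_full <- raw_corr(a, a)      # 39
same_shape_half <- raw_corr(a, a / 2)  # 19.5

# Zero problem: a zero contributes nothing, whatever it is paired with
zeros_vs_zeros <- raw_corr(c(0, 0, 0), c(0, 0, 0))        # 0
zeros_vs_large <- raw_corr(c(0, 0, 0), c(200, 200, 200))  # 0
```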

To solve these problems we use normalized cross-correlation:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

Using this formula let’s compute the normalized cross-correlation of AB and AC.

$$\begin{align}
norm\_corr(a,b) &= \dfrac{1*2+2*3+(-2)*(-2)+4*3+2*2+3*4+1*1+0*(-1)}{\sqrt{(1+4+4+16+4+9+1+0)*(4+9+4+9+4+16+1+1)}} \\
& = \dfrac{41}{\sqrt{(39)*(48)}} \\
& = 0.947
\end{align}$$

$$\begin{align}
norm\_corr(a,c) & = \dfrac{1*(-2)+2*0+(-2)*4+4*0+2*1+3*1+1*0+0*(-2)}{\sqrt{(1+4+4+16+4+9+1+0)*(4+0+16+0+1+1+0+4)}} \\
& = \dfrac{-5}{\sqrt{(39)*(26)}} \\
& = -0.157
\end{align}$$

Graphs A and B correlate, with a high value of 0.947.
Graphs A and C don’t correlate, showing a low value of -0.157.

Normalized cross-correlation scoring is easy to understand:
– The higher the value, the higher the correlation.
– The maximum value is 1, reached when the two signals are exactly the same: norm_corr(a, a) = 1
– The minimum value is -1, reached when the two signals are exactly opposite: norm_corr(a, -a) = -1
– Signals with different amplitudes can still correlate perfectly: norm_corr(a, a/2) = 1. Signal A and the same signal at half the amplitude get a perfect score!
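These three properties are easy to verify. The sketch below wraps the normalized formula from this section in a function; the function name `norm_corr` mirrors the notation used in the text, but defining it as a reusable R function is our own choice for illustration.

```r
# Normalized cross-correlation, following the formula in this section
norm_corr <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

a <- c(1, 2, -2, 4, 2, 3, 1, 0)

identical_signals <- norm_corr(a, a)      # 1
opposite_signals  <- norm_corr(a, -a)     # -1
half_amplitude    <- norm_corr(a, a / 2)  # 1
```

Note that a series made only of zeros makes the denominator zero, so this sketch returns NaN in that case.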

# compute using the R language
norm_corr_ab = sum(a*b) / sqrt(sum(a^2)*sum(b^2)) #equal 0.947
norm_corr_ac = sum(a*c) / sqrt(sum(a^2)*sum(c^2)) #equal -0.157

Auto-Correlation

Auto-correlation is very useful in many applications; a common one is detecting repeatable patterns due to seasonality.

The following graph clearly shows repeating patterns every 8 data points. Indeed, looking at the R code, it’s a repeatable sequence of the numbers 1 through 8 with some random noise in the mix.

# compute using the R language
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)
plot(ts(ar), col="#f44e2e", lwd=2)

Let’s compute the auto-correlation between the signal and itself at a time shift of 4 and time shift of 8. The following graphs clearly show a high auto-correlation at time shift 8, but not at time shift 4.

# plot using the R language
ar4 = ar[1:(length(ar)-4)]
ar4_shift = ar[5:length(ar)]
plot(ts(ar4), col="#f44e2e", lwd=2, xlim=c(0,78))
lines(ar4_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 4"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

ar8 = ar[1:(length(ar)-8)]
ar8_shift = ar[9:length(ar)]
plot(ts(ar8), col="#f44e2e", lwd=2, xlim=c(0,72))
lines(ar8_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 8"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

To compute, we can use the same formula as cross-correlation (see above).

# compute using the R language
corr_arar4 = sum(ar4*ar4_shift) #equals 1130.705
corr_arar8 = sum(ar8*ar8_shift) #equals 1456.428

The auto-correlation is higher at a time shift of 8 than at a time shift of 4. We have detected seasonality with a period of 8.

Normalized Auto-Correlation

We discussed earlier the advantages of normalized cross-correlation. In the same way, we can compute the normalized auto-correlation with time shifts of 4 and 8:

# compute using the R language
norm_auto_arar4 = sum(ar4*ar4_shift) / sqrt(sum(ar4^2)*sum(ar4_shift^2)) #equal 0.726
norm_auto_arar8 = sum(ar8*ar8_shift) / sqrt(sum(ar8^2)*sum(ar8_shift^2)) #equal 0.981

Normalized auto-correlation makes it very obvious that the signal repeats in a similar manner every 8 data points.
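Rather than testing two hand-picked shifts, we can scan a range of shifts and keep the best one. The sketch below is an illustration under assumptions: `norm_auto_corr` is a hypothetical helper, and the signal is a noise-free version of the series above so that the result is deterministic.

```r
# Hypothetical helper: normalized auto-correlation of x with itself shifted by k points
norm_auto_corr <- function(x, k) {
  n <- length(x)
  head_part <- x[1:(n - k)]   # the series without its last k points
  tail_part <- x[(k + 1):n]   # the series without its first k points
  sum(head_part * tail_part) / sqrt(sum(head_part^2) * sum(tail_part^2))
}

# Noise-free version of the repeating 1..8 pattern (an assumption for this sketch)
x <- rep(1:8, 8)

shifts <- 1:10
scores <- sapply(shifts, function(k) norm_auto_corr(x, k))

best_shift <- shifts[which.max(scores)]
best_shift  # 8
```

With real, noisy data the best score will be below 1, but the same scan should still point to the seasonal period.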

Correlation with Time Shift

All correlation techniques can be modified by applying a time shift. For example, it is very common to perform a normalized cross-correlation with time shift to detect if a signal “lags” or “leads” another.

To process a time shift, we correlate the original signal with another one moved by x elements to the right or left, just as we did for auto-correlation.

To detect if two metrics are correlated with a time shift we need to compute all the possible time shifts. Fortunately, the R language can compute all the correlations with time shift very quickly.
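Before relying on a built-in function, it can help to see what "computing all the possible time shifts" means. The following is a minimal sketch, not a library API: `shifted_norm_corr` is a hypothetical helper that applies the normalized formula from this article to the overlapping part of two series. We assume the convention that lag k pairs x[t+k] with y[t], and the toy series are our own.

```r
# Hypothetical helper: normalized cross-correlation of x and y at lag k,
# pairing x[t + k] with y[t] on the overlapping part of the two series
shifted_norm_corr <- function(x, y, k) {
  n <- length(x)
  if (k >= 0) {
    xs <- x[(1 + k):n]
    ys <- y[1:(n - k)]
  } else {
    xs <- x[1:(n + k)]
    ys <- y[(1 - k):n]
  }
  sum(xs * ys) / sqrt(sum(xs^2) * sum(ys^2))
}

# Toy data (an assumption for this sketch): y is x moved one step to the left
x <- rep(c(0, 1, 2, 3, 4), 4)
y <- c(x[-1], x[1])

lags   <- -3:3
scores <- sapply(lags, function(k) shifted_norm_corr(x, y, k))

best_lag <- lags[which.max(scores)]
best_lag  # 1
```

The scan stops at ±3 on purpose: the toy signal repeats every 5 points, so a lag of -4 would line the series up just as well as a lag of 1.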

Normalized Cross-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Cross-Correlation for lags from -4 to 4
a = c(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)
b = c(1,2,3,3,0,1,2,3,4,0,1,1,4,4,0,1,2,3,4,0)

#show graph
plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
legend("topright", c("a","b"), 
       col=c("#f44e2e","#27ccc0"), lty=c(1), lwd = 2)

r = ccf(a,b, lag.max = 4)
r #show correlation values

Lag	-4	-3	-2	-1	0	1	2	3	4
Correlation	0.862	0.021	-0.547	-0.423	0.000	0.867	0.127	-0.466	-0.393

As we expected from the graph above, the metrics highly correlate with a time shift of 1.
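The best lag can also be read from the `ccf` result programmatically instead of from the plot. This sketch reuses the two series from the example; `plot = FALSE` suppresses the chart. One caveat worth knowing: `ccf` subtracts the mean of each series before correlating, so its values are not identical to the plain normalized formula shown earlier in this article. At lag 0 it matches `cor(a, b)`.

```r
a <- c(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)
b <- c(1,2,3,3,0,1,2,3,4,0,1,1,4,4,0,1,2,3,4,0)

# Same call as in the example, without drawing the plot
r <- ccf(a, b, lag.max = 4, plot = FALSE)

lags   <- as.vector(r$lag)  # -4 ... 4
scores <- as.vector(r$acf)  # one correlation value per lag

best_lag <- lags[which.max(scores)]
best_lag  # 1

# The lag-0 value is the ordinary (mean-removed) correlation
lag0 <- scores[lags == 0]
```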

Normalized Auto-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Auto-Correlation for lags from 0 to 10
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)

#display
plot(ts(ar), col="#f44e2e", lwd=2)

r = acf(ar, lag.max = 10)
r # show correlation values

Lag	0	1	2	3	4	5	6	7	8	9	10
Correlation	1.000	0.335	-0.122	-0.304	-0.369	-0.374	-0.226	0.187	0.789	0.306	-0.120

The signal repeats every 8 data points. As expected, the auto-correlation is highest (apart from the trivial lag 0) when the series is compared to itself at a time shift of 8.
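To extract the period from the `acf` result in code, one simple approach is to ignore lag 0, which is always 1, and take the lag with the highest remaining value. This is a sketch under that assumption; it works here because the search range (lags 1 to 10) contains a single seasonal peak.

```r
set.seed(5)
ar <- rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)

# Same call as in the example, without drawing the plot
r <- acf(ar, lag.max = 10, plot = FALSE)

lags   <- as.vector(r$lag)  # 0 ... 10
scores <- as.vector(r$acf)

# Drop lag 0 (a series always correlates perfectly with itself)
keep   <- lags > 0
period <- lags[keep][which.max(scores[keep])]
period  # 8 for the values shown in the table above
```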

Conclusion

Here at anomaly.io, we commonly use both cross-correlation and auto-correlation, which are building blocks for detecting unusual patterns in your data. Once auto-correlation has revealed the seasonality of a metric, we can apply a range of anomaly detection algorithms such as seasonal decomposition of time series or seasonally adjusting a time series. When a cross-correlation is found, we can detect anomalies when the correlation is broken between the series.

Be sure to also read “Detecting Correlation Among Time Series”.
