Detecting Anomalies in Correlated Time Series

[Figure: anomaly in time series correlation]

Monitoring key performance indicators (KPIs), sales or any other product data means working within an ecosystem where very often you will see metrics correlating with each other. When a normal correlation between two metrics is broken, we have reason to suspect something strange is happening.

As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, the new users are shown in green and the returning users in red. Clearly, something strange happens in the middle of November. Let’s use some techniques to find out more!

Step 1: Collect the Data

With Google Analytics it’s fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer streaming our Google Analytics data; but for the sake of simplicity, we will use a plain CSV file for this exercise.

You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.

[Figure: Google Analytics – new vs. returning visitors]

#R language
install.packages("dplyr")
library(dplyr)

#load data
csv = read.csv("/path/to/new-vs-returning-visitor.csv")

#load new visitor
new = filter(csv,Segment=="New Users")
new_data = new[4]$Sessions
new_date = as.Date(lapply(new[1], as.character)$Day_Index, "%m/%d/%y")
new_df = data.frame(date=new_date, value=new_data)

#load returning visitor
return = filter(csv,Segment=="Returning Users")
return_data = return[4]$Sessions
return_date = as.Date(lapply(return[1], as.character)$Day_Index, "%m/%d/%y")
return_df = data.frame(date=return_date, value=return_data)

#plot
plot_df = function(new_df, return_df = NULL, ylim=c()) {
	plot(value ~ date, new_df, xaxt="n", type="l", col="#00c7b0", ylim=ylim, lwd=2)
	date_axis = new_df$date[1:(length(new_df$date)/30.5)*30.5]+2
	axis(1, date_axis, format(date_axis, "%b"), cex.axis = .7)
	if(!is.null(return_df)) {
		lines(value ~ date, return_df, xaxt="n", type="l", col="#ff0816", lwd=2)
	}
}
plot_df(new_df, return_df)
legend("topleft", c("new visitor","returning visitor"),
      col=c("#00c7b0","#ff0816"), lty=c(1,1))

[Figure: the two visitor time series plotted in R]

Now you have two variables in your R environment:

  • new_df: new visitors
  • return_df: returning visitors

Step 2: Verify the Cross-Correlation

Our analysis relies on the two time series being correlated, so before going any further, let’s make sure that this is the case by checking their cross-correlation. See the article “Understanding Cross-Correlation, Auto-Correlation, Normalization and Time Shift” below if you want the details of the algorithm.

BONUS: Find correlated metrics in multiple time series

library(stats)
library(forecast)
ma_new_data = ma(new_data, 3)
ma_new_data = ma_new_data[!is.na(ma_new_data)]
ma_return_data = ma(return_data, 3)
ma_return_data = ma_return_data[!is.na(ma_return_data)]

corr = ccf(ma_new_data, ma_return_data, lag.max = 0)
corr # display 0.876

Smoothing with a moving average dampens noise and potential outliers before the correlation is computed. The cross-correlation function (CCF) returns a very high value of 0.876: clearly, the two time series are correlated.
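At lag 0 the CCF is simply the ordinary Pearson correlation between the two series, so the same figure can be double-checked with cor() — a quick sanity check, nothing more:

# at lag 0, the cross-correlation equals the Pearson correlation
cor(ma_new_data, ma_return_data) # should match the 0.876 above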

Step 3: Subtract the Time Series

On average there are about 1.736 times more new visitors than returning visitors. Let’s bring the two metrics to the same level.

multi = sum(ma_new_data/ma_return_data)/length(ma_return_data)
multi #display 1.736529
align_return_df = return_df
align_return_df$value = align_return_df$value*multi
plot_df(new_df, align_return_df)
legend("topleft", c("new visitor","returning visitor x1.736"), 
     col=c("#00c7b0","#ff0816"), lty=c(1,1))

[Figure: returning visitors scaled to the level of new visitors]

substract = return_df
substract$value = new_df$value - return_df$value*multi
plot_df(substract)

[Figure: the subtracted signal]

Step 4: Find Outliers in Correlated Time Series

The resulting signal looks like “noise”. Plotting its histogram shows a normally distributed signal, and we already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.

hist(substract$value, breaks = 50)

[Figure: histogram showing a normal distribution]

The 3-sigma rule states that nearly all values lie within three standard deviations of the mean. Everything outside this range can be treated as anomalous.

#compute 3 sigma minimal & maximal
min = mean(substract$value, na.rm = T) - 3*sd(substract$value, na.rm = T)
max = mean(substract$value, na.rm = T) + 3*sd(substract$value, na.rm = T)

# plot metric & limit
plot_df(substract, NULL, c(-120,320))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

# circle the anomalies
position = data.frame(id=seq(1, length(substract$value)), value=substract$value, date=substract$date)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id), value=c(anomalyH$value, anomalyL$value), date=c(anomalyH$date, anomalyL$date))
anomaly = anomaly[!is.na(anomaly$value), ]
points(x = anomaly$date, y=anomaly$value, col="#e15f3f")

[Figure: anomalies detected in the correlated time series]

Conclusion

Looking at the initial retention graph, I was able to spot 1 or 2 anomalies, when in reality there were 7. I remember back in mid-November when I received the automatic anomaly report: I started digging into the problem and found out that someone had shared my Twitter anomaly review on Hacker News.

Detecting Correlation Among Multiple Time Series

To determine the level of correlation between various metrics we often use the normalized cross-correlation formula.

Definition: Normalized Cross-Correlation

Normalized cross-correlation is calculated using the formula:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

We recommend first understanding normalized cross-correlation before using it, but any statistical language, such as R, can easily compute it for you.
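For reference, here is a minimal sketch of the formula written directly in R (illustrative only). Note that R’s built-in ccf() removes the mean of each series before correlating, so its output can differ slightly from this raw version:

# direct implementation of the normalized cross-correlation formula
norm_corr = function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
norm_corr(c(1, 2, 3), c(2, 4, 6)) # 1: identical shape, different amplitude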

Correlations between 2 metrics

In the following graph, the two metrics show some correlation between each other.

[Figure: two correlated metrics]

# plot the graph in R
set.seed(15)
a = c(1,2,-2,4,2,3,1,0,3,4,2,3,1)
b = a + rnorm(length(a), sd = 0.4)
plot(ts(b), col="#f44e2e", lwd=3)
lines(a, col="#27ccc0", lwd=3)

Using R to compute the normalized cross-correlation is as easy as calling the ccf( ) function (cross-correlation function). By default, ccf( ) plots the correlation between the two metrics at different time shifts. Time shifting simply moves one of the compared metrics forward or backward in time; this is useful in detecting whether a metric precedes or succeeds another.

# compute using the R language
corr = ccf(a,b)
corr

[Figure: correlation level at each lag]

lag:  -8     -7     -6     -5     -4     -3     -2     -1      0      1      2      3      4      5      6      7      8
ccf:  -0.011 -0.146  0.061  0.266 -0.025 -0.246  0.018 -0.293  0.979 -0.291  0.098 -0.320 -0.043  0.288  0.037 -0.183  0.083

The last R command displays the correlation between the metrics at various time shift values. As expected, the metrics are highly correlated at time shift 0 (no time shift) with a value of 0.979.
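If only the lag-0 value is needed, it can be read directly from the ccf object (a small convenience sketch):

# extract the correlation at lag 0 from the ccf result
corr$acf[corr$lag == 0] # 0.979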

Cluster Correlated Metrics Together

We can also use the ccf( ) function to cluster metrics together based on how similar they are. To demonstrate this, we will cluster metrics from a real data set of 45 graphs called “graphs45.csv”.

First, we need to compute the correlation level between every possible pair of graphs. This is what the “correlationTable” function does. (To reproduce this example you must download the data set graphs45.csv.)

correlationTable = function(graphs) {
  cross = matrix(nrow = length(graphs), ncol = length(graphs))
  for(graph1Id in 1:length(graphs)){
    graph1 = graphs[[graph1Id]]
    print(graph1Id)
    for(graph2Id in 1:length(graphs)) {
      graph2 = graphs[[graph2Id]]
      if(graph1Id == graph2Id){
        break;
      } else {
        correlation = ccf(graph1, graph2, lag.max = 0)
        cross[graph1Id, graph2Id] = correlation$acf[1]
      }
    }
  }
  cross
}

graphs = read.csv("graphs45.csv")
corr = correlationTable(graphs)

It took around 20 seconds to compute the correlation of every pair of graphs. The array corr now contains the correlation table; for example, corr[4,3] gives a correlation level of 0.990 between graph4 and graph3, indicating a strong correlation. To find metrics with sufficiently high correlation, we choose a minimum correlation level of 0.90.

Let’s find and plot all the metrics that strongly correlate with graph4:

findCorrelated = function(orig, highCorr){
 match = highCorr[highCorr[,1] == orig | highCorr[,2] == orig,]
 match = as.vector(match)
 match[match != orig]
}

highCorr = which(corr > 0.90 , arr.ind = TRUE)
match = findCorrelated(4, highCorr)
match # print 6 12 23 42 44 45  3

Success! Graph4 highly correlates with graphs 6, 12, 23, 42, 44, 45 and 3.

Let’s now plot all the graphs together:

[Figure: graphs correlated with graph4, plotted together]

bound = function(graphs, orign, match) {
  graphOrign = graphs[[orign]]
  graphMatch = graphs[match]
  allValue = c(graphOrign)
  for(m in graphMatch){
    allValue = c(allValue, m)
  }
  c(min(allValue), max(allValue))
}

plotSimilar = function(graphs, orign, match){
  lim = bound(graphs, orign, match)

  graphOrign = graphs[[orign]]
  plot(ts(graphOrign), ylim=lim, xlim=c(1,length(graphOrign)+25), lwd=3)
  title(paste("Similar to", orign, "(black bold)"))

  cols = c()
  names = c()
  for(i in 1:length(match)) {
    m = match[[i]]
    matchGraph = graphs[[m]]
    lines(x = 1:length(matchGraph), y=matchGraph, col=i)

    cols = c(cols, i)
    names = c(names, paste0(m))
  }
  legend("topright", names, col = cols, lty=c(1,1))
}

plotSimilar(graphs, 4, match)

Conclusion

Here at anomaly.io, finding cross-correlation is one of the first steps in detecting unusual patterns in your data. Subtracting two correlated metrics should result in an almost flat signal. If suddenly the flat signal (or the gap between the curves) hits a certain level, you can trigger an anomaly. Of course this is an oversimplification and the reality is much more complex, but it’s a good foundation to work from.
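As a minimal sketch of that idea (with hypothetical series x and y; a real implementation is more involved), the gap between two correlated metrics can be monitored against a 3-sigma band:

# sketch: alert when the gap between two correlated series leaves a 3-sigma band
set.seed(1)
x = 10 + sin(1:200 / 10) + rnorm(200, sd = 0.1)  # hypothetical metric
y = x / 2 + rnorm(200, sd = 0.05)                # a correlated, scaled-down metric
x[150] = x[150] + 3                              # break the correlation once
gap = x - y * mean(x / y)                        # align the scales and subtract
limit = 3 * sd(gap)
which(abs(gap - mean(gap)) > limit)              # index 150 should stand out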

Understanding Cross-Correlation, Auto-Correlation, Normalization and Time Shift

To detect the correlation of time series we often use auto-correlation, cross-correlation or normalized cross-correlation. Let’s study these techniques to understand them better.

Definitions:

  • Cross-correlation is the comparison of two different time series to detect whether they are correlated. It assumes both metrics share roughly the same range of values (the same maximum and minimum). For example: “Are two audio signals in phase?”
  • Normalized cross-correlation is also a comparison of two time series, but with a normalized score. Unlike simple cross-correlation, it can compare metrics with different value ranges. For example: “Is there a correlation between the number of customers in the shop and the number of sales per day?”
  • Auto-correlation is the comparison of a time series with itself at a different time. It aims, for example, to detect repeating patterns or seasonality. For example: “Is there weekly seasonality in a website’s traffic?” “Does the current week’s data highly correlate with that of the previous week?”
  • Normalized auto-correlation is the same as normalized cross-correlation, but for auto-correlation, thus comparing one metric with itself at a different time.
  • Time Shift can be applied to all of the above algorithms. The idea is to compare a metric to another one with various “shifts in time”. Applying a time shift to the normalized cross-correlation function will result in a “normalized cross-correlation with a time shift of X”. This can be used to answer questions such as: “When many customers come in my shop, do my sales increase 20 minutes later?”

Cross-Correlation

To detect the level of correlation between two signals we use cross-correlation. It is calculated simply by multiplying the two time series together and summing the result.

In the following example, graphs A and B are cross-correlated but graph C is not correlated to either.

[Figure: graphs A, B and C]

# plot the graph in R
a = c(1,2,-2,4,2,3,1,0)
b = c(2,3,-2,3,2,4,1,-1)
c = c(-2,0,4,0,1,1,0,-2)

plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
lines(c, col="#273ecc", lwd=2)
legend("topright", c("a","b","c"), 
       col=c("#f44e2e","#27ccc0","#273ecc"), lty=c(1), lwd = 2)

Using the cross-correlation formula below, we can calculate the level of correlation between the series.

$$corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n]$$

$$\begin{align}
corr(a, b) & = 1*2+2*3+-2*-2+4*3+2*2+3*4+1*1+0*-1 \\
& = 41
\end{align}$$

$$\begin{align}
corr(a, c) & =1*-2+2*0+-2*4+4*0+2*1+3*1+1*0+0*-2 \\
& =-5
\end{align}$$

Graphs A and B correlate, with a high value of 41.
Graphs A and C don’t correlate, having a low value of -5.

# compute using the R language
corr_ab = sum(a*b) # equal 41
corr_ac = sum(a*c) # equal -5

Normalized Cross-Correlation

There are three problems with cross-correlation:

  1. It is difficult to interpret the scoring value.
  2. Both metrics must have the same amplitude. If Graph B has the same shape as Graph A but values two times smaller, the correlation score drops sharply (see the check below):
    corr(a, a/2) = 19.5
  3. Due to the formula, a zero value contributes nothing, since 0*0=0 and 0*200=0.
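A quick check of problem 2 in R (a is the same series used in the examples below):

# the same shape at half the amplitude scores much lower
a = c(1,2,-2,4,2,3,1,0)
sum(a * a)       # 39   - a compared with itself
sum(a * (a / 2)) # 19.5 - a compared with a half-sized copy of itself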

To solve these problems we use normalized cross-correlation:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

Using this formula let’s compute the normalized cross-correlation of AB and AC.

$$\begin{align}
norm\_corr(a,b) &= \dfrac{1*2+2*3+-2*-2+4*3+2*2+3*4+1*1+0*-1}{\sqrt{(1+4+4+16+4+9+1+0)*(4+9+4+9+4+16+1+1)}} \\
& = \dfrac{41}{\sqrt{(39)*(48)}} \\
& = 0.947
\end{align}$$

$$\begin{align}
norm\_corr(a,c) & =\dfrac{1*-2+2*0+-2*4+4*0+2*1+3*1+1*0+0*-2}{\sqrt{(1+4+4+16+4+9+1+0)*(4+0+16+0+1+1+0+4)}} \\
& =\dfrac{-5}{\sqrt{(39)*(26)}} \\
& =-0.157
\end{align}$$

Graphs A and B correlate, with a high value of 0.947.
Graphs A and C don’t correlate, showing a low value of -0.157.

  • Normalized cross-correlation scoring is easy to understand:
    – The higher the value, the higher the correlation is.
    – The maximum value is 1 when two signals are exactly the same:
    norm_corr(a,a)=1
    – The minimum value is -1 when two signals are exactly opposite:
    norm_corr(a, -a) = -1
  • Normalized cross-correlation can detect the correlation of two signals with different amplitudes: norm_corr(a, a/2) = 1.
    Notice we have perfect correlation between signal A and the same signal at half the amplitude!

# compute using the R language
norm_corr_ab = sum(a*b) / sqrt(sum(a^2)*sum(b^2)) #equal 0.947
norm_corr_ac = sum(a*c) / sqrt(sum(a^2)*sum(c^2)) #equal -0.157
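The amplitude claim can be checked the same way:

# amplitude no longer matters: a and a/2 correlate perfectly
norm_corr_a_half = sum(a*(a/2)) / sqrt(sum(a^2)*sum((a/2)^2)) #equal 1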

Auto-Correlation

Auto-correlation is very useful in many applications; a common one is detecting repeatable patterns due to seasonality.

The following graph clearly shows repeating patterns every 8 data points. Indeed, looking at the R code, it’s a repeatable sequence of the numbers 1 through 8 with some random noise in the mix.

[Figure: signal repeating every 8 data points]

# compute using the R language
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)
plot(ts(ar), col="#f44e2e", lwd=2)

Let’s compute the auto-correlation between the signal and itself at a time shift of 4 and time shift of 8. The following graphs clearly show a high auto-correlation at time shift 8, but not at time shift 4.

[Figures: the signal against itself at shift 4 (no auto-correlation) and at shift 8 (auto-correlation)]

# plot using the R language
ar4 = ar[1:(length(ar)-4)]
ar4_shift = ar[5:length(ar)]
plot(ts(ar4), col="#f44e2e", lwd=2, xlim=c(0,78))
lines(ar4_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 4"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

ar8 = ar[1:(length(ar)-8)]
ar8_shift = ar[9:length(ar)]
plot(ts(ar8), col="#f44e2e", lwd=2, xlim=c(0,72))
lines(ar8_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 8"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

To compute, we can use the same formula as cross-correlation (see above).

# compute using the R language
corr_arar4 = sum(ar4*ar4_shift) #equals 1130.705
corr_arar8 = sum(ar8*ar8_shift) #equals 1456.428

For a time shift of 8 the auto-correlation is higher than a time shift of 4. We have detected seasonality with a period of 8.

Normalized Auto-Correlation

We discussed earlier the advantages of normalized cross-correlation. In the same way, we can compute the normalized auto-correlation with time shifts of 4 and 8:

# compute using the R language
norm_auto_arar4 = sum(ar4*ar4_shift) / sqrt(sum(ar4^2)*sum(ar4_shift^2)) #equal 0.726
norm_auto_arar8 = sum(ar8*ar8_shift) / sqrt(sum(ar8^2)*sum(ar8_shift^2)) #equal 0.981

Normalized auto-correlation makes it very obvious that the signal repeats in a similar manner every 8 data points.

Correlation with Time Shift

All correlation techniques can be modified by applying a time shift. For example, it is very common to perform a normalized cross-correlation with time shift to detect if a signal “lags” or “leads” another.

To process a time shift, we correlate the original signal with a copy moved by x elements to the right or left, just as we did for auto-correlation.

To detect if two metrics are correlated with a time shift we need to compute all the possible time shifts. Fortunately, the R language can compute all the correlations with time shift very quickly.

Normalized Cross-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Cross-Correlation for lags from -4 to 4
a = c(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)
b = c(1,2,3,3,0,1,2,3,4,0,1,1,4,4,0,1,2,3,4,0)

#show graph
plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
legend("topright", c("a","b"), 
       col=c("#f44e2e","#27ccc0"), lty=c(1), lwd = 2)

r = ccf(a,b, lag.max = 4)
r #show correlation values

[Figure: the two series a and b]

[Figure: cross-correlation by lag]

lag:  -4     -3     -2     -1      0      1      2      3      4
ccf:   0.862  0.021 -0.547 -0.423  0.000  0.867  0.127 -0.466 -0.393

As we expected from the graph above, the metrics highly correlate with a time shift of 1.

Normalized Auto-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Auto-Correlation for lags from -10 to 10
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)

#display
plot(ts(ar), col="#f44e2e", lwd=2)

r = acf(ar, lag.max = 10)
r # show correlation values

[Figure: the repeating signal]

[Figure: auto-correlation at each time shift]

lag:   0      1      2      3      4      5      6      7      8      9     10
acf:   1.000  0.335 -0.122 -0.304 -0.369 -0.374 -0.226  0.187  0.789  0.306 -0.120

The underlying series repeats every 8 data points. As expected, the auto-correlation is high (0.789) when the series is compared to itself at a time shift of 8.

Conclusion

Here at anomaly.io, we commonly use both cross-correlation and auto-correlation as building blocks for detecting unusual patterns in your data. Because auto-correlation can detect the seasonality of a metric, we can then apply a range of anomaly detection algorithms, such as seasonal decomposition of time series or seasonally adjusting a time series. When a cross-correlation is found, we can detect anomalies whenever the correlation between the series is broken.

Be sure to also read “Detecting Correlation Among Multiple Time Series”.

Change Point Detection with Seasonal Time Series

Previously, we looked at using Twitter Breakout (EDM) to detect anomalies. Like the popular E-Divisive, EDM detects mean shifts and changes in distribution. Both algorithms work with seasonal time series, but perform even better without seasonality.

Change Point Doesn’t Work (Well) with Seasonality

To demonstrate the “weakness” of change point detection, let’s generate a fake seasonal time series. Then we will try to detect the change points using two different algorithms: EDM and E-Divisive.

# install E-Divisive and EDM
install.packages("Rcpp")
install.packages("ecp")
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(Rcpp)
library(ecp)
library(BreakoutDetection)

createDay = function(noise=0) {
 point=seq(0, pi, by=0.02)
 connection=sin(point)
 noise = rnorm(length(point), sd = noise)
 return(connection+noise)
}

createDays = function(totalDays, noise=0) {
 allDays = c()
 for (day in 1:totalDays ) {
 allDays = c(allDays, createDay(noise))
 }
 return(allDays)
}

set.seed(1234)
p1 = createDays(3, 0.2)
anomaly = createDays(1, 0)*2 + rnorm(158, sd = 0.07)
days = c(p1, anomaly)
plot(as.ts(days))

# EDM - fail
res = breakout(days, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - fail
ediv = e.divisive(as.matrix(days), min.size=158, alpha=1)
plot(as.ts(days))
abline(v=ediv$estimates,col="blue")

[Figures: change point detection failing on the seasonal series – EDM and E-Divisive]

As we can see, due to the seasonality of the time series, traditional change point detection doesn’t work very well.

Removing the Seasonality

To use change point detection effectively, we need to remove the seasonality from our time series, and to do that we need to know the period of the seasonality. In this case, we know the seasonality to be 158 data points per day. If we don’t know it, it’s possible to calculate the seasonality using a Fourier transform. We also need to know whether the time series is multiplicative or additive; in our example it is additive, but most of the time it is multiplicative. With this information we can decompose the series to remove the seasonality, and then run the change point detection again to get a successful result.
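As an aside, if the period were not known, a periodogram could estimate it — a minimal sketch using the TSA package (see the Fourier transform article below), applied to the days series built above:

# estimate the seasonal period from the data (sketch)
install.packages("TSA")
library(TSA)
p = periodogram(days)
1 / p$freq[which.max(p$spec)] # should come out close to 158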

#remove seasonality, option 1: estimate the seasonal component on the whole series
decomposed_days = decompose(ts(days, frequency = 158), "additive")
noSeasonal = days - decomposed_days$seasonal

#remove seasonality, option 2 (used below): estimate it on the normal days only,
#then repeat that daily pattern over the 4 days before subtracting
decomposed_days = decompose(ts(p1, frequency = 158), "additive")
noSeasonal = days - rep(decomposed_days$seasonal[1:158], 4)

# EDM - success
res = breakout(noSeasonal, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - success
ediv = e.divisive(as.matrix(noSeasonal), min.size=158, alpha=1)
plot(as.ts(noSeasonal))
abline(v=ediv$estimates,col="blue")

[Figures: change point detection succeeding on the deseasonalized series – EDM and E-Divisive]

With the seasonality removed, breakout EDM and E-Divisive work a little bit better.

Anomaly Detection with Twitter Breakout

We previously tested the Twitter Anomaly Detection package using the R language. Now let’s take a look at Twitter Breakout Detection.

What is Twitter Breakout Detection?

This Twitter package is intended to detect changes in time series. It is described as E-Divisive with Medians (EDM). As the sections below show, it is supposed to detect divergence (mean shifts and ramp ups), detect changes in distribution, run about 3.5 times faster than other breakout detection methods, and stay robust in the presence of anomalies.

To detect divergence the algorithm uses a clustering technique (mean shift clustering). As EDM is non-parametric, the data doesn’t need to follow any specific distribution; it can adapt to the current distribution. As a result, it can be used to detect when the distribution changes.

We previously explained how moving medians are robust to anomalies. The 3.5x speed improvement compared to E-Divisive is in part due to the use of interval trees to approximate the median very efficiently. However, this isn’t the only reason for the speed increase.

Let’s try it out ourselves.

Divergence Detection (Mean Shift, Ramp Up)

Mean Shift

A sudden jump in the time series is called a mean shift and represents the time series switching from one steady state to another. A good example (used by Twitter) is to imagine CPU utilization suddenly jumping from 40% to 60%. Let’s generate a fake time series to test the detection process.

# install BreakoutDetection
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

set.seed(123)
p1 = rnorm(60, mean = 0, sd = .4)
p2 = rnorm(60, mean = 6, sd = .4)
p3 = rnorm(60, mean = 0, sd = .4)
p4 = rnorm(60, mean = 3, sd = .4)
p5 = rnorm(60, mean = 0, sd = .4)
p6 = rnorm(60, mean = 6, sd = .4)
p7 = rnorm(60, mean = 0, sd = .4)
all = c(p1,p2,p3,p4,p5,p6,p7)
plot(as.ts(all), col = "#27ccc0", lwd = 4)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

[Figure: mean shifts detected by breakout()]

Indeed, we were able to detect the 7 mean divergences, also called breakouts.
A simple help(breakout) command tells us more about the package.

min.size: The minimum number of observations between change points.
method: Method must be one of either “amoc” (At Most One Change) or “multi” (Multiple Changes). For “amoc”, at most one change point location will be returned.
degree: The degree of the penalization polynomial. Degree can take the values 0, 1, and 2. The default value is 1.
beta: A real number constant used to further control the amount of penalization. This is the default form of penalization; it will be used if neither beta nor percent are supplied. The default value is beta=0.008.
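For example, a small sketch of the “amoc” mode on the same series (only the single strongest change point is returned); the loc field listing the detected location is assumed from the package documentation:

# at most one change point ("amoc") on the same series
res_amoc = breakout(all, min.size=20, method='amoc', plot=TRUE)
res_amoc$loc  # location of the detected change point
res_amoc$plot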

Ramp Up

A ramp up is a slow transition from one steady state to another. An example would be CPU utilization slowly transitioning from 40% to 60% over time. Let’s detect this kind of divergence.

ramp = 40
set.seed(1234)
p1 = rnorm(100, mean = 0, sd = .4)
p2 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p3 = rnorm(100, mean = 6, sd = .4)
p4 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p5 = rnorm(100, mean = 0, sd = .4)
p6 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p7 = rnorm(100, mean = 6, sd = .4)
p8 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p9 = rnorm(100, mean = 0, sd = .4)
all = c(p1,p2,p3,p4, p5,p6,p7,p8,p9)
plot(as.ts(all), col = "#27ccc0", lwd = 2)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

[Figure: ramp ups detected as transitions]

Even with the ramp up, breakout detection still identifies 7 distinct states. It’s very powerful to be able to see the ramp up as a transition and not as a breakout.

Detect Changes in Distribution

EDM doesn’t make assumptions about the distribution of the time series. Instead, it learns the current distribution and uses it as a reference. When the distribution suddenly changes, EDM can detect the variation.

set.seed(123456) 
p1 = rnorm(1000, mean = 1)
p2 = rgamma(1000, shape = 1)
p3 = rpois(1000, lambda = 3)/2 - .5
p4 = runif(1000) + .5
p5 = rexp(1000)
p6 = rweibull(1000, shape = 1)

all = c(p1, p2, p3, p4, p5, p6)
plot(as.ts(all))
abline(v=seq(100,1000*6,100),col="blue")

res = breakout(all, min.size=200, method='multi', beta=0.00018, degree=1, plot=TRUE)
res$plot

[Figure: distribution changes detected by EDM]

In the example above we switch successively from the Normal distribution to the Gamma, Poisson, Uniform, Exponential and Weibull distributions. With some (difficult) fine tuning, EDM can detect the change from Normal to Gamma and from Uniform to Exponential. Unfortunately, it doesn’t detect the change from Poisson to Uniform or from Exponential to Weibull. We believe this is a side effect of its robustness to anomalies, as the running median also smooths away part of the difference between distributions. Without anomalies, E-Divisive performs better: the blue lines show where E-Divisive reports a change in distribution.

#install package
install.packages("ecp")
library(ecp)

# ~ 30 min on 2.6Ghz CPU
ediv = e.divisive(as.matrix(all), min.size=200, alpha=1)
plot(as.ts(all))
abline(v=ediv$estimates,col="blue")

[Figure: distribution changes detected by E-Divisive]

3.5× Greater Speed Than Other Breakout Detection Methods

E-Divisive detects changes in distribution as soon as they occur, but is very slow compared to the EDM algorithm. One of the reasons EDM is much faster is its use of interval trees to approximate the median.

Data points   E-Divisive   EDM
6000          29 min       24 s
600           7 s          1 s

Based on only two tests, we can see that EDM performs much faster than E-Divisive.
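The exact figures depend on hardware, but such timings can be measured with system.time() on the 6000-point series built in the previous section:

# how such timings can be measured (results vary by machine;
# the E-Divisive call takes on the order of half an hour)
system.time(breakout(all, min.size=200, method='multi', beta=0.00018, degree=1))
system.time(e.divisive(as.matrix(all), min.size=200, alpha=1))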

Robustness in the Presence of Anomalies

As explained earlier, EDM stands for E-Divisive with Medians. In a previous article, we showed how the running median is robust to anomalies. As a result, an anomaly in the time series isn’t detected as a mean shift.

set.seed(12345)
p = rnorm(20, mean = 1)
all = c(p, 25, p, 30, p, 10, p, 20, p, 70, p , 50, p) #Add Anomalies

res = breakout(all, min.size=40, method='multi', beta=0.001, degree=1, plot=TRUE)
res$plot

[Figure: anomalies not detected as breakouts]

Anomalies aren’t detected as “breakouts” or mean shifts. EDM is, indeed, robust to anomalies.

How to Seasonally Adjust a Time Series in R

What is seasonal adjustment? Why is it important, and when should we use it? How should it be done? In this article, we will answer these and many other questions related to seasonality in data.

Why Seasonally Adjust Data?

Many industries experience fluctuations in various metrics based on the time of year. This means that it is not possible to effectively assess performance by comparing data from one time of year to data from another. Furthermore, these seasonal fluctuations can sometimes be so large that they can mask important business trends hiding in the data.

Consider the sales of an ice cream parlor, for example. It is normal for this sort of business to have higher sales during the warm summer months than during the winter, so you cannot compare January’s sales figures to those from July. Furthermore, the oscillations due to seasonality can be so large that they make business trends hard to see. If sales are up in the most recent month of July, for example, was this due to seasonal variation or an actual increase in sales? To get these answers we need to remove “seasonality” from the data, a process called seasonal adjustment.

[Figure: Airline Passenger Numbers, 1949–1960]

[Figure: Airline Passenger Numbers, 1949–1960 – Seasonally Adjusted]

The images above show airline passenger numbers from 1949 to 1960. The first chart shows the raw data, which illustrates substantial seasonality: people travel more in the summer. The seasonal variations make it difficult to observe potentially noteworthy fluctuations. The seasonally adjusted chart removes the seasonal patterns from the raw data, allowing us to see the actual trend in passenger activity over time. In so doing, we’ve exposed an anomaly in the year 1960 that was not readily noticeable in the raw data.

[Figure: Airline Passenger Numbers, 1996–2015]

[Figure: Airline Passenger Numbers, 1996–2015 – Seasonally Adjusted]

Sometimes an anomaly is so large that it is obvious even in the raw data. For example, the first graph above shows airline passenger numbers from 1996 to 2015. The severe drop in late 2001 was due to the 9/11 attack on the United States, which caused all of North American air space to be closed for several days. But the seasonally adjusted time series makes some more subtle variances easier to see, such as the dip starting in 2008 caused by the Great Recession.

How to Seasonally Adjust Time Series

To seasonally adjust a time series we must first find the seasonality. Performing a time series decomposition will “break down” a time series into multiple sub-time series, one of which will contain the seasonality. The decompose( ) function in R does the heavy lifting here, but there are two requirements to use this function:

  • You must know if you’re dealing with an additive or multiplicative model; read about time series decomposition to know which one to use.
  • You must know the period of the seasonality. This may be a fixed period such as daily, weekly, monthly, or yearly. If unknown, it is also possible to detect the seasonality mathematically.

The decomposition article below explains how to perform a time series decomposition and where these seasonal adjustment formulas come from:

Additive:
Seasonally Adjusted = Time series – Seasonal

Multiplicative:
Seasonally Adjusted = Time series / Seasonal

install.packages("fpp")
library(fpp)
data(ausbeer)
ts_beer = ts(ausbeer, frequency = 4, start = 1956)

decompose_beer = decompose(ts_beer, "additive")
adjust_beer = ts_beer - decompose_beer$seasonal
plot(adjust_beer)

This seasonality is annual with four data points per year (frequency = 4). As an example, here is the seasonally adjusted time series for Australian beer production:

[Figure: Australian Beer Production – Seasonally Adjusted]

install.packages("Ecdat")
library(Ecdat)
data(AirPassengers)
ts_air = ts(AirPassengers, frequency = 12, start = 1949)

decompose_air = decompose(ts_air, "multiplicative")
adjust_air = ts_air / decompose_air$seasonal
plot(adjust_air, lwd = 2)

This is annual seasonality with 12 data points per year (frequency = 12). Here again, as an example, is the seasonally adjusted time series for airline passengers:

[Figure: Airline Passenger Numbers – Seasonally Adjusted]

Conclusion

Seasonally adjusted time series provide a way to understand the underlying trends in data by removing the “noise” of seasonal fluctuations, so outliers and anomalies are easier to see. Just as removing seasonality makes problems easier to spot with your eyes, it also makes them easier for the computer. This is one of many techniques we use at Anomaly.io to automatically detect problems in monitored data.

Extracting Seasonality and Trend from Data: Decomposition Using R

Time series decomposition works by splitting a time series into three components: seasonality, trend and random fluctuation. To show how this works, we will study the decompose( ) and STL( ) functions in the R language.

Understanding Decomposition

Decompose One Time Series into Multiple Series

Time series decomposition is a mathematical procedure which transforms a time series into multiple different time series. The original time series is often split into 3 component series:

  • Seasonal: Patterns that repeat with a fixed period of time. For example, a website might receive more visits during weekends; this would produce data with a seasonality of 7 days.
  • Trend: The underlying trend of the metrics. A website increasing in popularity should show a general trend that goes up.
  • Random: Also called “noise”, “irregular” or “remainder”, this is the residual of the original time series after the seasonal and trend components are removed.

Additive or Multiplicative Decomposition?

To achieve successful decomposition, it is important to choose between the additive and multiplicative models, which requires analyzing the series. For example, does the magnitude of the seasonality increase when the time series increases?

[Figure: additive model] Australian beer production – The seasonal variation looks constant; it doesn’t change when the time series value increases. We should use the additive model.

[Figure: multiplicative model] Airline Passenger Numbers – As the time series increases in magnitude, the seasonal variation increases as well. Here we should use the multiplicative model.

Additive:
Time series = Seasonal + Trend + Random

Multiplicative:
Time series = Trend * Seasonal * Random

The decomposition formula varies a little based on the model.

Step-by-Step: Time Series Decomposition

We’ll study the decompose( ) function in R. As a decomposition function, it takes a time series as a parameter and decomposes it into seasonal, trend and random time series. We’ll reproduce step-by-step the decompose( ) function in R to understand how it works. Since there are variations between the two models, we’ll use two examples: Australian beer production (additive) and airline passenger numbers (multiplicative).

Step 1: Import the Data

Additive

As mentioned previously, a good example of additive time series is beer production. As the metric values increase, the seasonality stays relatively constant.

Multiplicative

Monthly airline passenger figures are a good example of a multiplicative time series. The more passengers there are, the more seasonality is observed.

install.packages("fpp")
library(fpp)
data(ausbeer)
timeserie_beer = tail(head(ausbeer, 17*4+2),17*4-4)
plot(as.ts(timeserie_beer))

[Figure: Australian beer production (additive model)]

install.packages("Ecdat")
library(Ecdat)
data(AirPassengers)
timeserie_air = AirPassengers
plot(as.ts(timeserie_air))

[Figure: airline passenger numbers (multiplicative model)]

Step 2: Detect the Trend

To detect the underlying trend, we smooth the time series using the centred moving average. To perform the decomposition, it is vital to use a moving window of exactly the size of the seasonality; therefore, to decompose a time series we need to know the seasonality period: weekly, monthly, etc. If you don’t know this figure, you can detect the seasonality using a Fourier transform.

Additive

Australian beer production clearly follows annual seasonality. As it is recorded quarterly, there are 4 data points recorded per year, and we use a moving average window of 4.

Multiplicative

The process here is the same as for the additive model. Airline passenger number seasonality also looks annual. However, it is recorded monthly, so we choose a moving average window of 12.

install.packages("forecast")
library(forecast)
trend_beer = ma(timeserie_beer, order = 4, centre = T)
plot(as.ts(timeserie_beer))
lines(trend_beer)
plot(as.ts(trend_beer))

[Figures: beer production with its centred moving average, and the extracted trend]

install.packages("forecast")
library(forecast)
trend_air = ma(timeserie_air, order = 12, centre = T)
plot(as.ts(timeserie_air))
lines(trend_air)
plot(as.ts(trend_air))

[Figures: airline passengers with the centred moving average, and the extracted trend]

Step 3: Detrend the Time Series

Removing the previously calculated trend from the time series will result in a new time series that clearly exposes the seasonality.

Additive

Multiplicative

detrend_beer = timeserie_beer - trend_beer
plot(as.ts(detrend_beer))

[Figure: detrended beer production (additive)]

detrend_air = timeserie_air / trend_air
plot(as.ts(detrend_air))

[Figure: detrended airline passengers (multiplicative)]

Step 4: Average the Seasonality

From the detrended time series, it’s easy to compute the average seasonality: we average the values that share the same position within the seasonal cycle. Technically speaking, we feed the time series into a matrix, transform the matrix so each column contains elements of the same period (same day, same month, same quarter, etc.), and finally compute the mean of each column. Here is how to do it in R:

Additive

Quarterly seasonality: we use a matrix of 4 rows. The average seasonality is repeated 16 times to create the graphic to be compared later (see below)

Multiplicative

Monthly seasonality: we use a matrix of 12 rows.
The average seasonality is repeated 12 times to create the graphic we will compare later (see below)

m_beer = t(matrix(data = detrend_beer, nrow = 4))
seasonal_beer = colMeans(m_beer, na.rm = T)
plot(as.ts(rep(seasonal_beer,16)))

[Figure: average quarterly seasonality (additive)]

m_air = t(matrix(data = detrend_air, nrow = 12))
seasonal_air = colMeans(m_air, na.rm = T)
plot(as.ts(rep(seasonal_air,12)))

[Figure: average monthly seasonality (multiplicative)]

Step 5: Examining Remaining Random Noise

The previous steps have already extracted most of the data from the original time series, leaving behind only “random” noise.

Additive

The additive formula is “Time series = Seasonal + Trend + Random”, which means “Random = Time series – Seasonal – Trend”

Multiplicative

The multiplicative formula is “Time series = Seasonal * Trend * Random”, which means “Random = Time series / (Trend * Seasonal)”

random_beer = timeserie_beer - trend_beer - seasonal_beer
plot(as.ts(random_beer))

[Figure: remaining random noise (additive)]

random_air = timeserie_air / (trend_air * seasonal_air)
plot(as.ts(random_air))

[Figure: remaining random noise (multiplicative)]

Step 6: Reconstruct the Original Signal

The decomposed time series can logically be recomposed using the model formula to reproduce the original signal. Some data points will be missing at the beginning and the end of the reconstructed time series, due to the moving average windows which must consume some data before producing average data points.

Additive

The additive formula is “Time series = Seasonal + Trend + Random”, so we add the three component series back together.

Multiplicative

The multiplicative formula is “Time series = Seasonal * Trend * Random”, so we multiply the three component series back together.

recomposed_beer = trend_beer+seasonal_beer+random_beer
plot(as.ts(recomposed_beer))

[Figure: recomposed beer production (additive)]

recomposed_air = trend_air*seasonal_air*random_air
plot(as.ts(recomposed_air))

[Figure: recomposed airline passengers (multiplicative)]

DECOMPOSE( ) and STL(): Time Series Decomposition in R

To make life easier, some R packages provide decomposition with a single line of code. As expected, our step-by-step decomposition gives the same results as the DECOMPOSE( ) and STL( ) functions (see the graphs).

Additive

The only requirement: seasonality is quarterly (frequency = 4)

Using the DECOMPOSE( ) function:

Multiplicative

The only requirement: seasonality is monthly (frequency = 12)

ts_beer = ts(timeserie_beer, frequency = 4)
decompose_beer = decompose(ts_beer, "additive")

plot(as.ts(decompose_beer$seasonal))
plot(as.ts(decompose_beer$trend))
plot(as.ts(decompose_beer$random))
plot(decompose_beer)

[Figure: decompose( ) output for the additive model]

ts_air = ts(timeserie_air, frequency = 12)
decompose_air = decompose(ts_air, "multiplicative")

plot(as.ts(decompose_air$seasonal))
plot(as.ts(decompose_air$trend))
plot(as.ts(decompose_air$random))
plot(decompose_air)

[Figure: decompose( ) output for the multiplicative model]

Now using the STL( ) function:

ts_beer = ts(timeserie_beer, frequency = 4)
stl_beer = stl(ts_beer, "periodic")
seasonal_stl_beer   <- stl_beer$time.series[,1]
trend_stl_beer     <- stl_beer$time.series[,2]
random_stl_beer  <- stl_beer$time.series[,3]

plot(ts_beer)
plot(as.ts(seasonal_stl_beer))
plot(trend_stl_beer)
plot(random_stl_beer)
plot(stl_beer)

[Figure: stl( ) output for the additive model]

Conclusion

Decomposition is often used to remove the seasonal effect from a time series. It provides a cleaner way to understand trends. For instance, lower ice cream sales during winter don’t necessarily mean a company is performing poorly. To know whether or not this is the case, we need to remove the seasonality from the time series. Here at Anomaly.io we detect anomalies, and we use seasonally adjusted time series to do so. We also use the random (also called remainder) component of the decomposed time series to detect anomalies and outliers.

Moving Median is Robust to Anomalies

It can be difficult to detect the underlying trend of a time series in the presence of anomalies, due to unwanted noise. Fortunately, there are techniques to take into account those anomalies, so you can work with this kind of time series. One of them is the moving median.

What is Robustness?

Implicit or explicit assumptions are often made in statistics. These might be about the distribution, the independence, the randomness or other characteristics of the data. Those assumptions might not always be true, but often are “true enough”. For example, we can assume a time series to be normally distributed when in reality it is only almost normally distributed. We can then use this assumption to perform mathematical computations.

So what are robust statistics? It’s all about the “almost”. A non-robust statistical formula will yield results that make no sense when the assumption isn’t 100% true. Robust statistics, on the other hand, will produce results that make sense, perhaps with some small error, even though the assumption wasn’t 100% true. It also means that a robust statistic should not be affected by outliers or anomalies.
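A quick numeric illustration of that idea (hypothetical values):

# the mean is dragged away by a single outlier, the median barely moves
x = c(10, 11, 9, 10, 500)
mean(x)   # 108
median(x) # 10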

In Robust Statistics, 2nd Edition, Peter J. Huber and Elvezio M. Ronchetti give this good definition of robust statistics:

“A minor error in the mathematical model should cause only a small error in the final conclusions.”


Moving Average Versus Moving Median

Moving averages are commonly used to smooth a time series or remove its noise. This works well, but the presence of anomalies can affect the underlying trend calculation. Robust statistics shouldn’t be affected by outliers or anomalies, so let’s demonstrate how the moving median is a robust statistic.

First, using R language, we create an anomalous seasonal time series:

createDay <- function(noise=0) {
  point=seq(0, pi, by=0.02)
  connection=sin(point)
  noise <- rnorm(length(point), 0, noise)
  return(connection+noise)
}

createDays <- function(totalDays, noise=0) {
  allDays <- c()
  for (day in 1:totalDays ) {
    allDays <- c(allDays, createDay(noise))
  }
  return(allDays)
}

set.seed(10)
days = createDays(3, 0.05)
p1 <- days[0:270]
p2 <- 2.5  # Anomaly here
p3 <- days[(280+2):length(days)]
strangeDay <- append(append(p1, p2), p3)

plot(as.ts(strangeDay), ylim=c(0, 3), col="#e15f3f", lwd = 2)

[Figure: raw time series with an outlier]

Moving Average Isn’t Robust

The commonly used moving average isn’t robust because it smooths the anomaly but doesn’t remove it. As a result, the outlier is “melted” across the width of the window. In this example, the anomaly is spread over 3 values.

#install and import lib
install.packages("forecast")
library(forecast)

movingAverage = ma(strangeDay, 3)

plot(as.ts(movingAverage), ylim=c(0, 3), lwd = 2)

[Figure: the moving average keeps the outlier]

Moving Median Is Robust!

The moving median isn’t as popular as the moving average, but offers some interesting capabilities. The moving median provides a more robust estimate of the trend compared to the moving average. It isn’t affected by outliers: in fact, it removes them!

movingMedian = runmed(strangeDay, 3)
plot(as.ts(movingMedian), ylim=c(0, 3),
     col="#27ccc0", lwd = 2)

The moving median matches the definition of a robust statistic. “A minor error [the anomaly] in the mathematical model [the moving median] should cause only a small error in the final conclusions”.

[Figure: the moving median “removes” the outlier]

Detecting Seasonality Using Fourier Transforms in R

Our brains are really fast at recognizing patterns and forms: we can often find the seasonality of a signal in under a second. It is also possible to do this with mathematics, using the Fourier transform.

First, we will explain what a Fourier transform is. Next, we will find the seasonality of a website from its Google Analytics pageview report using the R language.

Fourier Transforms Explained

The Fourier transform decomposes a signal into all the possible frequencies that comprise it. Fine, but how does it really work? (And keep it simple, please?)

 

1 – Pick a Frequency

First, the Fourier transform starts with the smallest possible frequency. For a signal made of 100 points, the smallest frequency possible is 1/100 = 0.01 Hz. Think of a circle making one full turn over those 100 points, one turn every 100 seconds if the points are recorded every second. Just like a clock.

[Figure: a circle turning at a frequency of 0.01 Hz]

 

2 – “Draw” the Signal Value on the Circle

Using the previously selected frequency we “draw” (decompose) the entire signal on the circle.

[Figure: the signal “drawn” on the circle at 0.01 Hz]

Notice: when the signal value is high, the point is drawn far from the centre of the circle.
The above signal will draw something like this:

[Figure: signal decomposition at 0.01 Hz]

 

3 – Compute the Periodogram

All the measurements are now points on the circle, in two dimensions; we can treat them as vectors. Summing the vectors together gives the final “power” of the frequency: when the vectors line up and point in the same direction, the sum is large and the power of that frequency is high.

4 – Repeat with Different Frequencies

To complete the Fourier transform we repeat the process with different frequencies: 0.02 Hz, 0.03 Hz, and so on, until the “power” of every possible frequency has been computed. The frequencies with the highest power represent the greatest periodicity!
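A minimal R sketch of this idea (illustrative only, not an efficient FFT): project the signal onto a circle turning at each candidate frequency and measure the length of the summed vector.

# "power" of one frequency: length of the summed vectors on the circle
power_at = function(x, freq) {
  n = 0:(length(x) - 1)
  Mod(sum(x * exp(-2i * pi * freq * n)))^2
}

# hypothetical 100-point signal repeating every 10 points
x = sin(2 * pi * 0.1 * (0:99))
freqs = (1:50) / 100
powers = sapply(freqs, power_at, x = x)
freqs[which.max(powers)] # 0.1, the strongest periodicity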

Detecting Seasonality (Example)

We just saw that the frequency with the highest “power” represents the primary seasonality of our underlying metric. In this example, William Cox draws at high speed the signal value on the circle for many different frequencies.

Here is how the original signal value appears:
[Figure: the original signal]

To learn more, I recommend listening to William Cox’s talk about Fourier transforms:

In this example, a frequency of 2.1 Hz gets a very high “power” value. This is where all the data line up and create a big vector.

From the frequency formula:
T = 1 / f
T = 1 / 2.1 Hz = 476 ms

The signal repeats every 476 ms. This is its seasonality!

Detecting Seasonality using R

My personal tech blog clearly shows some weekly trends: It receives much less traffic during the weekend. As a result, my Google Analytics report shows some sort of weekly periodicity. Let’s try to find the seasonality using the R language.

# Install and import TSA package
install.packages("TSA")
library(TSA)

# read the Google Analytics PageView report
raw = read.csv("20131120-20151110-google-analytics.csv")

# compute the Fourier Transform
p = periodogram(raw$Visite)

[Figure: periodogram of the pageview report]

The periodogram shows the “power” of each possible frequency, and we can clearly see a spike at a frequency of about 0.14 Hz.

dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
top2 = head(order, 2)

# display the 2 highest "power" frequencies
top2

        freq     spec
 0.142661180 167143.1
 0.002743484 109146.8

Frequencies of 0.142661180 Hz and 0.002743484 Hz show seasonality!

# convert frequency to time periods
time = 1/top2$f
time

> time
[1] 7.009615 364.500000

The main seasonality detected is 7 days. A secondary seasonality of 364.5 days was also found. Exactly what we expected! My blog has weekly seasonality as well as annual seasonality.

Anomaly Detection Using K-Means Clustering

Monitored metrics very often exhibit regular patterns. When the values are correlated with the time of the day, it’s easier to spot anomalies; it’s harder when they are not. In some cases, it is possible to use machine learning to differentiate the usual patterns from the unusual ones.

Regular Patterns

Many metrics follow regular patterns correlated to time. Websites, for example, commonly experience high activity during the day and low activity at night. This correlation with time makes it easy to spot anomalies, but it’s more difficult when that correlation is not present.

1. With Time Correlation

When the metric is correlated to time, the key point is to find its seasonality. When today’s pattern is the same as yesterday’s, the seasonality is daily. It is also common to find a seasonality of one week, because Saturday’s patterns often don’t follow Friday’s, but rather those of the Saturday of the previous week. Because many metrics are correlated to human activity, seasonality is often weekly. But it could also be measured in seconds, minutes, hours, months or an arbitrary period.

Subtracting the usual pattern (seasonality) from the overall signal should create a new “difference metric”. Due to randomness, the resulting signal should be pure noise with a mean of zero. Under such conditions, we can apply the normal distribution to detect anomalies. Any values that are too high or too low can be treated as anomalous.
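A minimal R sketch of this approach, on synthetic data (a hypothetical hourly metric with daily seasonality):

# subtract the seasonality and flag points outside 3 standard deviations
set.seed(42)
x = rep(sin(seq(0, 2 * pi, length.out = 24)), 14) + rnorm(24 * 14, sd = 0.1)
x[200] = x[200] + 2                                   # inject an anomaly
diffm = decompose(ts(x, frequency = 24))$random       # the "difference metric"
limit = 3 * sd(diffm, na.rm = TRUE)
which(abs(diffm - mean(diffm, na.rm = TRUE)) > limit) # index 200 stands out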

[Figure: subtracting the seasonality]

2. Without Time Correlation

A heartbeat has many recurring patterns. The heart beats on average every 0.8s, but this is an average period, not a fixed period of time. Moreover, the period and the value of the signal might change a lot due to physical activity, stress or other effects. So it isn’t possible to just use a period of 0.8 seconds.

So what can we do? Machine learning to the rescue!

Machine Learning Detection

We can group similar patterns into categories using machine learning. Then, we subtract its closest category from each new beat. If the original beat and the category beat are very similar, the result should be pure noise with a mean of zero. As above, it’s now possible to apply the normal distribution to detect anomalies. For an anomalous beat, even the closest category will still be very different, so the anomaly will be easy to detect as it will create a peak in the “difference metric”.

This requires 4 steps:

  1. Sliding Window
  2. Clustering
  3. Noise Transform
  4. Detect Anomalies

1. Sliding Window

The first step is to “cut” the normal heartbeat signal into smaller chunks. To get better results later, we are going to use a sliding window with an overlap of 50%. For each window, we will apply a Hamming function to clean the signal.
In the following heartbeat implementation, we choose a sliding window of 60 values, represented by a red rectangle, so we move the window by 30 values at a time. For each window, we multiply the heartbeat signal (in green) by the Hamming function (in red). The end result (in green) is saved to create categories in the next step.

hamming sliding window
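For reference, the Hamming window used here is the standard taper w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)); the built-in hamming(60) call used later should produce the same values (assuming the signal package is loaded). A quick check, for illustration only:

% the Hamming taper for a 60-sample window, written out explicitly
N = 60;
n = (0:N-1)';
w = 0.54 - 0.46 * cos(2*pi*n/(N-1));

% should be (close to) zero if it matches the built-in window
max(abs(w - hamming(60)))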

2. Clustering

The next step is to group together similar patterns produced by the sliding window. We will use a machine learning technique known as k-means clustering, run either in Matlab/Octave or in Mahout. This clusters our signal into a catalogue of 1000 categories. Some of these categories are plotted in the following schema.

kmean clusters2

3. Noise Transform

For each sliding window, we select the closest category from the catalogue of patterns created previously. Subtracting the sliding window from its category should produce a simple noisy signal. If the signal is repetitive and enough clusters were chosen, the whole signal can be turned into a normally distributed noisy signal with a mean of zero. Because this noisy signal is normally distributed, we can apply the normal distribution to detect anomalies: plot its histogram, or take an extreme percentile such as the 99.999th, and use it as the threshold limit to apply to the noisy signal.

substract window to cluster

normal distribution

kmean histogram

4. Detect Anomalies

For each new sliding window, we repeat the above process to turn it into a noise signal. Previously we defined minimum and maximum thresholds, and the noisy sliding window should stay within those limits; if not, this is an anomaly! But that's not all: we also established that the noise signal is normally distributed with a mean of zero. If it suddenly follows another distribution, something unusual or anomalous is happening.

heart anomaly animation (min/max anomaly)

normal distribution changed (distribution anomaly)

Implementation

1. Requirement

First, you need to download and install Octave (recommended) or Matlab. A few packages will be used in this tutorial. Open an Octave or Matlab shell, then install and load these packages:

% Install package
pkg install -forge io
pkg install -forge statistics
pkg install -forge control
pkg install -forge signal

% Import package
pkg load signal
pkg load statistics

2. Sliding Window

You could use the WFDB Toolbox from physionet.org to download a heartbeat signal and convert it into Octave/Matlab format, but it's easier to simply download the already-converted file "16773m.mat".

Load the signal in Octave/Matlab, convert it into sliding windows, and apply the Hamming function:

% change with your own path (the directory containing 16773m.mat)
cd /path/to/

% load & plot some heartbeat
data = load("16773m.mat");
signal = data.val(1,:);
plot(signal(1:1000))

% convert to 60-sample sliding windows with a 30-sample (50%) overlap
% http://www.mathworks.com/help/signal/ref/buffer.html
windows = buffer(signal, 60, 30);

% apply the Hamming function to taper each window
% http://www.mathworks.com/help/dsp/ref/windowfunction.html
mutate = windows.*hamming(60);
points = mutate.';
% drop the first and last windows, which buffer() pads with zeros
cleanPoints = points(2:end-1, :);

% save for Mahout (text file, one window per line)
dlmwrite('heatbeat.txt',cleanPoints,'delimiter',' ','precision','%.4f');

All sliding windows are kept in memory in the "cleanPoints" variable and also written to the "heatbeat.txt" text file. The variable will be used for clustering in Octave/Matlab; the text file will be used for clustering with Mahout.
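As a quick, optional sanity check (not part of the original walkthrough), you can confirm the shape of cleanPoints and look at one tapered window before clustering:

% optional sanity check: one tapered 60-sample window per row
size(cleanPoints)           % should report 60 columns
plot(cleanPoints(100,:))    % inspect one window (row index chosen arbitrarily)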

3. Clustering

Let's now find common patterns in the signal. We will use the k-means clustering technique, which is part of the machine learning field. It works by assigning each point to the cluster whose mean is nearest, then recomputing the cluster means and repeating until the assignment stabilizes.
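To illustrate the k-means idea itself (this is a naive toy sketch on made-up 2-D points, not the fkmeans or Mahout runs used below):

% naive k-means on toy 2-D points: assign each point to its nearest mean,
% recompute the means, repeat (no empty-cluster handling; illustration only)
X = [randn(50,2); randn(50,2) + 5];        % two well-separated groups
K = 2;
centers = X(randperm(size(X,1), K), :);    % random initial means

for iter = 1:20
  % assignment step: squared distance from every point to every centre
  D = zeros(size(X,1), K);
  for k = 1:K
    D(:,k) = sum((X - repmat(centers(k,:), size(X,1), 1)).^2, 2);
  end
  [~, labels] = min(D, [], 2);

  % update step: each centre becomes the mean of its assigned points
  for k = 1:K
    centers(k,:) = mean(X(labels == k, :), 1);
  end
end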

using octave or mahout

3.1. Clustering with Octave or Matlab

When the data fits into RAM, Octave or Matlab is a good choice. For a small quantity of data, such as this heartbeat signal, use Octave or Matlab.

Octave and Matlab come with a k-means implementation in the statistics package. As it's quite slow, we will use a faster implementation called Fast K-means. Download fkmeans.m and save the file into the directory of your Octave/Matlab workspace. Use the "pwd" command to find your workspace path.

If you save fkmeans.m in your current workspace, you should be able to create 1000 clusters:

[~, clusters, ~] = fkmeans(cleanPoints, 1000);

It should take around 30 minutes, so let's save the result to a file, just in case.

save "1000_clusters_fkmeans.mat" clusters -ascii;

If you have any trouble creating this file, you can download 1000_clusters_fkmeans.mat and simply load it within Octave or Matlab:

clusters = load("1000_clusters_fkmeans.mat");

3.2. Clustering with Mahout

When the data you want to cluster is huge and won't fit into memory, you should choose a tool designed for this use case, such as Mahout.

First, download and install Mahout. Previously we created a file called "heatbeat.txt" containing all the sliding windows. Copy this file into a directory called "testdata", then run the Mahout clustering from your shell:

mkdir testdata
cp /path/to/heatbeat.txt testdata/
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --input testdata --output out_1000 --numClusters 1000 --t1 200 --t2 100 --maxIter 50

A few hours later, the clustering result will be written into the "out_1000" directory. Now export the Mahout result to Octave or Matlab: let's use cluster2csv.jar to convert this Mahout file into a CSV file.

java -jar cluster2csv.jar out_1000/clusters-50-final/part-r-00000 > 1000_clusters_mahout.csv

Finally, let's import 1000_clusters_mahout.csv into Octave or Matlab:

clusters = load("1000_clusters_mahout.csv");

4. Noise Transform

To transform the original signal into a noise signal, we need a few steps. First, we need a function that can find the closest cluster for a given signal window:

function clusterIndices = nearKmean(clusters, data)
% for each row of "data", return the index of the closest cluster centre
numObservations = size(data, 1);
K = size(clusters, 1);
D = zeros(numObservations, K);
for k = 1:K
 % squared Euclidean distance to cluster k (no sqrt needed to find the minimum)
 D(:,k) = sum(((data - repmat(clusters(k,:), numObservations, 1)).^2), 2);
end
[minDists, clusterIndices] = min(D, [], 2);
end

We need to subtract the window from the closest cluster:

function diffWindows = diffKmean(clusterIndices, clusters, data)
% subtract each window from its closest cluster centre, one row per window
diffWindows = [];
for i = 1:length(clusterIndices)
 entry = data(i,:);
 cluster = clusters(clusterIndices(i),:);
 diff = cluster - entry;
 diffWindows = [diffWindows; diff];
end
end

And finally, a function to reconstruct the "difference signal", also called the "error signal":

function reconstruct = asSignal(diffWindows)
% rebuild a 1-D signal from the 50%-overlapping windows by overlap-add:
% the second half of each window is summed with the first half of the next one
rDiffWindows = diffWindows.';
p1 = rDiffWindows(1:size(rDiffWindows,1)/2, 2:end);
p2 = rDiffWindows((size(rDiffWindows,1)/2)+1:size(rDiffWindows,1), 1:end-1);
reconstruct = reshape(p1 + p2, 1, []);
end

Let's apply the above functions to a sample and draw the error signal:

samples = cleanPoints(20200:20800, :);
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta)
ylim ([-200 600])

graph error signal
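Step 3 above describes deriving minimum/maximum limits from this error signal (a normal-distribution band, or an extreme percentile such as the 99.999th), but that step isn't shown in code. Here is a hedged sketch of it; loLimit and hiLimit are names introduced only for this example:

% derive min/max limits from the "normal" error signal
mu = mean(reconstructDelta);
sigma = std(reconstructDelta);

% a normal-distribution band ...
loLimit = mu - 4*sigma;
hiLimit = mu + 4*sigma;

% ... or, alternatively, extreme empirical percentiles (statistics package)
% loLimit = quantile(reconstructDelta, 0.00001);
% hiLimit = quantile(reconstructDelta, 0.99999);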

5. Detect Anomaly

Let’s simulate the beginning of a heart attack and detect the anomaly.

% take a normal beat and inject an abnormal spike into it
attack = cleanPoints(10002, :);
attack(35:43) = [100,350,325,250,410,380,300,200,100];
% find its closest pattern in the catalogue of clusters
clusterId = nearKmean(clusters, attack);
cluster = clusters(clusterId,:);

plot(attack, "linewidth",1.5, 'Color', [0.1484375,0.7578125,0.71484375]);
ylim([-200 500]);
hold on;
plot(cluster, "linewidth",1.5, 'Color', [0.875486381,0.37109375,0.24609375]);
hold off;

heart attack

Clearly, there is a huge difference between the heart attack and its closest pattern. The following plot of the difference highlights the anomaly.

diff = attack-cluster;
plot(diff, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

heart attack diff
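Reusing the loLimit/hiLimit values from the earlier sketch (again, names introduced for illustration, not part of the original code), a simple min/max check would flag this beat:

% min/max check on the difference between the beat and its closest cluster
if any(diff < loLimit | diff > hiLimit)
  disp("anomaly: beat is outside the normal error-signal limits")
end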

Previously we established more rules than just a simple minimum and maximum: the noisy "error signal" should have a mean of zero and follow a specific normal distribution. Let's simulate a dysfunction that breaks this rule.

% take a batch of normal windows, then a batch with a distorted distribution
p1 = cleanPoints(10001:10020, :);
p2 = cleanPoints(10020:10040, :);
p2 = p2 - (p2*1.2) + 15;   % scale, flip and shift the second batch
samples = [p1; p2];
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

not-signal-normal-distribution

The error signal no longer looks normally distributed: it clearly doesn't follow its usual distribution anymore, so this is an anomaly.

clusterIndices = nearKmean(clusters, p2);
diffWindows = diffKmean(clusterIndices, clusters, p2);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
hist(reconstructDelta, 50);

not normal distribution
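Beyond eyeballing the histogram, one simple quantitative check, sketched here with illustrative (untuned) thresholds, is to compare the mean and spread of this distorted error signal against the baseline mu and sigma computed in the earlier thresholding sketch:

% compare the distorted error signal against the baseline noise statistics
newMu = mean(reconstructDelta);     % reconstructDelta now holds the distorted segment
newSigma = std(reconstructDelta);

% illustrative rules: the mean should stay near zero and the spread
% should stay close to the baseline spread
if abs(newMu - mu) > 3*sigma/sqrt(length(reconstructDelta)) || newSigma > 2*sigma
  disp("anomaly: the error-signal distribution has changed")
end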

Conclusion

This technique is useful only under certain conditions: apply it when the signal repeats itself, but not within a fixed period of time. If the period is fixed, such as daily or weekly, it is preferable to simply model the seasonality to detect anomalies.

We have only scratched the surface here by treating the error signal as normally distributed; in reality there are many more ways to detect anomalies in a normally distributed signal.

The post Anomaly Detection Using K-Means Clustering appeared first on Anomaly.

]]>
https://anomaly.io/anomaly-detection-clustering/feed/ 2