Detecting Anomalies with Moving Median Decomposition

anomaly-decomposition-median

Time series decomposition splits a time series into seasonal, trend and random residual time series. The trend and the random time series can both be used to detect anomalies. But detecting anomalies in an already anomalous time series isn’t easy.

TL;DR

When working on an anomalous time series:

Anomaly detection with moving average decomposition doesn’t work
Anomaly detection with moving median decomposition works

The Problem with Moving Averages

In the blog entry on time series decomposition in R, we learned that the algorithm uses a moving average to extract the trends of time series. This is perfectly fine in time series without anomalies, but in the presence of outliers, the moving average is seriously affected, because the trend embeds the anomalies.

In this demonstration, we will first detect anomalies using decomposition with a moving average. We’ll see that this doesn’t work well, and so will try detecting anomalies using decomposition with a moving median to get better results.

About the data: webTraffic.csv reports the number of page view per day over a period of 103 weeks (almost 2 years). To make the data more interesting, we added some (extra) anomalies to it. Looking at the time series, we clearly see a seasonality of 7 days; there is less traffic on weekends.

To decompose a seasonal time series, the seasonality time period is needed. In our example, we know the seasonality to be 7 days. If unknown, it is possible to determine the seasonality of a time series. Last but not least, we need to know if the time series is additive or multiplicative. Our web traffic is multiplicative.

To sum up our web traffic:

Seasonality of 7 days (over 103 weeks)
Multiplicative time series
Download here: webTraffic.csv

set.seed(4)
data <- read.csv("webTraffic.csv", sep = ",", header = T)
days = as.numeric(data$Visite)
for (i in 1:45 ) {
 pos = floor(runif(1, 1, 50))
 days[i*15+pos] = days[i*15+pos]^1.2
}
days[510+pos] = 0
plot(as.ts(days))

set.seed(4)

data <- read.csv("webTraffic.csv", sep = ",", header = T)

days = as.numeric(data$Visite)

for (i in 1:45 ) {

pos = floor(runif(1, 1, 50))

days[i*15+pos] = days[i*15+pos]^1.2

}

days[510+pos] = 0

plot(as.ts(days))

webTraffic-anomalies

Moving Average Decomposition (Bad Results)

Before going any further, make sure to import the data. Also, I recommend being sure that you understand how time series decomposition works. The stats package provides the handy decompose function in R.

1 – Decomposition

As the time series is anomalous during the decomposition, the trends become completely wrong. Indeed, the anomalies are averaged into the trend.

# install required library
install.packages("FBN")
library(FBN)

decomposed_days = decompose(ts(days, frequency = 7), "multiplicative")
plot(decomposed_days)

# install required library

install.packages("FBN")

library(FBN)

decomposed_days = decompose(ts(days, frequency = 7), "multiplicative")

plot(decomposed_days)

moving-average-decomposition

2 – Using Normal Distribution to Find Minimum and Maximum

To the random noise we can apply the normal distribution to detect anomalies. Values under or over 4 times the standard deviation can be considered outliers.

random = decomposed_days$random
min = mean(random, na.rm = T) - 4*sd(random, na.rm = T)
max = mean(random, na.rm = T) + 4*sd(random, na.rm = T)

plot(as.ts(as.vector(random)), ylim = c(-0.5,2.5))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

random = decomposed_days$random

min = mean(random, na.rm = T) - 4*sd(random, na.rm = T)

max = mean(random, na.rm = T) + 4*sd(random, na.rm = T)

plot(as.ts(as.vector(random)), ylim = c(-0.5,2.5))

abline(h=max, col="#e15f3f", lwd=2)

abline(h=min, col="#e15f3f", lwd=2)

moving-average-limits

3 – Plotting Anomalies

Let’s find the values either over or under 4 standard deviations to plot the anomalies. As expected, the calculation fails to find many anomalies, due to the use of moving average by the decomposition algorithm.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
anomaly = anomaly[!is.na(anomaly$value), ]

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

#find anomaly

position = data.frame(id=seq(1, length(random)), value=random)

anomalyH = position[position$value > max, ]

anomalyH = anomalyH[!is.na(anomalyH$value), ]

anomalyL = position[position$value < min, ]

anomalyL = anomalyL[!is.na(anomalyL$value), ]

anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),

value=c(anomalyH$value, anomalyL$value))

anomaly = anomaly[!is.na(anomaly$value), ]

plot(as.ts(days))

real = data.frame(id=seq(1, length(days)), value=days)

realAnomaly = real[anomaly$id, ]

points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-average-anomalies

Moving Average Decomposition (Good Results)

In our second approach, we will do the decomposition using a moving median instead of a moving average. This requires understanding how the moving median is robust to anomalies and how time series decomposition works. Also, again make sure to import the data. The moving median removes the anomalies without altering the time series too much.

1 – Applying the Moving Median

The decomposed trend with moving median (see above graphic) is a bit “raw” but expresses the trend better.

#install library
install.packages("forecast")
library(forecast)
library(stats)

trend = runmed(days, 7)
plot(as.ts(trend))

#install library

install.packages("forecast")

library(forecast)

library(stats)

trend = runmed(days, 7)

plot(as.ts(trend))

moving-median-trends

2 – Using the Normal Distribution to Find the Minimum and Maximum

As the decomposition formula expresses, removing the trend and seasonality from the original time series leaves random noise. As it should be normally distributed, we can apply the normal distribution to detect anomalies. As we saw previously, values under or over 4 times the standard deviation can be considered outliers. As random time series still contain the anomalies, we need to estimate the standard deviation without taking the anomalies into account. Once again, we will use the moving median to exclude the outliers.

detrend = days / as.vector(trend)
m = t(matrix(data = detrend, nrow = 7))
seasonal = colMeans(m, na.rm = T)
random = days / (trend * seasonal)
rm_random = runmed(random[!is.na(random)], 3)

min = mean(rm_random, na.rm = T) - 4*sd(rm_random, na.rm = T)
max = mean(rm_random, na.rm = T) + 4*sd(rm_random, na.rm = T)
plot(as.ts(random))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

detrend = days / as.vector(trend)

m = t(matrix(data = detrend, nrow = 7))

seasonal = colMeans(m, na.rm = T)

random = days / (trend * seasonal)

rm_random = runmed(random[!is.na(random)], 3)

min = mean(rm_random, na.rm = T) - 4*sd(rm_random, na.rm = T)

max = mean(rm_random, na.rm = T) + 4*sd(rm_random, na.rm = T)

plot(as.ts(random))

abline(h=max, col="#e15f3f", lwd=2)

abline(h=min, col="#e15f3f", lwd=2)

moving-median-limits

3 – Plotting Anomalies

As before, values over or under 4 times the standard deviation are plotted as anomalous. This works much better; we detect almost every anomaly! Once again, this is because the moving median is robust to anomalies.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value)]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
points(x = anomaly$id, y =anomaly$value, col="#e15f3f")

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

#find anomaly

position = data.frame(id=seq(1, length(random)), value=random)

anomalyH = position[position$value > max, ]

anomalyH = anomalyH[!is.na(anomalyH$value), ]

anomalyL = position[position$value < min, ]

anomalyL = anomalyL[!is.na(anomalyL$value)]

anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),

value=c(anomalyH$value, anomalyL$value))

points(x = anomaly$id, y =anomaly$value, col="#e15f3f")

plot(as.ts(days))

real = data.frame(id=seq(1, length(days)), value=days)

realAnomaly = real[anomaly$id, ]

points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-median-anomalies

Here is an alternative plot displaying the same results: values outside the blue area can be considered anomalous.

high = (trend * seasonal) * max
low = (trend * seasonal) * min
plot(as.ts(days), ylim = c(0,200))
lines(as.ts(high))
lines(as.ts(low))

x = c(seq(1, length(high)), seq(length(high), 1))
y = c(high, rev(low))
length(x)
length(y)
polygon(x,y, density = 50, col='skyblue')

high = (trend * seasonal) * max

low = (trend * seasonal) * min

plot(as.ts(days), ylim = c(0,200))

lines(as.ts(high))

lines(as.ts(low))

x = c(seq(1, length(high)), seq(length(high), 1))

y = c(high, rev(low))

length(x)

length(y)

polygon(x,y, density = 50, col='skyblue')

Monitor & detect anomalies with Anomaly.io

Aleksandras Urbonas
Well done, I like it.
Could you fix the code: anomalyL[!is.na(anomalyL$value)], because all columns are being selected, and a comma is missing. right before the closing bracket: anomalyL[!is.na(anomalyL$value) , ]
Have a lovely day,
Sana Munawar
m = t(matrix(data = detrend, nrow = 7))
rm_random = runmed(random[!is.na(random)], 3)
I am not sure what do numbers 7 and 3 mean in the above lines. My data is with frequence 23 observations per year. I have total 395 observations. Sorry for my ignorance but I dont understand what values should I use instead.
- Naviden
  “7” is the seasonality period –> weekly data –> 7 days
  after using runmed(x,7) we remove the outlier effect from trend so the random part will have the outlier effect –> raw data(has outlier) / trend(no outlier) / seasonality(no outlier) = random(with outlier effect)
  so we need to use runmed() again to remove it. I don’t know why exactly the author used 3 periods but I think it wont change the results too much.
  the only thing to note is that the above “raw data(has outlier) / trend(no outlier) / seasonality(no outlier) = random(with outlier effect)” doesn’t accurately represent the method showed in this article because the author has used colMean() for calculating the seasonality which results in having a dirty seasonality (containing outlier effect) while I think “colMedians()” should be used instead (robustbase library)
Sana Munawar
Thank you for this amazing article and other articles on this topic.
Bob Doe
hi I had a question about this line:
rm_random = runmed(random[!is.na(random)], 3)
Why are we setting the window to 3 in this instance? Is it just arbitrary?
Demetrios Eliades
Hi Martin, why didn’t you use the moving median filter for the random variable in the first example, in the same way you did in the second? The results appear to be similar when applied.