Salesforce Anomaly Detection Using Anomaly.io (30 Jan 2017)

detect anomalies in salesforce

Monitoring Key Performance Indicators (KPIs) is essential to running a successful business. As one example, you should examine your lead generation KPI several times a day, to allow you to detect and correct problems as quickly as possible.

But you’re busy; you don’t have time to watch KPI indicators all day long. That’s where Anomaly.io comes in. By combining our detection algorithms with your Salesforce data, you can automatically detect problems and notify the appropriate personnel to ensure that speedy corrective action is taken.

From Salesforce to InfluxDB (Anomaly.io Protocol)

Let’s see how we can forward Salesforce data to Anomaly.io for automatic anomaly detection.

First, install the StreamSets Data Collector, which can easily subscribe to and stream Salesforce events. Next, log into your fresh StreamSets install with the default login/password combination “admin/admin”.
Now install the packages “Salesforce Library” and “InfluxDB 0.9+” from the PackageManager:

StreamSets package manager

Create a new Pipeline. Click the “Stage Library” (top right icon), select the origin “Salesforce”, then the processor “Groovy Evaluator”, and finally, the destination “InfluxDB”.

salesforce origin

Now link the nodes together:

Anomaly Flow Salesforce InfluxDB

Salesforce Node

This node collects historical “Opportunities” and subscribes to upcoming ones through the Salesforce Streaming API.

Enter this data on the Salesforce tab:

  • Username: salesforceUserName
  • Password: MyPassword+securityToken
    e.g: MyPassword1FPeZD9z[…]
    Reset your salesforce security token
  • API version: 36.0
  • Query Existing Data: Yes
  • Subscribe for Notifications: Yes

IMPORTANT: Be sure to enter “36.0” for the API version, as this is currently the only compatible version.

On the Query tab:

  • Use Bulk API: Yes
  • SOQL Query:
    SELECT Id, AccountId, CreatedDate, CloseDate, Amount
    FROM Opportunity WHERE Id > ‘${OFFSET}’ ORDER BY Id

And finally, on the Subscribe tab:

  • Push Topic: OpportunityCreate

Groovy Node

This node transforms Salesforce output into InfluxDB input. You need to do the following:

  • Set the required AccountId, if it is missing.
  • Translate Salesforce time into InfluxDB time.
  • Set a “measurement” to save the events into.

Example code for the Groovy tab:

for (record in records) {
 try {
  // store every event in the "opportunity" measurement
  record.value['measurement'] = 'opportunity'
  // convert the Salesforce ISO-8601 timestamp into epoch milliseconds for InfluxDB
  Date date = Date.parse("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", record.value['CreatedDate'])
  record.value['CreatedDate'] = date.getTime()
  // AccountId is used as a tag, so give it a default value when missing
  if(record.value['AccountId'] == null){
   record.value['AccountId'] = "unknown"
  }
  output.write(record)
 } catch (e) {
  log.error(e.toString(), e)
  error.write(record, e.toString())
 }
}

InfluxDB Node

This node sends the data to Anomaly.io, which uses the InfluxDB protocol, for automatic anomaly detection.

Enter the following on the InfluxDB tab:

  • URL: http://events.anomaly.io:8086
  • User: yourAnomalyUser
  • Password: yourAnomalyPasswd
  • Database Name: salesforce
  • Auto-create Database: Yes
  • Record Mapping: Custom Mappings
  • Measurement Field: /measurement
  • Tag Fields: /AccountId
  • Value Fields: /Amount /CreatedDate /CloseDate

Register Salesforce PushTopic

In our StreamSets configuration, the Salesforce origin will request all historical data, then subscribe to a channel we called “OpportunityCreate”. To set this up, we need to create this PushTopic in Salesforce:

  1. Login to Salesforce
  2. Open the Developer Console.
  3. Select Debug | Open Execute Anonymous Window.
  4. In the Enter Apex Code window, paste in the following Apex code, then click Execute.

PushTopic pushTopic = new PushTopic();
pushTopic.Name = 'OpportunityCreate';
pushTopic.Query = 'SELECT Id, AccountId, CreatedDate, CloseDate, Amount FROM Opportunity';
pushTopic.ApiVersion = 36.0;
pushTopic.NotifyForOperationCreate = true;
pushTopic.NotifyForOperationUpdate = false;
pushTopic.NotifyForOperationUndelete = false;
pushTopic.NotifyForOperationDelete = false;
pushTopic.NotifyForFields = 'Referenced';
insert pushTopic;

Run Streaming from Salesforce to InfluxDB (or Anomaly.io)

Click outside any node, then go to Configuration > Error Records and select “(Library: Basic)”.  Now we can finally run the pipeline:

Datasets run Anomaly Detection

You should see something like the following. Note that here a total of 32 “Opportunities” were processed, and StreamSets is waiting for new events to come in:

anomaly pipe

Salesforce historical data has now been imported into InfluxDB. What about future “Opportunities”? Let’s check to make sure the Salesforce Streaming API works:

  1.  Login to workbench.developerforce.com
  2. Go to Data > Insert | Object Type “Opportunity” | Next
  3. Enter:
    AccountId: 0010Y00000DX4EgQAL
    Amount: 100
    CloseDate: 2019-05-27T10:43:35.000Z
    Name: My Opportunity
    StageName: Open

Data should be processed within one second and then sent to InfluxDB or Anomaly.io.

Detect Anomalies in Salesforce

Using Anomaly.io

You can spend a lot of money on R&D, or you can save that money and use anomaly.io to detect exceptions and issues in your data. After a week of intensive machine learning, our algorithm will pick the best anomaly detection method for your data. Our system can automatically apply techniques such as k-means clustering, anomaly correlation, breakout detection and trend decomposition in real time, along with others.

Simply forward your data to anomaly.io; we’ll keep an eye on it.

Using Custom Anomaly Detection

Anomaly detection is complex, and especially difficult when you need to work with unknown problems. Your users and business will evolve, so you can’t rely on just setting simple thresholds.

Before you start, always check to make sure that your data is compatible with the algorithms you are using. For example, a common error is to use an algorithm intended for normally distributed time series when your data isn’t normally distributed.
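For example, a quick way to sanity-check that normality assumption in R (a minimal sketch; the vector x below is stand-in data, not output from Salesforce):

# x stands in for the metric you want to monitor (hypothetical data)
x = rnorm(500, mean = 100, sd = 15)

hist(x, breaks = 30)      # visual check: does the histogram look bell-shaped?
qqnorm(x); qqline(x)      # points should roughly follow the line if x is normal
shapiro.test(x)           # formal test: a small p-value suggests non-normality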

You can also forward your Salesforce data into an open source solution such as Twitter Anomaly Detection, Twitter Breakout, or Skyline, or develop your own.

Detecting Anomalies in Correlated Time Series (25 Jan 2017)

anomaly in time series correlation

Monitoring key performance indicators (KPIs), sales or other product data means working within an ecosystem where metrics very often correlate with each other. When a normal correlation between two metrics is broken, we have reason to suspect that something strange is happening.

As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, the new users are shown in green and the returning users in red. Clearly, something strange happens in the middle of November. Let’s use some techniques to find out more!

Step 1: Collect the Data

With Google Analytics it’s fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer streaming our Google Analytics in real time. But for the sake of simplicity, we will use a plain CSV file for this exercise.

You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.

google-analytics-correlation

#R language
install.packages("dplyr")
library(dplyr)

#load data
csv = read.csv("/path/to/new-vs-returning-visitor.csv")

#load new visitor
new = filter(csv,Segment=="New Users")
new_data = new[4]$Sessions
new_date = as.Date(lapply(new[1], as.character)$Day_Index, "%m/%d/%y")
new_df = data.frame(date=new_date, value=new_data)

#load returning visitor
return = filter(csv,Segment=="Returning Users")
return_data = return[4]$Sessions
return_date = as.Date(lapply(return[1], as.character)$Day_Index, "%m/%d/%y")
return_df = data.frame(date=return_date, value=return_data)

#plot
plot_df = function(new_df, return_df = NULL, ylim=c()) {
	plot(value ~ date, new_df, xaxt="n", type="l", col="#00c7b0", ylim=ylim, lwd=2)
	date_axis = new_df$date[1:(length(new_df$date)/30.5)*30.5]+2
	axis(1, date_axis, format(date_axis, "%b"), cex.axis = .7)
	if(!is.null(return_df)) {
		lines(value ~ date, return_df, xaxt="n", type="l", col="#ff0816", lwd=2)
	}
}
plot_df(new_df, return_df)
legend("topleft", c("new visitor","returning visitor"),
      col=c("#00c7b0","#ff0816"), lty=c(1,1))

anomaly correlation R

Now you have two variables in your R environment:

  • new_df: new visitors
  • return_df: returning visitors

Step 2: Verify the Cross-Correlation

Our analysis assumes that the two time series are correlated, so before going any further, let’s make sure this is the case using cross-correlation. Check our article on understanding the cross-correlation algorithm if you need a refresher.

BONUS: Find correlated metrics in multiple time series

library(stats)
library(forecast)
ma_new_data = ma(new_data, 3)
ma_new_data = ma_new_data[!is.na(ma_new_data)]
ma_return_data = ma(return_data, 3)
ma_return_data = ma_return_data[!is.na(ma_return_data)]

corr = ccf(ma_new_data, ma_return_data, lag.max = 0)
corr # display 0.876

Because the moving average smooths out potential outliers, we apply it before computing the correlation. The cross-correlation function (CCF) returns a very high value of 0.876: the time series are clearly correlated.

Step 3: Subtract the Time Series

On average, there are 1.736 times more new visitors than returning visitors. Let’s scale the returning-visitor metric to bring both series to the same level.

multi = sum(ma_new_data/ma_return_data)/length(ma_return_data)
multi #display 1.736529
align_return_df = return_df
align_return_df$value = align_return_df$value*multi
plot_df(new_df, align_return_df)
legend("topleft", c("new visitor","returning visitor x1.736"), 
     col=c("#00c7b0","#ff0816"), lty=c(1,1))

align correlation

substract = return_df
substract$value = new_df$value - return_df$value*multi
plot_df(substract)

subtract

Step 4: Find Outliers in Correlated Time Series

The resulting signal looks like “noise”. Plotting its histogram shows a normally distributed signal, and we already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.

hist(substract$value, breaks = 50)

histogram normal distribution

The 3-sigma rule states that nearly all values lie within three standard deviations of the mean; everything outside this range can be treated as anomalous.

#compute 3 sigma minimal & maximal
min = mean(substract$value, na.rm = T) - 3*sd(substract$value, na.rm = T)
max = mean(substract$value, na.rm = T) + 3*sd(substract$value, na.rm = T)

# plot metric & limit
plot_df(substract, NULL, c(-120,320))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

# circle the anomalies on the plot
position = data.frame(id=seq(1, length(substract$value)), value=substract$value, date=substract$date)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id), value=c(anomalyH$value, anomalyL$value), date=c(anomalyH$date, anomalyL$date))
anomaly = anomaly[!is.na(anomaly$value), ]
points(x = anomaly$date, y=anomaly$value, col="#e15f3f")

correlation anomaly detected

Conclusion

Looking at the initial retention graph, I was able to spot 1 or 2 anomalies when in reality there were 7. I actually remember back in mid-November when I received the automatic anomaly report. I started digging into the problem, and found out that someone had shared my Twitter anomaly review on Hacker News.

Detecting Correlation Among Multiple Time Series (10 Mar 2016)

detect-correlation

To determine the level of correlation between various metrics, we often use the normalized cross-correlation formula.

Definition: Normalized Cross-Correlation

Normalized cross-correlation is calculated using the formula:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

We recommend first understanding normalized cross-correlation before using it, but any statistical language, such as R, can compute it for you easily.
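As a minimal sketch, the formula above translates almost literally into R (x and y are just toy vectors for illustration):

# two toy series, only to illustrate the formula
x = c(1, 2, -2, 4, 2, 3, 1, 0)
y = c(2, 3, -2, 3, 2, 4, 1, -1)

# normalized cross-correlation computed directly from the definition
sum(x * y) / sqrt(sum(x^2) * sum(y^2))   # about 0.947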

Correlations between 2 metrics

In the following graph, the two metrics show some correlation between each other.

correlation

# plot the graph in R
set.seed(15)
a = c(1,2,-2,4,2,3,1,0,3,4,2,3,1)
b = a + rnorm(length(a), sd = 0.4)
plot(ts(b), col="#f44e2e", lwd=3)
lines(a, col="#27ccc0", lwd=3)

Using R to compute the normalized cross-correlation is as easy as calling the ccf function (Cross-Correlation Function). By default, ccf plots the correlation between the two metrics at different time shifts. Time shifting simply moves one of the compared metrics forwards or backwards in time, which is useful for detecting when one metric precedes or follows another.

# compute using the R language
corr = ccf(a,b)
corr

correlation-level

Lag:   -8      -7      -6      -5      -4      -3      -2      -1       0       1       2       3       4       5       6       7       8
CCF:   -0.011  -0.146   0.061   0.266  -0.025  -0.246   0.018  -0.293   0.979  -0.291   0.098  -0.320  -0.043   0.288   0.037  -0.183   0.083

The last R command displays the correlation between the metrics at various time shift values. As expected, the metrics are highly correlated at time shift 0 (no time shift) with a value of 0.979.

Cluster Correlated Metrics Together

We can also use the ccf function to cluster similar metrics together based on how similar they are. To demonstrate, we will cluster metrics from a real data set of 45 graphs called “graphs45.csv”.

First, we need to compute the correlation level between every possible pair of graphs. This is what the “correlationTable” function does. (To reproduce this example you must download the data set graphs45.csv.)

correlationTable = function(graphs) {
  cross = matrix(nrow = length(graphs), ncol = length(graphs))
  for(graph1Id in 1:length(graphs)){
    graph1 = graphs[[graph1Id]]
    print(graph1Id)
    for(graph2Id in 1:length(graphs)) {
      graph2 = graphs[[graph2Id]]
      if(graph1Id == graph2Id){
        break;
      } else {
        correlation = ccf(graph1, graph2, lag.max = 0)
        cross[graph1Id, graph2Id] = correlation$acf[1]
      }
    }
  }
  cross
}

graphs = read.csv("graphs45.csv")
corr = correlationTable(graphs)

It took around 20 seconds to compute all the correlation possibilities between every pair of graphs. The array corr now contains the correlation table; for example, corr[4,3] gives a correlation level of 0.990 between graph4 and graph3. Such a high correlation level indicates a strong correlation between the graphs. To find metrics with sufficiently high correlation, we choose a minimum correlation level of 0.90.

Let’s find and plot all the metrics that strongly correlate with graph4:

findCorrelated = function(orig, highCorr){
 match = highCorr[highCorr[,1] == orig | highCorr[,2] == orig,]
 match = as.vector(match)
 match[match != orig]
}

highCorr = which(corr > 0.90 , arr.ind = TRUE)
match = findCorrelated(4, highCorr)
match # print 6 12 23 42 44 45  3

Success! Graph4 highly correlates with graphs 6, 12, 23, 42, 44, 45 and 3.

Let’s now plot all the graphs together:

correlate-graphs-cluster

bound = function(graphs, orign, match) {
  graphOrign = graphs[[orign]]
  graphMatch = graphs[match]
  allValue = c(graphOrign)
  for(m in graphMatch){
    allValue = c(allValue, m)
  }
  c(min(allValue), max(allValue))
}

plotSimilar = function(graphs, orign, match){
  lim = bound(graphs, orign, match)

  graphOrign = graphs[[orign]]
  plot(ts(graphOrign), ylim=lim, xlim=c(1,length(graphOrign)+25), lwd=3)
  title(paste("Similar to", orign, "(black bold)"))

  cols = c()
  names = c()
  for(i in 1:length(match)) {
    m = match[[i]]
    matchGraph = graphs[[m]]
    lines(x = 1:length(matchGraph), y=matchGraph, col=i)

    cols = c(cols, i)
    names = c(names, paste0(m))
  }
  legend("topright", names, col = cols, lty=c(1,1))
}

plotSimilar(graphs, 4, match)

Conclusion

Here at anomaly.io, finding cross-correlation is one of the first steps in detecting unusual patterns in your data. Subtracting two correlated metrics should result in an almost flat signal. If the flat signal (or the gap between the curves) suddenly crosses a certain level, you can trigger an anomaly. Of course this is an oversimplification and the reality is much more complex, but it’s a good foundation to work from.
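As a rough sketch of that idea (deliberately simplified; the two metrics and the injected break are made up for illustration):

# two hypothetical correlated metrics, with one point where the correlation breaks
set.seed(42)
x = sin(seq(0, 10, by = 0.1)) + rnorm(101, sd = 0.05)
y = x + rnorm(101, sd = 0.05)
y[60] = y[60] + 1                      # inject a broken correlation

gap = x - y                            # roughly flat while the correlation holds
limit = 3 * sd(gap)                    # simple 3-sigma band around the mean gap
which(abs(gap - mean(gap)) > limit)    # indices where the gap leaves the band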

Understanding Cross-Correlation, Auto-Correlation, Normalization and Time Shift (8 Mar 2016)

normalized-cross-correlation-auto-shift

To detect the correlation of time series we often use auto-correlation, cross-correlation or normalized cross-correlation. Let’s study these techniques to understand them better.

Definition:

  • Cross-correlation is the comparison of two different time series to detect if there is a correlation between metrics with the same maximum and minimum values. For example: “Are two audio signals in phase?”
  • Normalized cross-correlation is also the comparison of two time series, but using a different scoring result. Instead of simple cross-correlation, it can compare metrics with different value ranges. For example: “Is there a correlation between the number of customers in the shop and the number of sales per day?”
  • Auto-correlation is the comparison of a time series with itself at a different time. It aims, for example, to detect repeating patterns or seasonality. For example: “Is there weekly seasonality on a server website?” “Does the current week’s data highly correlate with that of the previous week?”
  • Normalized auto-correlation is the same as normalized cross-correlation, but for auto-correlation, thus comparing one metric with itself at a different time.
  • Time Shift can be applied to all of the above algorithms. The idea is to compare a metric to another one with various “shifts in time”. Applying a time shift to the normalized cross-correlation function will result in a “normalized cross-correlation with a time shift of X”. This can be used to answer questions such as: “When many customers come in my shop, do my sales increase 20 minutes later?”

Cross-Correlation

To detect a level of correlation between two signals we use cross-correlation. It is calculated simply by multiplying the two time series together point by point and summing the result.

In the following example, graphs A and B are cross-correlated but graph C is not correlated to either.

similar

# plot the graph in R
a = c(1,2,-2,4,2,3,1,0)
b = c(2,3,-2,3,2,4,1,-1)
c = c(-2,0,4,0,1,1,0,-2)

plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
lines(c, col="#273ecc", lwd=2)
legend("topright", c("a","b","c"), 
       col=c("#f44e2e","#27ccc0","#273ecc"), lty=c(1), lwd = 2)

Using the cross-correlation formula above we can calculate the level of correlation between series.

$$corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n]$$

$$\begin{align}
corr(a, b) & = 1*2+2*3+-2*-2+4*3+2*2+3*4+1*1+0*-1 \\
& = 41
\end{align}$$

$$\begin{align}
corr(a, c) & =1*-2+2*0+-2*4+4*0+2*1+3*1+1*0+0*-2 \\
& =-5
\end{align}$$

Graphs A and B correlate, with a high value of 41.
Graphs A and C don’t correlate, having a low value of -5.

# compute using the R language
corr_ab = sum(a*b) # equal 41
corr_ac = sum(a*c) # equal -5

Normalized Cross-Correlation

There are three problems with cross-correlation:

  1.  It is difficult to understand the scoring value.
  2. Both metrics must have the same amplitude. If Graph B has the same shape as Graph A but values two times smaller, the correlation will not be detected.
    corr(a, a/2) =  19.5
  3. Due to the formula, a zero value will not be taken into account, since 0*0=0 and 0*200=0.

To solve these problems we use normalized cross-correlation:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

Using this formula let’s compute the normalized cross-correlation of AB and AC.

$$\begin{align}
norm\_corr(a,b) &= \dfrac{1*2+2*3+-2*-2+4*3+2*2+3*4+1*1+0*-1}{\sqrt{(1+4+4+16+4+9+1+0)*(4+9+4+9+4+16+1+1)}} \\
& = \dfrac{41}{\sqrt{(39)*(48)}} \\
& = 0.947
\end{align}$$

$$\begin{align}
norm\_corr(a,c) & =\dfrac{1*-2+2*0+-2*4+4*0+2*1+3*1+1*0+0*-2}{\sqrt{(1+4+4+16+4+9+1+0)*(4+0+16+0+1+1+0+4)}} \\
& =\dfrac{-5}{\sqrt{(39)*(26)}} \\
& =-0.157
\end{align}$$

Graphs A and B correlate, with a high value of 0.947.
Graphs A and C don’t correlate, showing a low value of -0.157.

  • Normalized cross-correlation scoring is easy to understand:
    – The higher the value, the higher the correlation is.
    – The maximum value is 1 when two signals are exactly the same:
    norm_corr(a,a)=1
    – The minimum value is -1 when two signals are exactly opposite:
    norm_corr(a, -a) = -1
  • Normalized cross-correlation can detect the correlation of two signals with different amplitudes: norm_corr(a, a/2) = 1.
    Notice we have perfect correlation between signal A and the same signal with half the amplitude!

# compute using the R language
norm_corr_ab = sum(a*b) / sqrt(sum(a^2)*sum(b^2)) #equal 0.947
norm_corr_ac = sum(a*c) / sqrt(sum(a^2)*sum(c^2)) #equal -0.157
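
To check the amplitude claim from the list above, we can apply the same formula to a and a half-amplitude copy of itself:

# compute using the R language
sum(a * (a/2)) / sqrt(sum(a^2) * sum((a/2)^2))   # equals 1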

Auto-Correlation

Auto-correlation is very useful in many applications; a common one is detecting repeatable patterns due to seasonality.

The following graph clearly shows repeating patterns every 8 data points. Indeed, looking at the R code, it’s a repeatable sequence of the numbers 1 through 8 with some random noise in the mix.

auto-correlation

# compute using the R language
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)
plot(ts(ar), col="#f44e2e", lwd=2)

Let’s compute the auto-correlation between the signal and itself at a time shift of 4 and time shift of 8. The following graphs clearly show a high auto-correlation at time shift 8, but not at time shift 4.

no-auto-correlation auto-correlation

# plot using the R language
ar4 = ar[1:(length(ar)-4)]
ar4_shift = ar[5:length(ar)]
plot(ts(ar4), col="#f44e2e", lwd=2, xlim=c(0,78))
lines(ar4_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 4"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

ar8 = ar[1:(length(ar)-8)]
ar8_shift = ar[9:length(ar)]
plot(ts(ar8), col="#f44e2e", lwd=2, xlim=c(0,72))
lines(ar8_shift, col="#27ccc0", lwd=2)
legend("topright", c("original","shift 8"), col = c("#f44e2e","#27ccc0"), lty=c(1,1))

To compute, we can use the same formula as cross-correlation (see above).

# compute using the R language
corr_arar4 = sum(ar4*ar4_shift) #equals 1130.705
corr_arar8 = sum(ar8*ar8_shift) #equals 1456.428

For a time shift of 8, the auto-correlation is higher than for a time shift of 4: we have detected seasonality with a period of 8.

Normalized Auto-Correlation

We discussed earlier the advantages of normalized cross-correlation. In the same way, we can compute the normalized auto-correlation with time shifts of 4 and 8:

# compute using the R language
norm_auto_arar4 = sum(ar4*ar4_shift) / sqrt(sum(ar4^2)*sum(ar4_shift^2)) #equal 0.726
norm_auto_arar8 = sum(ar8*ar8_shift) / sqrt(sum(ar8^2)*sum(ar8_shift^2)) #equal 0.981

Normalized auto-correlation makes it very obvious that the signal repeats in a similar manner every 8 data points.

Correlation with Time Shift

All correlation techniques can be modified by applying a time shift. For example, it is very common to perform a normalized cross-correlation with time shift to detect if a signal “lags” or “leads” another.

To process a time shift, we correlate the original signal with another one moved by x elements to the right or left, just as we did for auto-correlation.

To detect if two metrics are correlated with a time shift we need to compute all the possible time shifts. Fortunately, the R language can compute all the correlations with time shift very quickly.
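
Before relying on ccf, here is a minimal sketch of what a single time shift means: drop k points from one end of each series and apply the normalized formula. The helper below is only illustrative; ccf uses a slightly different normalization, so its values won’t match exactly.

# normalized correlation between x and y at a single time shift k (illustrative helper)
norm_corr_shift = function(x, y, k) {
  if (k >= 0) {
    xs = x[1:(length(x) - k)]        # x without its last k points
    ys = y[(1 + k):length(y)]        # y moved k points to the left
  } else {
    xs = x[(1 - k):length(x)]        # x moved |k| points to the left
    ys = y[1:(length(y) + k)]        # y without its last |k| points
  }
  sum(xs * ys) / sqrt(sum(xs^2) * sum(ys^2))
}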

Normalized Cross-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Cross-Correlation for lags from -4 to 4
a = c(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)
b = c(1,2,3,3,0,1,2,3,4,0,1,1,4,4,0,1,2,3,4,0)

#show graph
plot(ts(a), col="#f44e2e", lwd=2)
lines(b, col="#27ccc0", lwd=2)
legend("topright", c("a","b"), 
       col=c("#f44e2e","#27ccc0"), lty=c(1), lwd = 2)

r = ccf(a,b, lag.max = 4)
r #show correlation values

cross-correlation

ACF-lag

Lag:   -4      -3      -2      -1       0       1       2       3       4
CCF:    0.862   0.021  -0.547  -0.423   0.000   0.867   0.127  -0.466  -0.393

As we expected from the graph above, the metrics highly correlate with a time shift of 1.

Normalized Auto-Correlation with Time Shift

# using R language 
library(stats)

# Normalized Auto-Correlation for lags from -10 to 10
set.seed(5)
ar = rep(c(1,2,3,4,5,6,7,8), 8) + rnorm(8*8, sd = 0.7)

#display
plot(ts(ar), col="#f44e2e", lwd=2)

r = acf(ar, lag.max = 10)
r # show correlation values

auto-correlation

auto-correlation with time shift

Lag:    0       1       2       3       4       5       6       7       8       9       10
ACF:    1.000   0.335  -0.122  -0.304  -0.369  -0.374  -0.226   0.187   0.789   0.306  -0.120

The series repeats every 8 data points and, as expected, the auto-correlation is high when the series is compared to itself at a time shift of 8.

Conclusion

Here at anomaly.io, we commonly use both cross-correlation and auto-correlation, which are building blocks for detecting unusual patterns in your data. Because auto-correlation can detect the seasonality of a metric, we can apply a range of anomaly detection algorithms such as seasonal decomposition of time series or seasonal adjustment of a time series. When a cross-correlation is found, we can detect anomalies when the correlation between the series breaks.

Be sure to also read “Detecting Correlation Among Multiple Time Series”.

Change Point Detection with Seasonal Time Series (8 Feb 2016)

change-point-seasonality

Previously, we looked at using Twitter Breakout (EDM) to detect anomalies. Like the popular E-Divisive, EDM detects mean shifts and changes in distribution. Both algorithms work with seasonal time series, but perform even better without seasonality.

Change Point Doesn’t Work (Well) with Seasonality

To demonstrate the “weakness” of change point detection on seasonal data, let’s generate a fake seasonal time series. Then we will try to detect the anomaly using two different change point detection algorithms: EDM and E-Divisive.

# install E-Divisive and EDM
install.packages("Rcpp")
install.packages("ecp")
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(Rcpp)
library(ecp)
library(BreakoutDetection)

createDay = function(noise=0) {
 point=seq(0, pi, by=0.02)
 connection=sin(point)
 noise = rnorm(length(point), sd = noise)
 return(connection+noise)
}

createDays = function(totalDays, noise=0) {
 allDays = c()
 for (day in 1:totalDays ) {
 allDays = c(allDays, createDay(noise))
 }
 return(allDays)
}

set.seed(1234)
p1 = createDays(3, 0.2)
anomaly = createDays(1, 0)*2 + rnorm(158, sd = 0.07)
days = c(p1, anomaly)
plot(as.ts(days))

# EDM - fail
res = breakout(days, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - fail
ediv = e.divisive(as.matrix(days), min.size=158, alpha=1)
plot(as.ts(days))
abline(v=ediv$estimates,col="blue")

fail-change-point-EDM fail-change-point-E-Divisive

As we can see, due to the seasonality of the time series, traditional change point detection doesn’t work very well.

Removing the Seasonality

To use change point detection effectively, we need to remove the seasonality from our time series. To do that, we need to know the period of the seasonality; in this case, we know it to be 158 data points per day. If we don’t, it is possible to estimate the period using a Fourier transform. We also need to know whether the time series is multiplicative or additive; in our example it is additive, but most of the time real series are multiplicative. With this information, we can decompose the series to remove the seasonality, and then run the change point detection again to get a successful result.
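
For reference, here is a minimal sketch of how the period could be estimated when it isn’t known, using a periodogram on the days series built above (the peak frequency is only an estimate and should land close to 158):

# estimate the seasonality period from the spectrum (Fourier view)
spec = spec.pgram(days, plot = FALSE)
1 / spec$freq[which.max(spec$spec)]   # dominant period, expected near 158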

# remove the seasonal component estimated over the whole (anomalous) series
decomposed_days = decompose(ts(days, frequency = 158), "additive")
noSeasonal = days - decomposed_days$seasonal

# alternative: estimate the seasonal component on the three clean days only,
# then repeat it over the four days before subtracting
decomposed_days = decompose(ts(p1, frequency = 158), "additive")
noSeasonal = days - rep(decomposed_days$seasonal[0:158], 4)

# EDM - success
res = breakout(noSeasonal, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - success
ediv = e.divisive(as.matrix(noSeasonal), min.size=158, alpha=1)
plot(as.ts(noSeasonal))
abline(v=ediv$estimates,col="blue")

success-EDM success-E-Divisive

With the seasonality removed, both EDM and E-Divisive perform noticeably better.

Anomaly Detection with Twitter Breakout (20 Jan 2016)

AnomalyDetectionBreakout

We previously tested the Twitter Anomaly Detection package using the R language. Now let’s take a look at Twitter Breakout Detection.

What is Twitter Breakout Detection?

This Twitter package is intended to detect changes in time series. It is described as E-Divisive with Medians (EDM). It is supposed to:

  • detect divergence (mean shifts and ramp ups) as well as changes in distribution;
  • stay robust in the presence of anomalies;
  • run roughly 3.5 times faster than other breakout detection methods such as E-Divisive.

To detect divergence the algorithm uses a clustering technique (mean shift clustering). As EDM is non-parametric, the data doesn’t need to follow any specific distribution; it can adapt to the current distribution. As a result, it can be used to detect when the distribution changes.

We previously explained how moving medians are robust to anomalies. The 3.5x speed improvement compared to E-Divisive is in part due to the use of interval trees to approximate the median very efficiently. However, this isn’t the only reason for the speed increase.

Let’s try it out ourselves.

Divergence Detection (Mean Shift, Ramp Up)

Mean Shift

A sudden jump in the time series is called a mean shift and represents the time series switching from one steady state to another. A good example (used by Twitter) is to imagine CPU utilization suddenly jumping from 40% to 60%. Let’s generate a fake time series to test the detection process.

# install BreakoutDetection
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

set.seed(123)
p1 = rnorm(60, mean = 0, sd = .4)
p2 = rnorm(60, mean = 6, sd = .4)
p3 = rnorm(60, mean = 0, sd = .4)
p4 = rnorm(60, mean = 3, sd = .4)
p5 = rnorm(60, mean = 0, sd = .4)
p6 = rnorm(60, mean = 6, sd = .4)
p7 = rnorm(60, mean = 0, sd = .4)
all = c(p1,p2,p3,p4,p5,p6,p7)
plot(as.ts(all), col = "#27ccc0", lwd = 4)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

meanShift

Indeed, we were able to detect the 7 mean divergences, also called breakouts.
A simple help(breakout) command tells us more about the package.

min.size: The minimum number of observations between change points.
method: Method must be one of either “amoc” (At Most One Change) or “multi” (Multiple Changes). For “amoc”, at most one change point location will be returned.
degree: The degree of the penalization polynomial. Degree can take the values 0, 1, and 2. The default value is 1.
beta: A real number constant used to further control the amount of penalization. This is the default form of penalization; it will be used if neither beta nor percent are supplied. The default value is beta=0.008.

Ramp Up

A ramp up is a slow transition from one steady state to another. An example would be CPU utilization slowly transitioning from 40% to 60% over time. Let’s detect this kind of divergence.

ramp = 40
set.seed(1234)
p1 = rnorm(100, mean = 0, sd = .4)
p2 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p3 = rnorm(100, mean = 6, sd = .4)
p4 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p5 = rnorm(100, mean = 0, sd = .4)
p6 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p7 = rnorm(100, mean = 6, sd = .4)
p8 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p9 = rnorm(100, mean = 0, sd = .4)
all = c(p1,p2,p3,p4, p5,p6,p7,p8,p9)
plot(as.ts(all), col = "#27ccc0", lwd = 2)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

ramp-up

Even with the ramp ups, breakout detection still identifies the 7 distinct states. It’s very powerful to be able to see a ramp up as a transition and not as a breakout of its own.

Detect Changes in Distribution

EDM doesn’t make assumptions about the distribution of the time series. Instead, it learns the current distribution and uses it as a reference. When the distribution suddenly changes, EDM can detect the variation.

set.seed(123456) 
p1 = rnorm(1000, mean = 1)
p2 = rgamma(1000, shape = 1)
p3 = rpois(1000, lambda = 3)/2 - .5
p4 = runif(1000) + .5
p5 = rexp(1000)
p6 = rweibull(1000, shape = 1)

all = c(p1, p2, p3, p4, p5, p6)
plot(as.ts(all))
abline(v=seq(100,1000*6,100),col="blue")

res = breakout(all, min.size=200, method='multi', beta=0.00018, degree=1, plot=TRUE)
res$plot

distribution-change

In the example above we switch alternately from the Normal distribution to the Gamma, Poisson, Uniform, Exponential and Weibull distributions. With some (difficult) fine tuning, EDM can detect the change from the Normal to the Gamma distribution and from the Uniform to the Exponential. Unfortunately, it doesn’t detect the change from the Poisson to the Uniform distribution or from the Exponential to the Weibull. We believe this is a side effect of its robustness to anomalies: the running median smooths away part of the signal. Without anomalies, E-Divisive performs better; the blue lines show where E-Divisive reports a change in distribution.

#install package
install.packages("ecp")
library(ecp)

# ~ 30 min on 2.6Ghz CPU
ediv = e.divisive(as.matrix(all), min.size=200, alpha=1)
plot(as.ts(all))
abline(v=ediv$estimates,col="blue")

distribution-change-E-Divisive

3.5× Greater Speed Than Other Breakout Detection Methods

E-Divisive detects changes in distribution as soon as they occur, but is very slow compared to the EDM algorithm. One of the reasons EDM is so much faster is its use of interval trees to approximate the median.

Data points   E-Divisive   EDM
6,000         29 min       24 s
600           7 s          1 s

Based on only two tests, we can see that EDM performs much faster than E-Divisive.
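
Those figures come from only two runs; here is a sketch of how such a comparison could be reproduced on the 6,000-point distribution example above (actual timings will depend on your hardware):

# rough timing comparison on the series "all" from the distribution example
system.time(breakout(all, min.size = 200, method = 'multi', beta = 0.00018, degree = 1))
system.time(e.divisive(as.matrix(all), min.size = 200, alpha = 1))   # expect this to take far longer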

Robustness in the Presence of Anomalies

As explained earlier, EDM stands for E-Divisive with Medians. In a previous article, we showed how the running median is robust to anomalies. As a result, an anomaly in the time series isn’t detected as a mean shift.

set.seed(12345)
p = rnorm(20, mean = 1)
all = c(p, 25, p, 30, p, 10, p, 20, p, 70, p , 50, p) #Add Anomalies

res = breakout(all, min.size=40, method='multi', beta=0.001, degree=1, plot=TRUE)
res$plot

anomaly-not-detected

Anomalies aren’t detected as “breakouts” or mean shifts. EDM is, indeed, robust to anomalies.

Detecting Anomalies with Moving Median Decomposition (12 Jan 2016)

anomaly-decomposition-median

Time series decomposition splits a time series into seasonal, trend and random residual time series. The trend and the random time series can both be used to detect anomalies. But detecting anomalies in an already anomalous time series isn’t easy.

TL;DR

When working on an anomalous time series:

  • don’t extract the trend with a moving average: the anomalies get averaged into it;
  • use a moving median instead, as it is robust to outliers and yields a much cleaner trend and residual for anomaly detection.

The Problem with Moving Averages

In the blog entry on time series decomposition in R, we learned that the algorithm uses a moving average to extract the trend of a time series. This is perfectly fine for time series without anomalies, but in the presence of outliers the moving average is seriously affected, because the anomalies get embedded in the trend.

In this demonstration, we will first detect anomalies using decomposition with a moving average. We’ll see that this doesn’t work well, and so will try detecting anomalies using decomposition with a moving median to get better results.

About the data: webTraffic.csv reports the number of page views per day over a period of 103 weeks (almost 2 years). To make the data more interesting, we added some (extra) anomalies to it. Looking at the time series, we clearly see a seasonality of 7 days; there is less traffic on weekends.

To decompose a seasonal time series, the seasonality time period is needed. In our example, we know the seasonality to be 7 days. If unknown, it is possible to determine the seasonality of a time series. Last but not least, we need to know if the time series is additive or multiplicative. Our web traffic is multiplicative.

To sum up our web traffic:

  • Seasonality of 7 days (over 103 weeks)
  • Multiplicative time series
  • Download here: webTraffic.csv

set.seed(4)
data <- read.csv("webTraffic.csv", sep = ",", header = T)
days = as.numeric(data$Visite)
for (i in 1:45 ) {
 pos = floor(runif(1, 1, 50))
 days[i*15+pos] = days[i*15+pos]^1.2
}
days[510+pos] = 0
plot(as.ts(days))

webTraffic-anomalies

Moving Average Decomposition (Bad Results)

Before going any further, make sure to import the data. Also, I recommend being sure that you understand how time series decomposition works. The stats package provides the handy decompose function in R.

1 – Decomposition

Because the time series contains anomalies, the trend extracted during the decomposition becomes completely wrong: the anomalies are averaged into the trend.

# install required library
install.packages("FBN")
library(FBN)

decomposed_days = decompose(ts(days, frequency = 7), "multiplicative")
plot(decomposed_days)

moving-average-decomposition

2 – Using Normal Distribution to Find Minimum and Maximum

We can apply the normal distribution to the random noise to detect anomalies. Values more than 4 standard deviations below or above the mean can be considered outliers.

random = decomposed_days$random
min = mean(random, na.rm = T) - 4*sd(random, na.rm = T)
max = mean(random, na.rm = T) + 4*sd(random, na.rm = T)

plot(as.ts(as.vector(random)), ylim = c(-0.5,2.5))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-average-limits

3 – Plotting Anomalies

Let’s find the values more than 4 standard deviations from the mean and plot them as anomalies. As expected, the calculation misses many anomalies, because the decomposition algorithm relies on a moving average.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
anomaly = anomaly[!is.na(anomaly$value), ]

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-average-anomalies

Moving Median Decomposition (Good Results)

In our second approach, we will do the decomposition using a moving median instead of a moving average. This requires understanding how the moving median is robust to anomalies and how time series decomposition works. Also, again make sure to import the data. The moving median removes the anomalies without altering the time series too much.

1 – Applying the Moving Median

The trend obtained with the moving median (see the graphic below) is a bit “raw” but expresses the underlying trend better.

#install library
install.packages("forecast")
library(forecast)
library(stats)

trend = runmed(days, 7)
plot(as.ts(trend))

moving-median-trends

2 – Using the Normal Distribution to Find the Minimum and Maximum

As the decomposition formula expresses, removing the trend and seasonality from the original time series leaves random noise. Since this noise should be normally distributed, we can apply the normal distribution to detect anomalies; as before, values more than 4 standard deviations from the mean can be considered outliers. Because the random time series still contains the anomalies, we need to estimate the standard deviation without taking them into account. Once again, we use the moving median to exclude the outliers.

detrend = days / as.vector(trend)
m = t(matrix(data = detrend, nrow = 7))
seasonal = colMeans(m, na.rm = T)
random = days / (trend * seasonal)
rm_random = runmed(random[!is.na(random)], 3)

min = mean(rm_random, na.rm = T) - 4*sd(rm_random, na.rm = T)
max = mean(rm_random, na.rm = T) + 4*sd(rm_random, na.rm = T)
plot(as.ts(random))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-median-limits

3 – Plotting Anomalies

As before, values over or under 4 times the standard deviation are plotted as anomalous. This works much better; we detect almost every anomaly! Once again, this is because the moving median is robust to anomalies.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
points(x = anomaly$id, y =anomaly$value, col="#e15f3f")

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-median-anomalies

Here is an alternative plot displaying the same results: values outside the blue area can be considered anomalous.

high = (trend * seasonal) * max
low = (trend * seasonal) * min
plot(as.ts(days), ylim = c(0,200))
lines(as.ts(high))
lines(as.ts(low))

x = c(seq(1, length(high)), seq(length(high), 1))
y = c(high, rev(low))
length(x)
length(y)
polygon(x,y, density = 50, col='skyblue')

moving-median-limits2

How to Seasonally Adjust a Time Series in R (3 Dec 2015)

seasonal-adjustment

What is seasonal adjustment? Why is it important, and when should we use it? How should it be done? In this article, we will answer these and many other questions related to seasonality in data.

Why Seasonally Adjust Data ?

Many industries experience fluctuations in various metrics based on the time of year. This means that it is not possible to effectively assess performance by comparing data from one time of year to data from another. Furthermore, these seasonal fluctuations can sometimes be so large that they can mask important business trends hiding in the data.

Consider the sales of an ice cream parlor, for example. It is normal for this sort of business to have higher sales during the warm summer months than during the winter, so you cannot compare January’s sales figures to those from July. Furthermore, the oscillations due to seasonality can be so large that they make business trends hard to see. If sales are up in the most recent month of July, for example, was this due to seasonal variation or an actual increase in sales? To get these answers we need to remove “seasonality” from the data, a process called seasonal adjustment.

Airline Passenger Numbersair-passager-1949-1960

Airline Passenger Numbers – Seasonally Adjustedseasonaly-adjusted-air-passager-1949-1960

The images above show airline passenger numbers from 1949 to 1960. The chart on the left shows the raw data, which illustrates substantial seasonality: people travel more in the summer. The seasonal variations make it difficult to observe potentially noteworthy fluctuations. The seasonally adjusted chart on the right has removed the seasonal patterns from the raw data, allowing us to see the actual trend in passenger activity over time. In so doing, we’ve exposed an anomaly in the year 1960 that was not readily noticeable in the raw data.

Airline Passenger Numbers
air-passager-1996-2015

Airline Passenger Numbers – Seasonally Adjustedseasonaly-adjusted-air-passager-1996-2015

Sometimes an anomaly is so large that it is obvious even in the raw data. For example, the graph above left shows airline passenger numbers from 1996 to 2015. The severe drop in late 2001 was due to the 9/11 attack on the United States, which caused all of North American air space to be closed for several days. But the seasonally adjusted time series on the right makes some more subtle variances easier to see, such as the dip starting in 2008 caused by the Great Recession.

How to Seasonally Adjust Time Series

To seasonally adjust a time series we must first find the seasonality. Performing a time series decomposition will “break down” a time series into multiple sub-time series, one of which will contain the seasonality. The decompose( ) function in R does the heavy lifting here, but there are two requirements to use this function:

  • You must know if you’re dealing with an additive or multiplicative model; read about time series decomposition to know which one to use.
  • You must know the period of the seasonality. This may be a fixed period such as daily, weekly, monthly, or yearly. If unknown, it is also possible to detect the seasonality mathematically.

Another article can help you understand how to perform a time series decomposition by explaining seasonal adjustment formulas:

Additive:
Seasonally Adjusted = Time series – Seasonal

Multiplicative:
Seasonally Adjusted = Time series / Seasonal

install.packages("fpp")
library(fpp)
data(ausbeer)
ts_beer = ts(ausbeer, frequency = 4, start = 1956)

decompose_beer = decompose(ts_beer, "additive")
adjust_beer = ts_beer - decompose_beer$seasonal
plot(adjust_beer)

The seasonality is annual, with four data points per year (frequency = 4). As an example, here is the seasonally adjusted time series for Australian beer production:

Australian Beer Production – Seasonally Adjustedseasonaly-adjusted-beer

install.packages("Ecdat")
library(Ecdat)
data(AirPassengers)
ts_air = ts(AirPassengers, frequency = 12, start = 1949)

decompose_air = decompose(ts_air, "multiplicative")
adjust_air = ts_air / decompose_air$seasonal
plot(adjust_air, lwd = 2)

This is annual seasonality with 12 data points per year (frequency = 12). Here again, as an example, is the seasonally adjusted time series for airline passengers:

Airline Passenger Numbers – Seasonally Adjustedseasonaly-adjusted-air-passager-1996-2015

Conclusion

Seasonally adjusted time series provide a way to understand the underlying trends in data by removing the “noise” of seasonal fluctuations, so outliers and anomalies are easier to see. Just as removing seasonality makes problems easier to spot with your eyes, it also makes them easier for the computer. This is one of many techniques we use at Anomaly.io to automatically detect problems in monitored data.

Extracting Seasonality and Trend from Data: Decomposition Using R (1 Dec 2015)

time-series-decomposition-seasonal-trend

Time series decomposition works by splitting a time series into three components: seasonality, trend and random fluctuation. To show how this works, we will study the decompose( ) and STL( ) functions in the R language.

Understanding Decomposition

Decompose One Time Series into Multiple Series

Time series decomposition is a mathematical procedure which transforms a time series into multiple different time series. The original time series is often split into 3 component series:

  • Seasonal: Patterns that repeat with a fixed period of time. For example, a website might receive more visits during weekends; this would produce data with a seasonality of 7 days.
  • Trend: The underlying trend of the metrics. A website increasing in popularity should show a general trend that goes up.
  • Random: Also called “noise”, “irregular” or “remainder”, this is the residual of the original time series after the seasonal and trend components are removed.

Additive or Multiplicative Decomposition?

To achieve successful decomposition, it is important to choose between the additive and multiplicative models, which requires analyzing the series. For example, does the magnitude of the seasonality increase when the time series increases?

additive-modelAustralian beer production – The seasonal variation looks constant; it doesn’t change when the time series value increases. We should use the additive model.

multiplicative-modelAirline Passenger Numbers – As the time series increases in magnitude, the seasonal variation increases as well. Here we should use the multiplicative model.

Additive:
Time series = Seasonal + Trend + Random

Multiplicative:
Time series = Trend * Seasonal * Random

The decomposition formula varies a little based on the model.

Step-by-Step: Time Series Decomposition

We’ll study the decompose( ) function in R. As a decomposition function, it takes a time series as a parameter and decomposes it into seasonal, trend and random time series. We’ll reproduce step-by-step the decompose( ) function in R to understand how it works. Since there are variations between the two models, we’ll use two examples: Australian beer production (additive) and airline passenger numbers (multiplicative).

Step 1: Import the Data

Additive

As mentioned previously, a good example of additive time series is beer production. As the metric values increase, the seasonality stays relatively constant.

Multiplicative

Monthly airline passenger figures are a good example of a multiplicative time series. The more passengers there are, the more seasonality is observed.

install.packages("fpp")
library(fpp)
data(ausbeer)
timeserie_beer = tail(head(ausbeer, 17*4+2),17*4-4)
plot(as.ts(timeserie_beer))

additive-model

install.packages("Ecdat")
library(Ecdat)
data(AirPassengers)
timeserie_air = AirPassengers
plot(as.ts(timeserie_air))

multiplicative-model

Step 2: Detect the Trend

To detect the underlying trend, we smooth the time series using the “centred moving average“. To perform the decomposition, it is vital to use a moving window of the exact size of the seasonality. Therefore, to decompose a time series we need to know the seasonality period: weekly, monthly, etc. If you don’t know this figure, you can detect the seasonality using a Fourier transform.

Additive

Australian beer production clearly follows annual seasonality. As it is recorded quarterly, there are 4 data points recorded per year, and we use a moving average window of 4.

Multiplicative

The process here is the same as for the additive model. Airline passenger number seasonality also looks annual. However, it is recorded monthly, so we choose a moving average window of 12.

install.packages("forecast")
library(forecast)
trend_beer = ma(timeserie_beer, order = 4, centre = T)
plot(as.ts(timeserie_beer))
lines(trend_beer)
plot(as.ts(trend_beer))

additive-moving-average
additive-trend

install.packages("forecast")
library(forecast)
trend_air = ma(timeserie_air, order = 12, centre = T)
plot(as.ts(timeserie_air))
lines(trend_air)
plot(as.ts(trend_air))

multiplicative-moving-average
multiplicative-trend

Step 3: Detrend the Time Series

Removing the previously calculated trend from the time series will result into a new time series that clearly exposes seasonality.

Additive

Multiplicative

detrend_beer = timeserie_beer - trend_beer
plot(as.ts(detrend_beer))

additive-detrend

detrend_air = timeserie_air / trend_air
plot(as.ts(detrend_air))

multiplicative-detrend

Step 4: Average the Seasonality

From the detrended time series, it’s easy to compute the average seasonality. We add the seasonality together and divide by the seasonality period. Technically speaking, to average together the time series we feed the time series into a matrix. Then, we transform the matrix so each column contains elements of the same period (same day, same month, same quarter, etc…). Finally, we compute the mean of each column. Here is how to do it in R:

Additive

Quarterly seasonality: we use a matrix of 4 rows. The average seasonality is repeated 16 times to create the graphic to be compared later (see below)

Multiplicative

Monthly seasonality: we use a matrix of 12 rows.
The average seasonality is repeated 12 times to create the graphic we will compare later (see below)

m_beer = t(matrix(data = detrend_beer, nrow = 4))
seasonal_beer = colMeans(m_beer, na.rm = T)
plot(as.ts(rep(seasonal_beer,16)))

additive-seasonality

m_air = t(matrix(data = detrend_air, nrow = 12))
seasonal_air = colMeans(m_air, na.rm = T)
plot(as.ts(rep(seasonal_air,12)))

multiplicative-sesonal

Step 5: Examining Remaining Random Noise

The previous steps have already extracted most of the data from the original time series, leaving behind only “random” noise.

Additive

The additive formula is “Time series = Seasonal + Trend + Random”, which means “Random = Time series – Seasonal – Trend”

Multiplicative

The multiplicative formula is “Time series = Seasonal * Trend * Random”, which means “Random = Time series / (Trend * Seasonal)”

random_beer = timeserie_beer - trend_beer - seasonal_beer
plot(as.ts(random_beer))

additive-random

random_air = timeserie_air / (trend_air * seasonal_air)
plot(as.ts(random_air))

multiplicative-random

Step 6: Reconstruct the Original Signal

The decomposed time series can logically be recomposed using the model formula to reproduce the original signal. Some data points will be missing at the beginning and the end of the reconstructed time series, due to the moving average windows which must consume some data before producing average data points.

Additive

The additive formula is “Time series = Seasonal + Trend + Random”, so we rebuild the original series as Trend + Seasonal + Random.

Multiplicative

The multiplicative formula is “Time series = Seasonal * Trend * Random”, so we rebuild the original series as Trend * Seasonal * Random.

recomposed_beer = trend_beer+seasonal_beer+random_beer
plot(as.ts(recomposed_beer))

additive-recomposed

recomposed_air = trend_air*seasonal_air*random_air
plot(as.ts(recomposed_air))

multiplicative-recomposed

DECOMPOSE( ) and STL(): Time Series Decomposition in R

To make life easier, some R packages provide decomposition with a single line of code. As expected, our step-by-step decomposition produces the same results as the DECOMPOSE( ) and STL( ) functions (see the graphs).

Additive

The only requirement: seasonality is quarterly (frequency = 4)

Using the DECOMPOSE( ) function:

Multiplicative

The only requirement: seasonality is monthly (frequency = 12)

ts_beer = ts(timeserie_beer, frequency = 4)
decompose_beer = decompose(ts_beer, "additive")

plot(as.ts(decompose_beer$seasonal))
plot(as.ts(decompose_beer$trend))
plot(as.ts(decompose_beer$random))
plot(decompose_beer)

additive-decompose

ts_air = ts(timeserie_air, frequency = 12)
decompose_air = decompose(ts_air, "multiplicative")

plot(as.ts(decompose_air$seasonal))
plot(as.ts(decompose_air$trend))
plot(as.ts(decompose_air$random))
plot(decompose_air)

multiplicative-decompose

Now using the STL( ) function:

ts_beer = ts(timeserie_beer, frequency = 4)
stl_beer = stl(ts_beer, "periodic")
seasonal_stl_beer   <- stl_beer$time.series[,1]
trend_stl_beer     <- stl_beer$time.series[,2]
random_stl_beer  <- stl_beer$time.series[,3]

plot(ts_beer)
plot(as.ts(seasonal_stl_beer))
plot(trend_stl_beer)
plot(random_stl_beer)
plot(stl_beer)

additive-stl

Conclusion

Decomposition is often used to remove the seasonal effect from a time series. It provides a cleaner way to understand trends. For instance, lower ice cream sales during winter don’t necessarily mean a company is performing poorly. To know whether or not this is the case, we need to remove the seasonality from the time series. Here at Anomaly.io we detect anomalies, and we use seasonally adjusted time series to do so. We also use the random (also called remainder) time series from the decomposition to detect anomalies and outliers.

Moving Median is Robust to Anomalies (26 Nov 2015)

moving-median

It can be difficult to detect the underlying trend of a time series in the presence of anomalies, due to unwanted noise. Fortunately, there are techniques to take into account those anomalies, so you can work with this kind of time series. One of them is the moving median.

What is Robustness?

Implicit or explicit assumptions are often made in statistics. These might be about the distribution, the independence, the randomness or other characteristics of the data. Those assumptions might not always be true, but often they are “true enough.” For example, we can assume a time series to be normally distributed when in reality it is only almost normally distributed. We can then use this assumption to perform mathematical computations.

So what is robust statistics? It’s all about the “almost”. A non-robust statistical formula will yield results that make no sense when the assumption isn’t 100% true. Robust statistics, on the other hand, will produce results that make sense, perhaps with some small error, even though the assumption wasn’t 100% true. It also means that a robust statistic should not be affected by outliers or anomalies.

In Robust Statistics, 2nd Edition, Peter J. Huber and Elvezio M. Ronchetti give this good definition of robustness:

“A minor error in the mathematical model should cause only a small error in the final conclusions.”
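
A tiny illustration of the idea (a toy example of our own, not from the book): a single outlier drags the mean far away, while the median barely moves.

# one large outlier in an otherwise well-behaved sample
x = c(10, 11, 9, 10, 12, 10, 500)
mean(x)     # about 80: heavily distorted by the outlier
median(x)   # 10: essentially unaffected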


Moving Average Versus Moving Median

Moving averages are commonly used to smooth out the noise of a time series. This works well, but the presence of anomalies can distort the underlying trend calculation. A robust statistic shouldn’t be affected by outliers or anomalies, so let’s demonstrate that the moving median is such a statistic.

First, using R language, we create an anomalous seasonal time series:

createDay <- function(noise=0) {
  point=seq(0, pi, by=0.02)
  connection=sin(point)
  noise <- rnorm(length(point), 0, noise)
  return(connection+noise)
}

createDays <- function(totalDays, noise=0) {
  allDays <- c()
  for (day in 1:totalDays ) {
    allDays <- c(allDays, createDay(noise))
  }
  return(allDays)
}

set.seed(10)
days = createDays(3, 0.05)
p1 <- days[0:270]
p2 <- 2.5  # Anomaly here
p3 <- days[(280+2):length(days)]
strangeDay <- append(append(p1, p2), p3)

plot(as.ts(strangeDay), ylim=c(0, 3), col="#e15f3f", lwd = 2)

time-serie-anomaly

Raw time series with outlier

Moving Average Isn’t Robust

The commonly used moving average isn’t robust: it smooths the anomaly but doesn’t remove it. As a result, the outlier is smeared across the moving window; in this example, the anomaly is spread over 3 values.

#install and import lib
install.packages("forecast")
library(forecast)

movingAverage = ma(strangeDay, 3)

plot(as.ts(movingAverage), ylim=c(0, 3), lwd = 2)

time-serie-moving-average

Moving average keeps the outlier

Moving Median Is Robust!

The moving median isn’t as popular as the moving average, but offers some interesting capabilities. The moving median provides a more robust estimate of the trend compared to the moving average. It isn’t affected by outliers: in fact, it removes them!

movingMedian = runmed(strangeDay, 3)
plot(as.ts(movingMedian), ylim=c(0, 3),
     col="#27ccc0", lwd = 2)

The moving median matches the definition of a robust statistic. “A minor error [the anomaly] in the mathematical model [the moving median] should cause only a small error in the final conclusions”.

time-serie-moving-median

Moving median “removes” the outlier
