Salesforce Anomaly Detection Using Anomaly.io

detect anomalies in salesforce

Monitoring Key Performance Indicators (KPIs) is essential to running a successful business. As one example, you should examine your lead generation KPI several times a day, to allow you to detect and correct problems as quickly as possible.

But you’re busy; you don’t have time to watch KPI indicators all day long. That’s where Anomaly.io comes in. By combining our detection algorithms with your Salesforce data, you can automatically detect problems and notify the appropriate personnel to ensure that speedy corrective action is taken.

From Salesforce to InfluxDB (Anomaly.io Protocol)

Let’s see how we can forward Salesforce data to Anomaly.io for automatic anomaly detection.

First, install the StreamSets Data Collector, which can easily subscribe to and stream Salesforce events. Next, log into your fresh StreamSets install with the default login/password combination “admin/admin”.
Now install the packages “Salesforce Library” and “InfluxDB 0.9+” from the Package Manager:

StreamSets package manager

Create a new Pipeline. Click the “Stage Library” (top right icon), select the origin “Salesforce”, then the processor “Groovy Evaluator”, and finally, the destination “InfluxDB”.

salesforce origin

Now link the nodes together:

Anomaly Flow Salesforce InfluxDB

Salesforce Node

This node collects historical “Opportunities” and subscribes to upcoming ones through the Salesforce Streaming API.

Enter this data on the Salesforce tab:

  • Username: salesforceUserName
  • Password: MyPassword+securityToken
    e.g: MyPassword1FPeZD9z[…]
    Reset your salesforce security token
  • API version: 36.0
  • Query Existing Data: Yes
  • Subscribe for Notifications: Yes

IMPORTANT: Be sure to enter “36.0” for the API, as currently this is the only compatible version.

On the Query tab:

  • Use Bulk API: Yes
  • SOQL Query:
    SELECT Id, AccountId, CreatedDate, CloseDate, Amount
    FROM Opportunity WHERE Id > '${OFFSET}' ORDER BY Id

And finally, on the Subscribe tab:

  • Push Topic: OpportunityCreate

Groovy Node

This node transforms Salesforce output into InfluxDB input. You need to do the following:

  • Set the required AccountId, if it is missing.
  • Translate Salesforce time into InfluxDB time.
  • Set a “measurement” to save the events into.

Example code for the Groovy tab:

for (record in records) {
 try {
  record.value['measurement'] = 'opportunity'
  Date date = Date.parse("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", record.value['CreatedDate'])
  record.value['CreatedDate'] = date.getTime()
  if(record.value['AccountId'] == null){
   record.value['AccountId'] = "unknown"
  }
  output.write(record)
 } catch (e) {
  log.error(e.toString(), e)
  error.write(record, e.toString())
 }
}

InfluxDB Node

This node sends the data to Anomaly.io, which uses the InfluxDB protocol, for automatic anomaly detection.

Enter the following on the InfluxDB tab:

  • URL: http://events.anomaly.io:8086
  • User: yourAnomalyUser
  • Password: yourAnomalyPasswd
  • Database Name: salesforce
  • Auto-create Database: Yes
  • Record Mapping: Custom Mappings
  • Measurement Field: /measurement
  • Tag Fields: /AccountId
  • Value Fields: /Amount /CreatedDate /CloseDate

Register Salesforce PushTopic

In our StreamSets configuration, the Salesforce origin will request all historical data, then subscribe to a channel we called “OpportunityCreate”. To set this up, we need to create this PushTopic in Salesforce:

  1. Login to Salesforce
  2. Open the Developer Console.
  3. Select Debug | Open Execute Anonymous Window.
  4. In the Enter Apex Code window, paste in the following Apex code, then click Execute.

PushTopic pushTopic = new PushTopic();
pushTopic.Name = 'OpportunityCreate';
pushTopic.Query = 'SELECT Id, AccountId, CreatedDate, CloseDate, Amount FROM Opportunity';
pushTopic.ApiVersion = 36.0;
pushTopic.NotifyForOperationCreate = true;
pushTopic.NotifyForOperationUpdate = false;
pushTopic.NotifyForOperationUndelete = false;
pushTopic.NotifyForOperationDelete = false;
pushTopic.NotifyForFields = 'Referenced';
insert pushTopic;

Run Streaming from Salesforce to InfluxDB (or Anomaly.io)

Click outside any node, then go to Configuration > Error Records and select “(Library: Basic)”.  Now we can finally run the pipeline:

Datasets run Anomaly Detection

You should see something like the following. Note that here a total of 32 “Opportunities” were processed, and StreamSets is waiting for new events to come in:

anomaly pipe

Salesforce historical data has now been imported into InfluxDB. What about future “Opportunities”? Let’s check to make sure the Salesforce Streaming API works:

  1.  Login to workbench.developerforce.com
  2. Go to Data > Insert | Object Type “Opportunity” | Next
  3. Enter:
    AccountId: 0010Y00000DX4EgQAL
    Amount: 100
    CloseDate: 2019-05-27T10:43:35.000Z
    Name: My Opportunity
    StageName: Open

Data should be processed within one second and then sent to InfluxDB or Anomaly.io.

Detect Anomalies in Salesforce

Using Anomaly.io

You can spend a lot of money on R&D, or you can save that money and use anomaly.io to detect exceptions and issues in your data. After a week of intense machine learning, our algorithm will pick the best anomaly detection method for your data. Our system can automatically apply techniques such as k-means clustering, anomaly correlation, breakout detection and trend decomposition in real time, along with others.

Simply forward your data to anomaly.io; we’ll keep an eye on it.

Using Custom Anomaly Detection

Anomaly detection is complex, and especially difficult when you need to work with unknown problems. Your users and business will evolve, so you can’t rely on just setting simple thresholds.

Before you start, always check to make sure that your data is compatible with the algorithms you are using. For example, a common error is to use an algorithm intended for normally distributed time series when your data isn’t normally distributed.
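
For instance, a quick sanity check in R before relying on a normal-distribution-based detector might look like the following minimal sketch (the metric vector below is simulated for illustration, not Salesforce data):

# sketch: check whether a monitored series looks roughly normally distributed
set.seed(42)
metric = rnorm(500, mean = 100, sd = 10)  # replace with your own series

hist(metric, breaks = 30)   # visual check: is the histogram roughly bell-shaped?
shapiro.test(metric)        # formal check: a large p-value is consistent with normality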

You can also forward your Salesforce data into an open source solution such as Twitter Anomaly Detection, Twitter Breakout, or Skyline, or develop your own.

Detecting Anomalies in Correlated Time Series

anomaly in time series correlation

Monitoring key performance indicators (KPIs), sales or any other product data means working within an ecosystem where very often you will see metrics correlating with each other. When a normal correlation between two metrics is broken, we have reason to suspect something strange is happening.

As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, the new users are shown in green and the returning users in red. Clearly, something strange happens in the middle of November. Let’s use some techniques to find out more!

Step 1: Collect the Data

With Google Analytics it’s fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer streaming our Google Analytics in real time. But for the sake of simplicity, we will use a plain CSV file for this exercise.

You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.

google-analytics-correlation

#R language
install.packages("dplyr")
library(dplyr)

#load data
csv = read.csv("/path/to/new-vs-returning-visitor.csv")

#load new visitor
new = filter(csv,Segment=="New Users")
new_data = new[4]$Sessions
new_date = as.Date(lapply(new[1], as.character)$Day_Index, "%m/%d/%y")
new_df = data.frame(date=new_date, value=new_data)

#load returning visitor
return = filter(csv,Segment=="Returning Users")
return_data = return[4]$Sessions
return_date = as.Date(lapply(return[1], as.character)$Day_Index, "%m/%d/%y")
return_df = data.frame(date=return_date, value=return_data)

#plot
plot_df = function(new_df, return_df = NULL, ylim=c()) {
	plot(value ~ date, new_df, xaxt="n", type="l", col="#00c7b0", ylim=ylim, lwd=2)
	date_axis = new_df$date[1:(length(new_df$date)/30.5)*30.5]+2
	axis(1, date_axis, format(date_axis, "%b"), cex.axis = .7)
	if(!is.null(return_df)) {
		lines(value ~ date, return_df, xaxt="n", type="l", col="#ff0816", lwd=2)
	}
}
plot_df(new_df, return_df)
legend("topleft", c("new visitor","returning visitor"),
      col=c("#00c7b0","#ff0816"), lty=c(1,1))

anomaly correlation R

Now you have two variables in your R environment:

  • new_df: new visitors
  • return_df: returning visitors

Step 2: Verify the Cross-Correlation

Our analysis is based on the time series being correlated, so before going any further, let’s ensure that this is the case. To do so, we check using cross-correlation. Check this article if you want to understand the cross-correlation algorithm.

BONUS: Find correlated metrics in multiple time series

library(stats)
library(forecast)
ma_new_data = ma(new_data, 3)
ma_new_data = ma_new_data[!is.na(ma_new_data)]
ma_return_data = ma(return_data, 3)
ma_return_data = ma_return_data[!is.na(ma_return_data)]

corr = ccf(ma_new_data, ma_return_data, lag.max = 0)
corr # display 0.876

As the moving average is robust to anomalies, we use it to remove potential outliers before computing the correlation. The cross-correlation function (CCF) returns a very high value of 0.876. Clearly, the time series are correlated.

Step 3: Subtract the Time Series

We can compute that there are 1.736 times more new visitors than returning visitors. Let’s align the metrics at the same level.

multi = sum(ma_new_data/ma_return_data)/length(ma_return_data)
multi #display 1.736529
align_return_df = return_df
align_return_df$value = align_return_df$value*multi
plot_df(new_df, align_return_df)
legend("topleft", c("new visitor","returning visitor x1.736"), 
     col=c("#00c7b0","#ff0816"), lty=c(1,1))

align correlation

substract = return_df
substract$value = new_df$value - return_df$value*multi
plot_df(substract)

subtract

Step 4: Find Outliers in Correlated Time Series

The resulting signal looks like “noise”. Plotting the histogram shows a normally distributed signal. We already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.

hist(substract$value, breaks = 50)

histogram normal distribution

The 3-sigma rule states that nearly all values lie within three standard deviations of the mean. Everything outside this range can be treated as anomalous.

#compute 3 sigma minimal & maximal
min = mean(substract$value, na.rm = T) - 3*sd(substract$value, na.rm = T)
max = mean(substract$value, na.rm = T) + 3*sd(substract$value, na.rm = T)

# plot metric & limit
plot_df(substract, NULL, c(-120,320))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

# plot circles around the anomalies
position = data.frame(id=seq(1, length(substract$value)), value=substract$value, date=substract$date)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id), value=c(anomalyH$value, anomalyL$value), date=c(anomalyH$date, anomalyL$date))
anomaly = anomaly[!is.na(anomaly$value), ]
points(x = anomaly$date, y=anomaly$value, col="#e15f3f")

correlation anomaly detected

Conclusion

Looking at the initial retention graph, I was able to spot 1 or 2 anomalies, when in reality I had 7. I actually remember back in mid-November when I received the automatic anomaly report. I started digging into the problem and found out that someone had shared my Twitter anomaly detection review on Hacker News.

Detecting Correlation Among Multiple Time Series

detect-correlation

To determine the level of correlation between various metrics we often use the normalized cross-correlation formula.

Definition: Normalized Cross-Correlation

Normalized cross-correlation is calculated using the formula:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

We recommend first understanding normalized cross correlation before using it, but any statistical language, such as R, can easily compute it for you.
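
As a sketch of what the formula computes, here is a direct R implementation (the vectors below are illustrative only). Note that R’s built-in ccf function additionally centers the series around their means, so its lag-0 value is the Pearson correlation rather than this raw ratio.

# direct implementation of the normalized cross-correlation formula above
norm_corr = function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

x = c(1, 2, -2, 4, 2, 3, 1)
y = x + rnorm(length(x), sd = 0.3)
norm_corr(x, y)  # close to 1 for strongly correlated signals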

Correlations between 2 metrics

In the following graph, the two metrics show some correlation between each other.

correlation

# plot the graph in R
set.seed(15)
a = c(1,2,-2,4,2,3,1,0,3,4,2,3,1)
b = a + rnorm(length(a), sd = 0.4)
plot(ts(b), col="#f44e2e", lwd=3)
lines(a, col="#27ccc0", lwd=3)

Using R to compute the normalized cross-correlation is as easy as calling the function ccf (Cross-Correlation Function). By default, ccf plots the correlation between the two metrics at different time shifts. Time shifting simply moves one of the compared metrics forward or backward in time, which is useful for detecting when one metric precedes or succeeds another.

# compute using the R language
corr = ccf(a,b)
corr

correlation-level

Lag:   -8     -7     -6     -5     -4     -3     -2     -1      0      1      2      3      4      5      6      7      8
Corr:  -0.011 -0.146  0.061  0.266 -0.025 -0.246  0.018 -0.293  0.979 -0.291  0.098 -0.320 -0.043  0.288  0.037 -0.183  0.083

The last R command displays the correlation between the metrics at various time shift values. As expected, the metrics are highly correlated at time shift 0 (no time shift) with a value of 0.979.

Cluster Correlated Metrics Together

We can also use the ccf function to cluster similar metrics together based on how similar they are. To demonstrate this better, we will cluster metrics from a real data set of 45 graphs called “graphs45.csv”.

First, we need to compute the correlation level between every possible pair of graphs. This is what the “correlationTable” function does. (To reproduce this example you must download the data set graphs45.csv.)

correlationTable = function(graphs) {
  cross = matrix(nrow = length(graphs), ncol = length(graphs))
  for(graph1Id in 1:length(graphs)){
    graph1 = graphs[[graph1Id]]
    print(graph1Id)
    for(graph2Id in 1:length(graphs)) {
      graph2 = graphs[[graph2Id]]
      if(graph1Id == graph2Id){
        break;
      } else {
        correlation = ccf(graph1, graph2, lag.max = 0)
        cross[graph1Id, graph2Id] = correlation$acf[1]
      }
    }
  }
  cross
}

graphs = read.csv("graphs45.csv")
corr = correlationTable(graphs)

It took around 20 seconds to compute all the correlation possibilities between every pair of graphs. The array corr now contains the correlation table; for example, corr[4,3] gives a correlation level of 0.990 between graph4 and graph3. Such a high correlation level indicates a strong correlation between the graphs. To find metrics with sufficiently high correlation, we choose a minimum correlation level of 0.90.

Let’s find and plot all the metrics that strongly correlate with graph4:

findCorrelated = function(orig, highCorr){
 match = highCorr[highCorr[,1] == orig | highCorr[,2] == orig,]
 match = as.vector(match)
 match[match != orig]
}

highCorr = which(corr > 0.90 , arr.ind = TRUE)
match = findCorrelated(4, highCorr)
match # print 6 12 23 42 44 45  3

Success! Graph4 highly correlates with graphs 6, 12, 23, 42, 44, 45 and 3.

Let’s now plot all the graphs together:

correlate-graphs-cluster

bound = function(graphs, orign, match) {
  graphOrign = graphs[[orign]]
  graphMatch = graphs[match]
  allValue = c(graphOrign)
  for(m in graphMatch){
    allValue = c(allValue, m)
  }
  c(min(allValue), max(allValue))
}

plotSimilar = function(graphs, orign, match){
  lim = bound(graphs, orign, match)

  graphOrign = graphs[[orign]]
  plot(ts(graphOrign), ylim=lim, xlim=c(1,length(graphOrign)+25), lwd=3)
  title(paste("Similar to", orign, "(black bold)"))

  cols = c()
  names = c()
  for(i in 1:length(match)) {
    m = match[[i]]
    matchGraph = graphs[[m]]
    lines(x = 1:length(matchGraph), y=matchGraph, col=i)

    cols = c(cols, i)
    names = c(names, paste0(m))
  }
  legend("topright", names, col = cols, lty=c(1,1))
}

plotSimilar(graphs, 4, match)

Conclusion

Here at anomaly.io, finding cross-correlation is one of the first steps in detecting unusual patterns in your data. Subtracting two correlated metrics should result in an almost flat signal. If suddenly the flat signal (or the gap between the curves) hits a certain level, you can trigger an anomaly. Of course this is an oversimplification and the reality is much more complex, but it’s a good foundation to work from.
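
A minimal sketch of that idea in R (all data and names below are simulated for illustration): learn the normal gap between the two scaled metrics on known-good history, then flag points where the gap leaves a 3-sigma band.

# sketch: alert when the gap between two correlated metrics widens
set.seed(1)
metricA = 100 + rnorm(300)
metricB = 2 * metricA + rnorm(300)              # a metric correlated with metricA
ratio   = mean(metricA[1:200] / metricB[1:200]) # learned on "normal" history
gap     = metricA - metricB * ratio             # should stay almost flat

min_gap = mean(gap[1:200]) - 3 * sd(gap[1:200])
max_gap = mean(gap[1:200]) + 3 * sd(gap[1:200])
which(gap < min_gap | gap > max_gap)            # indices to flag as anomalous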

Change Point Detection with Seasonal Time Series

change-point-seasonality

Previously, we looked at using Twitter Breakout (EDM) to detect anomalies. As with the popular E-Divisive, EDM detects mean shifts and changes in distribution. Both algorithms work with seasonal time series, but perform even better without seasonality.

Change Point Doesn’t Work (Well) with Seasonality

To demonstrate the “weakness” of change point, let’s generate some fake seasonal time series. Then we will try to detect anomalies using two different change point detection algorithms: EDM and E-Divisive.

# install E-Divisive and EDM
install.packages("Rcpp")
install.packages("ecp")
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(Rcpp)
library(ecp)
library(BreakoutDetection)

createDay = function(noise=0) {
 point=seq(0, pi, by=0.02)
 connection=sin(point)
 noise = rnorm(length(point), sd = noise)
 return(connection+noise)
}

createDays = function(totalDays, noise=0) {
 allDays = c()
 for (day in 1:totalDays ) {
 allDays = c(allDays, createDay(noise))
 }
 return(allDays)
}

set.seed(1234)
p1 = createDays(3, 0.2)
anomaly = createDays(1, 0)*2 + rnorm(158, sd = 0.07)
days = c(p1, anomaly)
plot(as.ts(days))

# EDM - fail
res = breakout(days, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - fail
ediv = e.divisive(as.matrix(days), min.size=158, alpha=1)
plot(as.ts(days))
abline(v=ediv$estimates,col="blue")

fail-change-point-EDM fail-change-point-E-Divisive

As we can see, due to the seasonality of the time series, traditional change point detection doesn’t work very well.

Removing the Seasonality

To use change point detection effectively, we need to remove the seasonality from our time series. And to do that, we need to know the period of the seasonality. In this case, we know the seasonality to be 158 data points per day. If we don’t know, it’s possible to calculate the seasonality using a Fourier Transform. We also need to know if the time series is multiplicative or additive. In our example, I have an additive time series, but most of the time it is multiplicative. With this information, we can now decompose to remove the seasonality. Finally, we can run the change point detection again to get a successful result.

# remove the seasonality estimated from the full series
decomposed_days = decompose(ts(days, frequency = 158), "additive")
noSeasonal = days - decomposed_days$seasonal

# alternative: estimate the seasonality from the first three (normal) days only,
# then repeat that daily pattern over all four days
decomposed_days = decompose(ts(p1, frequency = 158), "additive")
noSeasonal = days - rep(decomposed_days$seasonal[0:158],4)

# EDM - success
res = breakout(noSeasonal, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - success
ediv = e.divisive(as.matrix(noSeasonal), min.size=158, alpha=1)
plot(as.ts(noSeasonal))
abline(v=ediv$estimates,col="blue")

success-EDM success-E-Divisive

With the seasonality removed, breakout EDM and E-Divisive work a little bit better.

Anomaly Detection with Twitter Breakout

AnomalyDetectionBreakout

We previously tested the Twitter Anomaly Detection package using the R language. Now let’s take a look at Twitter Breakout Detection.

What is Twitter Breakout Detection?

This Twitter package is intended to detect changes in time series. It is described as E-Divisive with Medians (EDM). It is supposed to:

  • Detect divergence (mean shift and ramp up)
  • Detect changes in distribution
  • Run about 3.5× faster than other breakout detection methods
  • Be robust in the presence of anomalies

To detect divergence the algorithm uses a clustering technique (mean shift clustering). As EDM is non-parametric, the data doesn’t need to follow any specific distribution; it can adapt to the current distribution. As a result, it can be used to detect when the distribution changes.

We previously explained how moving medians are robust to anomalies. The 3.5× speed improvement compared to E-Divisive is in part due to the use of interval trees to approximate the median very efficiently. However, this isn’t the only reason for the speed increase.
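
As a small R illustration of that robustness (not part of the original benchmark), a running median shrugs off an isolated spike while a moving average does not:

# running median vs moving average on a series with a single spike
library(forecast)                       # for ma()
x   = c(rep(10, 20), 100, rep(10, 20))  # one outlier in the middle
med = runmed(x, 5)                      # running median, window of 5
avg = ma(x, 5)                          # moving average, window of 5
max(med, na.rm = TRUE)                  # 10: the spike is ignored
max(avg, na.rm = TRUE)                  # ~28: the spike leaks into the average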

Let’s try it out ourselves.

Divergence Detection (Mean Shift, Ramp Up)

Mean Shift

A sudden jump in the time series is called a mean shift and represents the time series switching from one steady state to another. A good example (used by Twitter) is to imagine CPU utilization suddenly jumping from 40% to 60%. Let’s generate a fake time series to test the detection process.

# install BreakoutDetection
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

set.seed(123)
p1 = rnorm(60, mean = 0, sd = .4)
p2 = rnorm(60, mean = 6, sd = .4)
p3 = rnorm(60, mean = 0, sd = .4)
p4 = rnorm(60, mean = 3, sd = .4)
p5 = rnorm(60, mean = 0, sd = .4)
p6 = rnorm(60, mean = 6, sd = .4)
p7 = rnorm(60, mean = 0, sd = .4)
all = c(p1,p2,p3,p4,p5,p6,p7)
plot(as.ts(all), col = "#27ccc0", lwd = 4)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

meanShift

Indeed, we were able to detect 7 mean divergences, also called breakouts.
A simple help(breakout) command tells us more about the package.

min.size: The minimum number of observations between change points.
method: Method must be one of either “amoc” (At Most One Change) or “multi” (Multiple Changes). For “amoc.” at most one change point location will be returned.
degree: The degree of the penalization polynomial. Degree can take the values 0, 1, and 2. The default value is 1.
beta: A real number constant used to further control the amount of penalization. This is the default form of penalization; it will be used if neither beta nor percent are supplied. The default value is beta=0.008.

Ramp Up

A ramp up is a slow transition from one steady state to another. An example would be CPU utilization slowly transitioning from 40% to 60% over time. Let’s detect this kind of divergence.

ramp = 40
set.seed(1234)
p1 = rnorm(100, mean = 0, sd = .4)
p2 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p3 = rnorm(100, mean = 6, sd = .4)
p4 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p5 = rnorm(100, mean = 0, sd = .4)
p6 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p7 = rnorm(100, mean = 6, sd = .4)
p8 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p9 = rnorm(100, mean = 0, sd = .4)
all = c(p1,p2,p3,p4, p5,p6,p7,p8,p9)
plot(as.ts(all), col = "#27ccc0", lwd = 2)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

ramp-up

Even with the ramp up, breakout detection still finds 7 distinct states. It’s very powerful to be able to see the ramp up as a transition rather than as a breakout of its own.

Detect Changes in Distribution

EDM doesn’t make assumptions about the distribution of the time series. Instead, it learns the current distribution and uses it as a reference. When the distribution suddenly changes, EDM can detect the variation.

set.seed(123456) 
p1 = rnorm(1000, mean = 1)
p2 = rgamma(1000, shape = 1)
p3 = rpois(1000, lambda = 3)/2 - .5
p4 = runif(1000) + .5
p5 = rexp(1000)
p6 = rweibull(1000, shape = 1)

all = c(p1, p2, p3, p4, p5, p6)
plot(as.ts(all))
abline(v=seq(100,1000*6,100),col="blue")

res = breakout(all, min.size=200, method='multi', beta=0.00018, degree=1, plot=TRUE)
res$plot

distribution-change

In the example above we switch successively from the Normal distribution to the Gamma, Poisson, Uniform, Exponential and Weibull distributions. With some (difficult) fine tuning, EDM can detect the change from the Normal to the Gamma distribution and from the Uniform to the Exponential. Unfortunately, it doesn’t detect the change from the Poisson to the Uniform distribution or from the Exponential to the Weibull. We believe this is due to its robustness in the presence of anomalies: the running median removes outliers. Without anomalies, E-Divisive performs better. The blue line shows where E-Divisive reports a change in distribution.

#install package
install.packages("ecp")
library(ecp)

# ~ 30 min on 2.6Ghz CPU
ediv = e.divisive(as.matrix(all), min.size=200, alpha=1)
plot(as.ts(all))
abline(v=ediv$estimates,col="blue")

distribution-change-E-Divisive

3.5× Greater Speed Than Other Breakout Detection Methods

E-Divisive detects changes in distribution as soon as they occur, but is very slow compared to the EDM algorithm. One of the reasons why EDM is much faster is due to the use of interval trees to approximate the median.

Data points    E-Divisive    EDM
6000           29 min        24 s
600            7 s           1 s

Based on only two tests, we can see that EDM performs much faster than E-Divisive.

Robustness in the Presence of Anomalies

As explained earlier, EDM stands for E-Divisive with Medians. In a previous article, we showed how the running median is robust to anomalies. As a result, an anomaly in the time series isn’t detected as a mean shift.

set.seed(12345)
p = rnorm(20, mean = 1)
all = c(p, 25, p, 30, p, 10, p, 20, p, 70, p , 50, p) #Add Anomalies

res = breakout(all, min.size=40, method='multi', beta=0.001, degree=1, plot=TRUE)
res$plot

anomaly-not-detected

Anomalies aren’t detected as “breakouts” or mean shifts. EDM is, indeed, robust to anomalies.

Detecting Anomalies with Moving Median Decomposition

anomaly-decomposition-median

Time series decomposition splits a time series into seasonal, trend and random residual time series. The trend and the random time series can both be used to detect anomalies. But detecting anomalies in an already anomalous time series isn’t easy.

TL;DR

When working on an anomalous time series:

  • Don’t extract the trend with a moving average: the anomalies get averaged into the trend.
  • Use a moving median instead; it is robust to anomalies and gives a much better decomposition.

The Problem with Moving Averages

In the blog entry on time series decomposition in R, we learned that the algorithm uses a moving average to extract the trends of time series. This is perfectly fine in time series without anomalies, but in the presence of outliers, the moving average is seriously affected, because the trend embeds the anomalies.

In this demonstration, we will first detect anomalies using decomposition with a moving average. We’ll see that this doesn’t work well, and so will try detecting anomalies using decomposition with a moving median to get better results.

About the data: webTraffic.csv reports the number of page views per day over a period of 103 weeks (almost 2 years). To make the data more interesting, we added some (extra) anomalies to it. Looking at the time series, we clearly see a seasonality of 7 days; there is less traffic on weekends.

To decompose a seasonal time series, the seasonality time period is needed. In our example, we know the seasonality to be 7 days. If unknown, it is possible to determine the seasonality of a time series. Last but not least, we need to know if the time series is additive or multiplicative. Our web traffic is multiplicative.

To sum up our web traffic:

  • Seasonality of 7 days (over 103 weeks)
  • Multiplicative time series
  • Download here: webTraffic.csv

set.seed(4)
data <- read.csv("webTraffic.csv", sep = ",", header = T)
days = as.numeric(data$Visite)
for (i in 1:45 ) {
 pos = floor(runif(1, 1, 50))
 days[i*15+pos] = days[i*15+pos]^1.2
}
days[510+pos] = 0
plot(as.ts(days))

webTraffic-anomalies

Moving Average Decomposition (Bad Results)

Before going any further, make sure to import the data. Also, I recommend being sure that you understand how time series decomposition works. The stats package provides the handy decompose function in R.

1 – Decomposition

Because the time series contains anomalies, the trend extracted during decomposition becomes completely wrong: the anomalies are averaged into the trend.

# install required library
install.packages("FBN")
library(FBN)

decomposed_days = decompose(ts(days, frequency = 7), "multiplicative")
plot(decomposed_days)

moving-average-decomposition

2 – Using Normal Distribution to Find Minimum and Maximum

We can apply the normal distribution to the random noise to detect anomalies. Values under or over 4 times the standard deviation can be considered outliers.

random = decomposed_days$random
min = mean(random, na.rm = T) - 4*sd(random, na.rm = T)
max = mean(random, na.rm = T) + 4*sd(random, na.rm = T)

plot(as.ts(as.vector(random)), ylim = c(-0.5,2.5))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-average-limits

3 – Plotting Anomalies

Let’s find the values either over or under 4 standard deviations and plot them as anomalies. As expected, the calculation misses many anomalies, due to the use of the moving average by the decomposition algorithm.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
anomaly = anomaly[!is.na(anomaly$value), ]

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-average-anomalies

Moving Median Decomposition (Good Results)

In our second approach, we will do the decomposition using a moving median instead of a moving average. This requires understanding how the moving median is robust to anomalies and how time series decomposition works. Also, again make sure to import the data. The moving median removes the anomalies without altering the time series too much.

1 – Applying the Moving Median

The decomposed trend with the moving median (see the graphic below) is a bit “raw” but expresses the trend better.

#install library
install.packages("forecast")
library(forecast)
library(stats)

trend = runmed(days, 7)
plot(as.ts(trend))

moving-median-trends

2 – Using the Normal Distribution to Find the Minimum and Maximum

As the decomposition formula expresses, removing the trend and seasonality from the original time series leaves random noise. As it should be normally distributed, we can apply the normal distribution to detect anomalies. As we saw previously, values under or over 4 times the standard deviation can be considered outliers. As the random time series still contains the anomalies, we need to estimate the standard deviation without taking the anomalies into account. Once again, we will use the moving median to exclude the outliers.

detrend = days / as.vector(trend)
m = t(matrix(data = detrend, nrow = 7))
seasonal = colMeans(m, na.rm = T)
random = days / (trend * seasonal)
rm_random = runmed(random[!is.na(random)], 3)

min = mean(rm_random, na.rm = T) - 4*sd(rm_random, na.rm = T)
max = mean(rm_random, na.rm = T) + 4*sd(rm_random, na.rm = T)
plot(as.ts(random))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-median-limits

3 – Plotting Anomalies

As before, values over or under 4 times the standard deviation are plotted as anomalous. This works much better; we detect almost every anomaly! Once again, this is because the moving median is robust to anomalies.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
points(x = anomaly$id, y =anomaly$value, col="#e15f3f")

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-median-anomalies

Here is an alternative plot displaying the same results: values outside the blue area can be considered anomalous.

high = (trend * seasonal) * max
low = (trend * seasonal) * min
plot(as.ts(days), ylim = c(0,200))
lines(as.ts(high))
lines(as.ts(low))

x = c(seq(1, length(high)), seq(length(high), 1))
y = c(high, rev(low))
length(x)
length(y)
polygon(x,y, density = 50, col='skyblue')

moving-median-limits2

Detecting Seasonality Using Fourier Transforms in R

detect seasonality

Our brains are really fast at recognizing patterns and forms: we can often find the seasonality of a signal in under a second. It is also possible to do this with mathematics, using the Fourier transform.

First, we will explain what a Fourier transform is. Next, we will find the seasonality of a website from its Google Analytics pageview report using the R language.

Fourier Transforms Explained

The Fourier transform decomposes a signal into all the possible frequencies that comprise it. Fine, but how does it really work? (And keep it simple, please?)

 

1 – Pick a Frequency

First, the Fourier transform starts with the smallest possible frequency. For a signal made of 100 points, the smallest possible frequency is 1/100 = 0.01 Hz. Think of a circle turning at a speed of 0.01 Hz, i.e. advancing one hundredth of a turn per point if the points are recorded every second. Just like a clock.

herz

1 circle turn = 0.01 s

 

2 – “Draw” the Signal Value on the Circle

Using the previously selected frequency we “draw” (decompose) the entire signal on the circle.

decompose

1 circle turn = 0.01 s

Notice: When the signal measurement is high, the clock arm of the circle is high.
The above signal will draw something like this:

draw

Signal decomposition at 0.01 Hz

 

3 – Compute the Periodogram

All the measurements are now on the circle in two dimensions; we call these “vectors”. Summing the vectors together gives the final “power” of the frequency: when all the vectors line up and point in the same direction, the sum is large, which means a high “power” for that frequency.

4 – Repeat with Different Frequencies

To complete the Fourier transform we repeat the process with different frequencies: 0.02 Hz, 0.03 Hz … We continue until all the “power” of each possible frequency has been computed. The frequencies with the highest power represent the greatest periodicity!
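
Before turning to the TSA package below, here is a rough sketch of the same idea with R’s built-in FFT, using a simulated signal with a weekly pattern (everything here is illustrative):

# sketch: find the dominant frequency of a signal with the FFT
set.seed(7)
n      = 700
signal = sin(2 * pi * (1:n) / 7) + rnorm(n, sd = 0.3)  # weekly pattern + noise

power = Mod(fft(signal))^2        # "power" of each frequency
freq  = (0:(n - 1)) / n           # frequency in cycles per data point
keep  = 2:(n / 2)                 # skip frequency 0 and the mirrored half
best  = keep[which.max(power[keep])]
1 / freq[best]                    # period in data points: ~7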

Detecting Seasonality (Example)

We just saw that the frequency with the highest “power” represents the primary seasonality of our underlying metric. In this example, William Cox draws at high speed the signal value on the circle for many different frequencies.

Here is how the original signal value appears:
graph

To learn more, I recommend listening to William Cox talk about Fourier transforms:

In this example, a frequency of 2.1 gets a very high “power” value. This is where all the data line up and create a big vector.

From the frequency formula:
T = 1 / f
T = 1 / 2.1 Hz = 476 ms

The signal repeats every 476 ms. This is its seasonality!

Detecting Seasonality using R

My personal tech blog clearly shows some weekly trends: It receives much less traffic during the weekend. As a result, my Google Analytics report shows some sort of weekly periodicity. Let’s try to find the seasonality using the R language.

# Install and import TSA package
install.packages("TSA")
library(TSA)

# read the Google Analytics PageView report
raw = read.csv("20131120-20151110-google-analytics.csv")

# compute the Fourier Transform
p = periodogram(raw$Visite)

periodogram

The periodogram shows the “power” of each possible frequency, and we can clearly see a spike at around frequency 0.14 Hz.

dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
top2 = head(order, 2)

# display the 2 highest "power" frequencies
top2

freq spec
0.142661180 167143.1
0.002743484 109146.8

Frequencies of 0.142661180 Hz and 0.002743484 Hz show seasonality!

# convert frequency to time periods
time = 1/top2$f
time

> time
[1] 7.009615 364.500000

The main seasonality detected is 7 days. A secondary seasonality of 364.5 days was also found. Exactly what we expected! My blog has weekly seasonality as well as annual seasonality.

Detecting Anomalies with Skyline

install skyline anomaly detection

Skyline is free and open source anomaly detection software. Simply plug Graphite metrics into Skyline to detect anomalous behaviour automatically without any configuration.

Skyline Architecture

Skyline is usually set up to use Graphite metrics. This is done by placing a daemon called “carbon-relay” in front of the usual Graphite stack. Carbon-relay will repeat the metrics to multiple hosts: one of them will be the traditional Graphite stack, and the other will be the Skyline stack.

skyline with graphite

Skyline is composed of several parts:

  • Horizon – Responsible for collecting, cleaning, and formatting incoming metrics before pushing to a Redis database.
  • Analyzer – Fetches metrics from Redis and runs mathematical equations to detect anomalies
  • Skyline-webapp – A Django webapp to display an anomaly graph when it occurs

In the above schema, CollectD is used to monitor and push server metrics to carbon-cache. Read more about Graphite architecture.

Install Graphite

Graphite Required

Skyline is commonly used with Graphite. That’s why you need to Install Graphite on CentOS.
You don’t have to run the full Graphite stack; running carbon-relay is enough. But, as carbon-relay is part of Graphite, you still have to install Graphite.

Configure Carbon-relay

Set the list of hosts carbon-relay needs to forward its metrics to.

sudo vim /opt/graphite/conf/relay-rules.conf

# change the file
[...]
destinations = 127.0.0.1:2004, 127.0.0.1:2024
[...]

With this configuration, carbon-relay will forward metrics to Skyline on port 2024 and to carbon-cache (Graphite) on port 2004. Both services run locally in this tutorial. Also, edit the carbon configuration:

sudo vim /opt/graphite/conf/carbon.conf

# change the file
[...]
DESTINATIONS = 127.0.0.1:2004, 127.0.0.1:2024
[...]

Start carbon-relay

Start carbon-relay using systemd:

sudo systemctl restart carbon-relay

Install Skyline

Required

Open a terminal and install a few tools. This includes an Apache server (which is probably already installed), the Redis database, and Python with few mathematical libraries and some compile tools:

sudo yum -y install httpd redis git
sudo yum -y install gcc gcc-c++ git pycairo mod_wsgi
sudo yum -y install python-pip python-devel blas-devel lapack-devel libffi-devel

Download and install Skyline

cd /opt
sudo git clone https://github.com/etsy/skyline.git
cd /opt/skyline

sudo pip install -U six
sudo pip install -r requirements.txt
sudo pip install numpy
sudo pip install scipy
sudo pip install pandas
sudo pip install patsy
sudo pip install statsmodels
sudo pip install msgpack-python

Some of the Python packages might take very long to compile. Be patient, maybe grab a coffee…

Configure Skyline

Don’t forget to create the required directory and configuration:

sudo cp /opt/skyline/src/settings.py.example /opt/skyline/src/settings.py
sudo mkdir /var/log/skyline
sudo mkdir /var/run/skyline
sudo mkdir /var/log/redis
sudo mkdir /var/dump/

Skyline requires some settings, so edit the file:

sudo vim /opt/skyline/src/settings.py

and replace the following with your own values:

  • GRAPHITE_HOST = 'YOUR_GRAPHITE_HOST'
  • HORIZON_IP = '0.0.0.0'
  • WEBAPP_IP = 'YOUR_SKYLINE_HOST_IP'

In my case I replace them with the same IP, as Skyline and Graphite run locally on the same host:

  • GRAPHITE_HOST = '192.168.50.6'
  • HORIZON_IP = '0.0.0.0'
  • WEBAPP_IP = '192.168.50.6'

Start Skyline

The Skyline stack is made up of a Redis database and three Python processes. Start all services:

cd /opt/skyline/bin

sudo redis-server redis.conf
sudo ./horizon.d start
sudo ./analyzer.d start
sudo ./webapp.d start

Access Skyline-WebApp

Open your browser at http://localhost:1500/
Warning! To access a remote IP such as http://remote_ip:1500/ you need to set rules into the CentOS default firewall. Or simply disable the firewall:

#disable firewall
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Skyline-WebApp should be empty as no anomaly will have been detected at first. Don’t worry if some anomalies are listed. This will stabilize with time.

skyline screenshot

Send Metrics to Skyline and Graphite

Carbon-relay now forwards its data to Skyline and carbon-cache. Any metrics sent to carbon-relay should be available in Skyline and Graphite.

Install CollectD

To collect and push metrics to carbon-relay we like to use CollectD. If you haven’t installed CollectD yet, follow these easy instructions:

Configure CollectD

Carbon-relay listens on a different port than carbon-cache. Make sure you edit /opt/collectd/etc/collectd.conf to send the data to port 2013.

sudo vim /opt/collectd/etc/collectd.conf

# edit the file
[...]

   Host "localhost"
   Port "2013"
   Prefix "collectd."
   Protocol "tcp"

[...]

Start CollectD

If you installed the systemd script simply run:

# start with systemd
sudo systemctl restart collectd.service

# or manually
sudo /opt/collectd/sbin/collectd

Test Skyline Integration

Let your Linux idle for at least 10 minutes. Skyline will train itself to recognize low CPU activity as being normal. To create an anomaly, suddenly create some high CPU activity:

# wait at least 10 minutes before creating the anomaly
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &

Open Skyline-webApp in your browser on port 1500. It should detect the anomaly within 10 seconds.

skyline anomaly detected

Test Graphite

Graphite should still work the same as before, since it receives its metrics through carbon-relay. Wait a few seconds and open Graphite-WebApp. Same as previously, you should see some new metrics in the left panel. Open one of them to render it.
Warning! By default, Graphite-WebApp renders a 24-hour graph. To see some data points, zoom in or wait a bit longer.

graphite graph

Troubleshooting

If you followed the above steps exactly, this should work just fine. But in case you don’t see your data in Graphite, try the following.

Check Graphite Troubleshooting

Try the troubleshooting solutions provided in Graphite installation tutorial.

Check Skyline log files

less +F /var/log/skyline/horizon.log
less +F /var/log/skyline/analyzer.log
less +F /var/log/skyline/webapp.log

Check That Every Process is Running

ps aux | grep "carbon-ca\|httpd\|collectd\|(wsgi:graphite)\|horizon-agent.py\|analyzer-agent.py\|horizon-agent.py\|carbon-relay.py\|webapp.py"

# result should look something like:
# [...]
# sudo /opt/collectd/sbin/collectd -C /opt/collectd/etc/collectd.conf -f
# [...]
# python /opt/skyline/bin/../src/horizon/horizon-agent.py start
# [...]
# python /opt/skyline/bin/../src/analyzer/analyzer-agent.py start
# [...]
# python /opt/skyline/bin/../src/webapp/webapp.py restart
# /usr/sbin/httpd -DFOREGROUND
# [...]
# /bin/python bin/carbon-relay.py --instance=a start
# /bin/python bin/carbon-cache.py --instance=a start
# [...]

Conclusion

Skyline is the first of its kind: it detects anomalies and triggers alarms in real time. Skyline is definitely trying to build a better world, where DevOps don’t need to spend their time watching metrics!

But there are some drawbacks. It only detects very obvious anomalies. This isn’t too bad, as many anomalies are obvious, but it will still fail to detect complex anomalies. The idea was to build a solution you can extend with homemade detectors, but no third-party detectors have yet been released, so we are stuck with the basic functionality.

Of course I’m biased, but for easier monitoring and detection of many other anomaly types, I recommend using our product “Anomaly.io”. We believe it beats Skyline in every way.

Anomaly Detection Using K-Means Clustering

kmean anomaly

Monitored metrics very often exhibit regular patterns. When the values are correlated with the time of the day, it’s easier to spot anomalies, but it’s harder when they do not. In some cases, it is possible to use machine learning to differentiate the usual patterns from the unusual ones.

Regular Patterns

Many metrics follow regular patterns correlated to time. Websites, for example, commonly experience high activity during the day and low activity at night. This correlation with time makes it easy to spot anomalies, but it’s more difficult when that correlation is not present.

1. With Time Correlation

When the metric is correlated to time, the key point is to find its seasonality. When today’s pattern is the same as yesterday’s, the seasonality is daily. It is also common to find a seasonality of one week, because Saturday’s patterns often don’t follow Friday’s, but rather those of the Saturday of the previous week. Because many metrics are correlated to human activity, seasonality is often weekly. But it could also be measured in seconds, minutes, hours, months or an arbitrary period.

Subtracting the usual pattern (seasonality) from the overall signal should create a new “difference metric”. Due to randomness, the resulting signal should be pure noise with a mean of zero. Under such conditions, we can apply the normal distribution to detect anomalies. Any values that are too high or too low can be treated as anomalous.
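
A minimal sketch of that approach in R (the daily pattern and the data below are simulated; the heartbeat example later in this post uses Octave/Matlab instead):

# sketch: subtract the known seasonal pattern, then apply 3-sigma to the residual
set.seed(3)
period  = 24                                    # e.g. hourly data with a daily pattern
pattern = sin(2 * pi * (1:period) / period)     # the "usual" shape of one day
metric  = rep(pattern, 14) + rnorm(14 * period, sd = 0.1)
metric[200] = metric[200] + 2                   # inject one anomaly

diffm  = metric - rep(pattern, 14)              # the "difference metric": pure noise when normal
limits = mean(diffm) + c(-3, 3) * sd(diffm)
which(diffm < limits[1] | diffm > limits[2])    # should flag index 200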

substract seasonality

2. Without Time Correlation

A heartbeat has many recurring patterns. The heart beats on average every 0.8s, but this is an average period, not a fixed period of time. Moreover, the period and the value of the signal might change a lot due to physical activity, stress or other effects. So it isn’t possible to just use a period of 0.8 seconds.

So what can we do? Machine learning to the rescue!

Machine Learning Detection

We can group similar patterns into categories using machine learning. Then, we subtract each new beat from its closest category. If the original beat and the category beat are very similar, the result should be pure noise with a mean of zero. As above, it’s now possible to apply the normal distribution to detect anomalies. For an anomalous beat, even the closest category will still be very different. The anomaly will be easy to detect as it will create a peak in the “difference metric”.

This requires 4 steps:

  1. Sliding Window
  2. Clustering
  3. Noise Transform
  4. Detect Anomalies

1. Sliding Window

The first step is to “cut” the normal heartbeat signal into smaller chunks. To get better results later, we are going to use a sliding window with an overlap of 50%. For each window, we will apply a Hamming function to clean the signal.
In the following heartbeat implementation, we choose a sliding window of 60 values, represented by a red rectangle. So, we will move the window by 30 values. For each window, we multiply the heartbeat signal (in green) by the Hamming function (in red). The end result (in green) is saved to create categories in the next step.

hamming sliding window

2. Clustering

The next step is to group together similar patterns produced by the sliding window. We will use one machine learning technique known as k-means clustering using Matlab/Octave or Mahout. This will cluster our signal into a catalogue of 1000 categories. In the following schema, some categories are plotted.

kmean clusters2


3. Noise Transform

For each sliding window, we select the closest category from our catalogue of patterns created previously. Subtracting the sliding window from its category should produce a simple noisy signal. If the signal is repeating and enough clusters were chosen, the whole signal can be turned into a normally distributed noisy signal with a mean of zero. Because this noisy signal is normally distributed, we can apply the normal distribution to detect anomalies. We can plot its histogram, or use the 99.999% percentile, while highlighting the threshold limit to apply to the noisy signal.

substract window to cluster

normal distribution

kmean histogram

4. Detect Anomalies

For each new sliding window, we repeat the above process to turn it into a noise signal. Previously we defined minimum and maximum thresholds; the noisy sliding window should stay within those limits. If not, this is an anomaly! But that’s not all. We also defined the noise signal as being normally distributed with a mean of zero. If it suddenly changes into another distribution, something unusual or anomalous is happening.

heart anomaly animationMin / Max anomaly

graph normal distribution changeddistribution anomaly

normal distribution changeddistribution anomaly

Implementation

1. Requirement

First, you need to download and install Octave (recommended) or Matlab. A few packages will be used in this tutorial. Open an Octave or Matlab shell, then install and import these packages:

% Install package
pkg install -forge io
pkg install -forge statistics
pkg install -forge control
pkg install -forge signal

% Import package
pkg load signal
pkg load statistics

2. Sliding Window

You could use the WFDB Toolbox from physionet.org to download a heartbeat signal and convert it into Octave/Matlab format, but it’s easier to simply download the converted file “16773m.mat”.

Load the signal in Octave/Matlab, convert it into sliding windows, and apply the Hamming function:

% change with your own path
cd /path/to/16773m.mat

% load &amp; plot some heartbeat
signal = load("16773m.mat").val(1,:);
plot(signal(1:1000))

% convert to sliding windows
% http://www.mathworks.com/help/signal/ref/buffer.html
windows = buffer(signal, 60, 30);

% apply hamming function
% http://www.mathworks.com/help/dsp/ref/windowfunction.html
mutate = windows.*hamming(60);
points = mutate.';
cleanPoints = points(2:end-1, :);

% save for Mahout
dlmwrite('heatbeat.txt',cleanPoints,'delimiter',' ','precision','%.4f');

All sliding windows are saved in memory into the “cleanPoints” variable and also the “heatbeat.txt” text file. The variable will be used for clustering into Octave/Matlab; the text file will be used for clustering using Mahout.

3. Clustering

Let’s now find common patterns from the signal. We will use the k-means clustering technique, which is part of the machine learning field. It works by grouping together points based on their nearest mean.

using octave or mahout

3.1. Clustering with Octave or Matlab

When data can fit into RAM, Octave or Matlab is a good choice. When clustering a small quantity of data, such as this heartbeat signal, you should use Octave or Matlab.

Octave and Matlab come with a k-means implementation in the statistics package. As it’s quite slow, we will use a faster implementation called Fast K-means. Download fkmeans.m and save the file into the directory of your Octave/Matlab workspace. Use the “pwd” command to find your workspace path.

If you save fkmeans.m in your current workspace, you should be able to create 1000 clusters:

[~, clusters, ~] = fkmeans(cleanPoints, 1000);

It should take around 30 minutes, so let’s save the file, just in case.

save "1000_clusters_fkmeans.mat" clusters -ascii;

If you have any trouble creating this file, you can download 1000_clusters_fkmeans.mat and simply load it within Octave or Matlab:

clusters = load("1000_clusters_fkmeans.mat");

3.2. Clustering with Mahout

When the data you want to cluster is huge and won’t fit into memory, you should choose a tool designed for this use case, such as Mahout.

First, download and install Mahout. Previously we created a file called “heatbeat.txt” containing all the sliding windows. You have to copy this file into a directory called “testdata”. Then, run the Mahout clustering from your shell:

mkdir testdata
cp /path/to/heatbeat.txt testdata/
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --input testdata --output out_1000 --numClusters 1000 --t1 200 --t2 100 --maxIter 50

A few hours later, the clustering result will be written into the “out_1000” directory. Now export the Mahout result into Octave or Matlab: let’s use cluster2csv.jar to convert this Mahout file into a CSV file.

java -jar cluster2csv.jar out_1000/clusters-50-final/part-r-00000 > 1000_clusters_mahout.csv

Finally, let’s import 1000_clusters_mahout.csv into Octave or Matlab:

clusters = load("1000_clusters_fkmeans.mat");

4. Noise Transform

To transform the original signal into a noise signal, we need a few steps. First, we need a function that can find the closest cluster for a given signal window:

function clusterIndices = nearKmean(clusters, data)
numObservarations = size(data)(1);
K = size(clusters)(1);
D = zeros(numObservarations, K);
for k=1:K
 %d = sum((x-y).^2).^0.5
 D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
[minDists, clusterIndices] = min(D, [], 2);
end

We need to subtract the window from the closest cluster:

function diffWindows = diffKmean(clusterIndices, clusters, data)
diffWindows = [];
for i = 1:length(clusterIndices)
 entry = data(i,:);
 cluster = clusters(clusterIndices(i),:);
 diff = cluster-entry;
 diffWindows = [diffWindows; diff];
end
end

And finally, a function to reconstruct the “difference signal”, also called the “error signal”:

function reconstruct = asSignal(diffWindows)
% Recombine the overlapping difference windows into a single 1-D error signal
rDiffWindows = diffWindows.';
p1 = rDiffWindows(1:size(rDiffWindows,1)/2, 2:end);
p2 = rDiffWindows((size(rDiffWindows,1)/2)+1:size(rDiffWindows,1), 1:end-1);
reconstruct = reshape(p1 + p2, 1, []);
end

Let’s apply the functions above to a sample and draw the error signal:

samples = cleanPoints(20200:20800, :);
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta)
ylim ([-200 600])

graph error signal

5. Detect Anomaly

Let’s simulate the beginning of a heart attack and detect the anomaly.

attack = cleanPoints(10002, :);
attack(35:43) = [100,350,325,250,410,380,300,200,100];
clusterId = nearKmean(clusters, attack);
cluster = clusters(clusterId,:);

plot(attack, "linewidth",1.5, 'Color', [0.1484375,0.7578125,0.71484375]);
ylim([-200 500]);
hold on;
plot(cluster, "linewidth",1.5, 'Color', [0.875486381,0.37109375,0.24609375]);
hold off;

heart attack

Clearly, there is a huge difference between the heart attack and its closest pattern. This plot highlights the anomaly.

diff = attack-cluster;
plot(diff, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

heart attack diff

Previously we established more rules than a simple minimum and maximum: the noisy “error signal” should have a mean of zero and follow a specific normal distribution. Let’s simulate a dysfunction that breaks this rule.

p1 = cleanPoints(10001:10020, :);
p2 = cleanPoints(10020:10040, :);
p2 = p2 - (p2*1.2) + 15;
samples = [p1; p2];
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

not-signal-normal-distribution

The error signal no longer looks normally distributed: it clearly doesn’t follow its usual distribution. It’s an anomaly.

clusterIndices = nearKmean(clusters, p2);
diffWindows = diffKmean(clusterIndices, clusters, p2);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
hist(reconstructDelta, 50);

not normal distribution

Conclusion

This technique is useful only under certain conditions: apply it when the signal doesn’t repeat itself over a fixed period of time. If it does, for example daily or weekly, it is preferable to simply model the seasonality to detect anomalies.

We have only scratched the surface here by treating the error signal as normally distributed. In reality, there are many more ways of detecting anomalies in a normally distributed signal.
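
For instance, one classic option is the three-sigma rule: learn the mean and standard deviation of the error signal on known-normal data, then flag any new value that falls outside mean ± 3 standard deviations. The sketch below is only an illustration of that idea (Java, invented names; the factor of 3 is an assumption, not something derived in this post):

public class ThreeSigmaCheck {
  // Learn mean and standard deviation of the error signal on known-normal data,
  // then flag a new error value that falls outside mean +/- 3 * stdDev.
  public static boolean isAnomalous(double[] normalErrors, double newError) {
    double mean = 0;
    for (double e : normalErrors) mean += e;
    mean /= normalErrors.length;
    double variance = 0;
    for (double e : normalErrors) variance += (e - mean) * (e - mean);
    double stdDev = Math.sqrt(variance / normalErrors.length);
    return Math.abs(newError - mean) > 3 * stdDev;
  }
}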

The post Anomaly Detection Using K-Means Clustering appeared first on Anomaly.

]]>
https://anomaly.io/anomaly-detection-clustering/feed/ 2
Delta Time (Δt) for Anomaly Detectionhttps://anomaly.io/anomaly-detection-delta-time/ https://anomaly.io/anomaly-detection-delta-time/#comments Wed, 17 Jun 2015 08:46:33 +0000 https://anomaly.io/?p=2609 The post Delta Time (Δt) for Anomaly Detection appeared first on Anomaly.

]]>

delta time anomaly

It is common to monitor the number of events that occur in a period of time. Unfortunately, this technique isn’t fast and can fail to detect some anomalies. The alternative is to turn the problem around and study the period of time between events instead.

Poisson Distribution Doesn’t Work

Traditional monitoring records the number of events in a period of time. There are two major problems with this approach:

  • Due to randomness, it is difficult to spot an anomaly
  • When the event rate is low, detection becomes very slow

Previously, we went into detail and solved both problems by studying the time between events, using the Poisson distribution to detect anomalies (please read that article, if you haven’t yet, before continuing). To apply the Poisson distribution, we need to check that we are monitoring a Poisson process. For this to be the case, two conditions must hold.

1. Events need to occur independently

In server monitoring, events don’t always occur independently, due to the underlying architecture. For example, when a user writes to a database table, other users can’t access the same table at the same time. They must wait in a queue, and when the write finishes, the queue is processed. This clearly shows a dependency between events, so the Poisson distribution can’t be applied.

2. Average rate does not change

It is very common to see seasonal patterns in monitoring. Many servers are very busy during the day and much less so at night. This changing rate violates the constant-rate assumption of a Poisson process. One way to deal with this problem is to use the average rate observed at the same time one week ago to predict today’s rate.

Using Percentile of Δt

Poisson anomaly detection was a perfect fit for analyzing the time between events. Unfortunately, it doesn’t work in our case because our data aren’t Poisson distributed. Should we stop studying event rates and go back to event counts? Wait! Don’t give up yet, because there is a simple trick we can use. We are actually only interested in extremely rare event rates, so it doesn’t matter whether the data is Poisson distributed or normally distributed. What really matters is that the distribution has a lower and/or an upper tail where values are extremely rare.

The main idea is to monitor the interval of time between the event at time t and the previous event at time t-1. The time difference between both events is known as Δt = t – t-1. We can also monitor the time difference over the n previous events, which is Δt = t – t-n. Using n=1 provides the fastest detection. Using n>1 allows detection of smaller rate variations, but introduces some delay, since it needs to wait for n events before detecting an anomaly.
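
As a small illustration (a sketch only; the class name and API are invented), Δt for any n can be computed by keeping the timestamps of the last n events:

import java.util.ArrayDeque;
import java.util.Deque;

public class DeltaT {
  private final Deque<Long> lastEvents = new ArrayDeque<Long>();
  private final int n;

  public DeltaT(int n) { this.n = n; }

  // Returns Δt = t - t-n in milliseconds, or -1 until n previous events are known.
  public long onEvent(long timestampMillis) {
    long delta = lastEvents.size() >= n ? timestampMillis - lastEvents.peekFirst() : -1;
    lastEvents.addLast(timestampMillis);
    if (lastEvents.size() > n) lastEvents.removeFirst();
    return delta;
  }
}

For example, new DeltaT(5) returns, for each incoming event, the time elapsed since the 5th previous event.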

delta time arrival

Download the data set recording event timestamps in milliseconds, together with the precomputed Δt for n=1, n=5 and n=10.

  • The n=1 histogram doesn’t show any tail. Based on the distribution alone, values beyond the observed minimum and maximum look impossible. But, as we don’t know the true nature of the data, we will still apply min and max thresholds as a precaution.
  • Interestingly, the n=5 and n=10 histograms do show a tail. They are neither Poisson distributed nor normally distributed, but extremely low or high values can be considered anomalies.

delta time histogram n=1

delta time histogram n=5

delta time histogram n=10

Using the 0.001 and 0.999 percentiles:

  • for n=5, the percentiles are min=332 and max=1496
  • for n=10, the percentiles are min=354 and max=1304

Please note that in this example data set the first 1500 events are normal, before an anomaly occurs.

v = read.csv('delta-record.csv', sep=';', stringsAsFactors=F);
time = c(as.numeric(v[, 'time']));
n1 = c(as.numeric(v[, 'n1']));
n5 = c(as.numeric(v[, 'n5']));
n10 = c(as.numeric(v[, 'n10']));

qn1 = quantile(head(n1,1500), probs=c(0.001, 0.999), na.rm=T)
qn5 = quantile(head(n5,1495), probs=c(0.001, 0.999), na.rm=T)
qn10 = quantile(head(n10,1490), probs=c(0.001, 0.999), na.rm=T)

Plot the Δt data and the thresholds for different values of n:

color = c("black", "#26c2b7", "#e05d38");
plot(x= tail(time, 150), y = tail(n1, 150), type = "l", ylim=c(0,2100), lwd=3, xlab = "time", ylab = "Δt");
lines(x= tail(time, 150), tail(n5, 150),col=color[2], lwd=3);
lines(x= tail(time, 150), tail(n10, 150),col=color[3], lwd=3);

abline(h=qn1, col=color[1], lwd=2);
abline(h=qn5, col=color[2], lwd=2);
abline(h=qn10, col=color[3], lwd=2);

legend("topright", col=color, lwd=2, legend=c("n=1","n=5","n=10"));

delta time graph

Note: The aggregated values were divided by n so that all curves appear on the same graph. The logic stays unchanged.

A small anomalous rate change occurs at the end of the graph:

  • As discussed before, n=1 is very reactive, but it fails to detect the anomaly because every value stays above its minimum threshold.
  • At the end of the graph, n=5 drops below its lower threshold. It needs more delay than n=1, but at least it detects the anomalous change in the event rate.
  • n=10 requires even more time to detect the anomaly. It isn’t useful in this example, because it is far too slow. However, it could have detected a change in rate that n=5 wouldn’t have found.

As already explained, n=1 provides the fastest detection, but fails to detect smaller unusual rate variations. The bigger n is, the more anomalies it will detect by smoothing out the variations, unfortunately at the price of speed.

Including Event Rate λ

Usually, some change in rate will be noticed during monitoring. The event rate might even change a lot between day and night. Previously, we ignored such seasonality when calculating the thresholds, but as it happens, it is easy to integrate the event rate λ and normalize the result. It is as simple as multiplying the event rate λ by the events’ time difference Δt. The result is σ = λ * Δt. The value σ is just a normalized version of Δt, so we can apply the percentile technique exactly as we did before.
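
For example, using the same rates as the prototype below: at night the predicted rate is λ = 1 event per second, a typical gap is Δt = 1000 ms, and σ = 1 × 1000 = 1000; during the day λ = 10 and a typical gap of 100 ms gives σ = 10 × 100 = 1000 as well. Because σ is normalized by the expected rate, a single pair of percentile thresholds can be used day and night.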

The main difficulty is to have a correct estimate of λ. Based on the metric history, it might be possible to pick the rate observed exactly one week earlier; this can work perfectly in some cases. In complex cases, picking λ might be hard, but that is beyond the scope of this article. A minimal sketch of such a lookup follows.
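
As a rough sketch only (the per-minute bucketing, the in-memory storage and the fallback value are assumptions, not part of this article’s prototype), a predictor returning last week’s rate could implement the RateHistory interface shown further down:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical predictor: returns the event rate observed at the same minute one week ago.
public class WeekAgoRate implements RateHistory {
  private final Map<Long, Integer> eventsPerMinute = new HashMap<Long, Integer>();

  // Call this for every incoming event to build up the per-minute history.
  public void record(long timestampMillis) {
    long minute = timestampMillis / 60000;
    eventsPerMinute.merge(minute, 1, Integer::sum);
  }

  @Override
  public int predictCurrent() {
    long weekAgoMinute = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) / 60000;
    // Fall back to a rate of 1 if no history is available for that minute.
    // Note: the unit must match whatever unit the percentile thresholds were computed with.
    return eventsPerMinute.getOrDefault(weekAgoMinute, 1);
  }
}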

As explained in the chart, for each incoming event t:

  1. Select the event rate λ as it was one week ago
  2. Calculate the time difference Δt between event t and event t-n
  3. Multiply Δt by λ to find σ
  4. If σ is lower than the 0.001 percentile or higher than the 0.999 percentile, the event is an anomaly!

delta time flow detection

Implementation

Previously we learned how Δt anomaly detection works. Let’s now build a simple prototype in Java 8 using Joda-Time and CircularFifoQueue from Apache Commons Collections 4. CircularFifoQueue is a “first in, first out” queue that retains only the last N elements inserted: very useful for storing our last n events.

As explained above, when an event occurs, we check if the rate stays between the min and max percentile. But how do we detect a total cessation, when no events occur at all? The trick is to check the queue every 10 ms by introducing a virtual event with the current date. This virtual event doesn’t get stored in the queue; it is only used to check every 10 ms if the rate goes too low. In the following implementation, this is the autoCheck.

The Detection class takes three arguments:

  • List<Rule>: the min and max percentiles for each value of n
  • RateHistory: predicts the current rate λ
  • autoCheck: when enabled, checks every 10 ms whether the rate has dropped too low

The heart of the code is the small detect function, which calculates σ from Δt and λ and checks it against the percentile thresholds.

//Detection.java
import java.util.List;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.joda.time.Interval;

public class Detection {
  private CircularFifoQueue<Date> queue;
  private List<Rule> rules;
  private RateHistory rateHistory;

  public Detection(List<Rule> rules, RateHistory rateHistory, boolean autoCheck){
    this.rules = rules;
    this.rateHistory = rateHistory;
    // The queue only needs to keep as many timestamps as the largest n
    // (rules are assumed to be sorted by increasing n)
    queue = new CircularFifoQueue<Date>(rules.get(rules.size()-1).n);
    if(autoCheck){
      autoCheckEveryMillis(10);
    }
  }

  private void autoCheckEveryMillis(int millis) {
    Detection me = this;
    new Timer().schedule(new TimerTask() {
      public void run()  {
        me.checkAllNoEvent();
      }
    }, 0, millis);
  }

  public void newEvent(){
    queue.add(new Date());
    checkAllEvent();
  }

  public void checkAllEvent(){
    for(Rule rule : this.rules){
      if(queue.size() >= rule.n){
        checkEvent(rule);
      }
    }
  }

  public void checkAllNoEvent(){
    for(Rule rule : this.rules){
      if(queue.size() >= rule.n){
        checkNoEvent(rule);
      }
    }
  }

  private void checkEvent(Rule rule){
    Date last = queue.get(queue.size()-1);
    Date previous = queue.get(queue.size()-rule.n);
    detect(rule, last, previous);
  }

  public void checkNoEvent(Rule rule){
    // "Virtual event": use the current time without storing it in the queue
    Date last = new Date();
    Date previous = queue.get(queue.size()-rule.n+1);
    detect(rule, last, previous);
  }

  private void detect(Rule rule, Date last, Date previous){
    float deltaT = delta(previous, last);
    // sigma = lambda * deltaT; a small sigma means events arrive faster than predicted
    float sigma = this.rateHistory.predictCurrent() * deltaT;
    if(rule.minPercentile > sigma){
      System.out.println("ANOMALY - rate too high - found by n="+rule.n);
    }else if(sigma > rule.maxPercentile){
      System.out.println("ANOMALY - rate too low - found by n="+rule.n);
    }
  }

  private long delta(Date from, Date to){
    Interval interval = new Interval(from.getTime(), to.getTime());
    return interval.toDurationMillis();
  }
}

//Rule.java
import java.util.Date;

public class Rule {
  public Date date = new Date();
  public int n;
  public float minPercentile;
  public float maxPercentile;

  public Rule(int n, float min, float max){
    this.n = n;
    this.minPercentile = min;
    this.maxPercentile = max;
  }
}

//FakeRate.java
public class FakeRate implements RateHistory {
  private boolean isNight = false;

  @Override
  public int predictCurrent() {
    // Predicted event rate: 10 events per second during the day, 1 at night
    if(isNight){
      return 1;
    }else{
      return 10;
    }
  }

  public void setInNightTime(){
    isNight = true;
  }

  public void setInDayTime(){
    isNight = false;
  }

}

//RateHistory.java
public interface RateHistory {
  public int predictCurrent();
}

Let’s try this code with two precalculated rules:

  1. n=5, with min and max percentiles of 3650 and 4750
  2. n=10, with min and max percentiles of 8100 and 9800

For this example, we created a FakeRate class. It predicts a rate of λ=10 events per second during daytime and λ=1 event per second during nighttime.

//Main.java
import java.util.ArrayList;
import java.util.List;

public class Main {
  private void sendEvent(Detection detection, int rate) 
      throws InterruptedException {
    for(int i=0; i<10; i++){
      Thread.sleep(rate);
      detection.newEvent();
    }
  }

  public void run() throws InterruptedException {
    List<Rule> rules = new ArrayList<Rule>();
    rules.add(new Rule(5, 3650, 4750));
    rules.add(new Rule(10, 8100, 9800));
    FakeRate rate = new FakeRate();
    Detection detect = new Detection(rules, rate, false);

    System.out.println("NIGHT TIME");
    rate.setInNightTime();
    sendEvent(detect, 1000);
    System.out.println("* set anomaly too fast...");
    sendEvent(detect, 850);
    System.out.println("* back to normal");
    sendEvent(detect, 1000);
    System.out.println("* set anomaly too slow...");
    sendEvent(detect, 1100);
    System.out.println("\n\rDAY TIME");
    rate.setInDayTime();
    System.out.println("* back to normal");
    sendEvent(detect, 100);
    System.out.println("* set anomaly too fast...");
    sendEvent(detect, 75);

    // Check detection when no events arrive at all
    System.out.println("\n\rWITH AUTOCHECK");
    Detection detect2 = new Detection(rules, rate, true);
    sendEvent(detect2, 1000);
    System.out.println("* set anomaly: events stop...");
    Thread.sleep(110000);
  }

  public static void main(String[] args) throws Exception {
    new Main().run();
  }
}

First anomaly:

  • n=5 detects the anomaly quite fast
  • n=10 requires 4 extra events to detect it

Second anomaly:

  • The change in rate is too small for n=5 to detect
  • n=10 detects the problem after 12 (small) anomalous events

When transitioning to “daytime,” many anomalies are detected. This is because the expected rate becomes 10x faster while the events already in the queue arrived at the night-time pace. In reality, the transition would be smoother, so it shouldn’t trigger any alert.

Third anomaly:

  • n=5 detects the anomaly quickly
  • n=10 requires 2 extra events to detect it

In the last part, we create an anomaly by not sending any events at all. Since autoCheck is enabled (as it always should be), it detects the abnormally low rate within 10 ms of σ crossing the threshold.

Conclusion

As discussed in detail when building an anomaly detection using the Poisson distribution, focusing on the event rate will detect anomalies very quickly. Under high rates, a few milliseconds can be enough to spot an anomaly.

Anomaly detection studies outliers: it doesn’t matter what the distribution is as long as outliers can be detected. When plotting the distribution of Δt, a tail or two might become visible. The upper tail (unusually large Δt) can be used to detect a rate that is too low, and the lower tail a rate that is too high. If no tail is present, choose higher values of n. Still no tail in sight? Maybe you should consider other anomaly detection techniques…

The key to the success of this technique is to choose the rate predictor λ very carefully. As every calculation depends on the λ prediction, picking the wrong value will result in many false positives.

The post Delta Time (Δt) for Anomaly Detection appeared first on Anomaly.

]]>
https://anomaly.io/anomaly-detection-delta-time/feed/ 3