Salesforce Anomaly Detection Using Anomaly.io

detect anomalies in salesforce

Monitoring Key Performance Indicators (KPIs) is essential to running a successful business. As one example, you should examine your lead generation KPI several times a day, to allow you to detect and correct problems as quickly as possible.

But you’re busy; you don’t have time to watch KPI indicators all day long. That’s where Anomaly.io comes in. By combining our detection algorithms with your Salesforce data, you can automatically detect problems and notify the appropriate personnel to ensure that speedy corrective action is taken.

From Salesforce to InfluxDB (Anomaly.io Protocol)

Let’s see how we can forward Salesforce data to Anomaly.io for automatic anomaly detection.

First, install the StreamSets Data Collector, which can easily subscribe to and stream Salesforce events. Next, log into your fresh StreamSets install with the default login/password combination “admin/admin”.
Now install the packages “Salesforce Library” and “InfluxDB 0.9+” from the Package Manager:

StreamSets package manager

Create a new Pipeline. Click the “Stage Library” (top right icon), select the origin “Salesforce”, then the processor “Groovy Evaluator”, and finally, the destination “InfluxDB”.

salesforce origin

Now link the nodes together:

Anomaly Flow Salesforce InfluxDB

Salesforce Node

This node collects historical “Opportunities” and subscribes to upcoming ones through the Salesforce Streaming API.

Enter this data on the Salesforce tab:

  • Username: salesforceUserName
  • Password: MyPassword+securityToken
    e.g: MyPassword1FPeZD9z[…]
    Reset your salesforce security token
  • API version: 36.0
  • Query Existing Data: Yes
  • Subscribe for Notifications: Yes

IMPORTANT: Be sure to enter “36.0” for the API, as currently this is the only compatible version.

On the Query tab:

  • Use Bulk API: Yes
  • SOQL Query:
    SELECT Id, AccountId, CreatedDate, CloseDate, Amount
    FROM Opportunity WHERE Id > '${OFFSET}' ORDER BY Id

And finally, on the Subscribe tab:

  • Push Topic: OpportunityCreate

Groovy Node

This node transforms Salesforce output into InfluxDB input. You need to do the following:

  • Set the required AccountId, if it is missing.
  • Translate Salesforce time into InfluxDB time.
  • Set a “measurement” to save the events into.

Example code for the Groovy tab:

for (record in records) {
 try {
  record.value['measurement'] = 'opportunity'
  Date date = Date.parse("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", record.value['CreatedDate'])
  record.value['CreatedDate'] = date.getTime()
  if(record.value['AccountId'] == null){
   record.value['AccountId'] = "unknown"
  }
  output.write(record)
 } catch (e) {
  log.error(e.toString(), e)
  error.write(record, e.toString())
 }
}

InfluxDB Node

This node sends the data to Anomaly.io, which uses the InfluxDB protocol, for automatic anomaly detection.

Enter the following on the InfluxDB tab:

  • URL: http://events.anomaly.io:8086
  • User: yourAnomalyUser
  • Password: yourAnomalyPasswd
  • Database Name: salesforce
  • Auto-create Database: Yes
  • Record Mapping: Custom Mappings
  • Measurement Field: /measurement
  • Tag Fields: /AccountId
  • Value Fields: /Amount /CreatedDate /CloseDate

Register Salesforce PushTopic

In our StreamSets configuration, the Salesforce origin will request all historical data, then subscribe to a channel we called “OpportunityCreate”. To set this up, we need to create this PushTopic in Salesforce:

  1. Login to Salesforce
  2. Open the Developer Console.
  3. Select Debug | Open Execute Anonymous Window.
  4. In the Enter Apex Code window, paste in the following Apex code, then click Execute.

PushTopic pushTopic = new PushTopic();
pushTopic.Name = 'OpportunityCreate';
pushTopic.Query = 'SELECT Id, AccountId, CreatedDate, CloseDate, Amount FROM Opportunity';
pushTopic.ApiVersion = 36.0;
pushTopic.NotifyForOperationCreate = true;
pushTopic.NotifyForOperationUpdate = false;
pushTopic.NotifyForOperationUndelete = false;
pushTopic.NotifyForOperationDelete = false;
pushTopic.NotifyForFields = 'Referenced';
insert pushTopic;

Run Streaming from Salesforce to InfluxDB (or Anomaly.io)

Click outside any node, then go to Configuration > Error Records and select “(Library: Basic)”.  Now we can finally run the pipeline:

Datasets run Anomaly Detection

You should see something like the following. Note that here a total of 32 “Opportunities” were processed, and StreamSets is waiting for new events to come in:

anomaly pipe

Salesforce historical data has now been imported into InfluxDB. What about future “Opportunities”? Let’s check to make sure the Salesforce Streaming API works:

  1.  Login to workbench.developerforce.com
  2. Go to Data > Insert | Object Type “Opportunity” | Next
  3. Enter:
    AccountId: 0010Y00000DX4EgQAL
    Amount: 100
    CloseDate: 2019-05-27T10:43:35.000Z
    Name: My Opportunity
    StageName: Open

Data should be processed within one second and then sent to InfluxDB or Anomaly.io.

Detect Anomalies in Salesforce

Using Anomaly.io

You can spend a lot of money on R&D, or you can save that money and use anomaly.io to detect exceptions and issues in your data. After a week of intense machine learning, our algorithm will pick the best anomaly detection method for your data. Our system can automatically apply techniques such as k-means clustering, anomaly correlation, breakout detection and trend decomposition in real time, along with others.

Simply forward your data to anomaly.io; we’ll keep an eye on it.

Using Custom Anomaly Detection

Anomaly detection is complex, and especially difficult when you need to work with unknown problems. Your users and business will evolve, so you can’t rely on just setting simple thresholds.

Before you start, always check to make sure that your data is compatible with the algorithms you are using. For example, a common error is to use an algorithm intended for normally distributed time series when your data isn’t normally distributed.
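
For instance, a quick sanity check in R before relying on a normal-distribution-based detector might look like the following minimal sketch (the metric vector below is simulated for illustration, not Salesforce data):

# sketch: check whether a monitored series looks roughly normally distributed
set.seed(42)
metric = rnorm(500, mean = 100, sd = 10)  # replace with your own series

hist(metric, breaks = 30)   # visual check: is the histogram roughly bell-shaped?
shapiro.test(metric)        # formal check: a large p-value is consistent with normality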

You can also forward your Salesforce data into an open source solution such as Twitter Anomaly Detection, Twitter Breakout, or Skyline, or develop your own.

Detecting Anomalies in Correlated Time Series

anomaly in time series correlation

Monitoring key performance indicators (KPIs), sales or any other product data means working within an ecosystem where very often you will see metrics correlating with each other. When a normal correlation between two metrics is broken, we have reason to suspect something strange is happening.

As an example, take a look at anomaly.io analytics during its early days (a long time ago). In the graphic above, the new users are shown in green and the returning users in red. Clearly, something strange happens in the middle of November. Let’s use some techniques to find out more!

Step 1: Collect the Data

With Google Analytics it’s fairly easy to download data on new and returning visitors. Here at Anomaly.io we detect problems in real time, so we prefer streaming our Google Analytics in real time. But for the sake of simplicity, we will use a plain CSV file for this exercise.

You can download your own Google Analytics data, or use my sample monitoring data file if you wish: new-vs-returning-visitor.csv.

google-analytics-correlation

#R language
install.packages("dplyr")
library(dplyr)

#load data
csv = read.csv("/path/to/new-vs-returning-visitor.csv")

#load new visitor
new = filter(csv,Segment=="New Users")
new_data = new[4]$Sessions
new_date = as.Date(lapply(new[1], as.character)$Day_Index, "%m/%d/%y")
new_df = data.frame(date=new_date, value=new_data)

#load returning visitor
return = filter(csv,Segment=="Returning Users")
return_data = return[4]$Sessions
return_date = as.Date(lapply(return[1], as.character)$Day_Index, "%m/%d/%y")
return_df = data.frame(date=return_date, value=return_data)

#plot
plot_df = function(new_df, return_df = NULL, ylim=c()) {
	plot(value ~ date, new_df, xaxt="n", type="l", col="#00c7b0", ylim=ylim, lwd=2)
	date_axis = new_df$date[1:(length(new_df$date)/30.5)*30.5]+2
	axis(1, date_axis, format(date_axis, "%b"), cex.axis = .7)
	if(!is.null(return_df)) {
		lines(value ~ date, return_df, xaxt="n", type="l", col="#ff0816", lwd=2)
	}
}
plot_df(new_df, return_df)
legend("topleft", c("new visitor","returning visitor"),
      col=c("#00c7b0","#ff0816"), lty=c(1,1))

anomaly correlation R

Now you have two variables in your R environment:

  • new_df: new visitors
  • return_df: returning visitors

Step 2: Verify the Cross-Correlation

Our analysis is based on the time series being correlated, so before going any further, let’s ensure that this is the case. To do so, we check using cross-correlation. Check this article if you want to understand the cross-correlation algorithm.

BONUS: Find correlated metrics in multiple time series

library(stats)
library(forecast)
ma_new_data = ma(new_data, 3)
ma_new_data = ma_new_data[!is.na(ma_new_data)]
ma_return_data = ma(return_data, 3)
ma_return_data = ma_return_data[!is.na(ma_return_data)]

corr = ccf(ma_new_data, ma_return_data, lag.max = 0)
corr # display 0.876

As the moving average is robust to anomalies, we use it to remove potential outliers before computing the correlation. The cross-correlation function (CCF) returns a very high value of 0.876. Clearly, the time series are correlated.

Step 3: Subtract the Time Series

We can compute that there are 1.736 times more new visitors than returning visitors. Let’s align the metrics at the same level.

multi = sum(ma_new_data/ma_return_data)/length(ma_return_data)
multi #display 1.736529
align_return_df = return_df
align_return_df$value = align_return_df$value*multi
plot_df(new_df, align_return_df)
legend("topleft", c("new visitor","returning visitor x1.736"), 
     col=c("#00c7b0","#ff0816"), lty=c(1,1))

align correlation

substract = return_df
substract$value = new_df$value - return_df$value*multi
plot_df(substract)

subtract

Step 4: Find Outliers in Correlated Time Series

The resulting signal looks like “noise”. Plotting the histogram shows a normally distributed signal. We already know how to detect anomalies in a normally distributed time series using the 3-sigma rule.

hist(substract$value, breaks = 50)

histogram normal distribution

The 3-sigma rule states that nearly all values lie within three standard deviations of the mean. Everything outside this range can be treated as anomalous.

#compute 3 sigma minimal & maximal
min = mean(substract$value, na.rm = T) - 3*sd(substract$value, na.rm = T)
max = mean(substract$value, na.rm = T) + 3*sd(substract$value, na.rm = T)

# plot metric & limit
plot_df(substract, NULL, c(-120,320))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

# plot circles around the anomalies
position = data.frame(id=seq(1, length(substract$value)), value=substract$value, date=substract$date)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id), value=c(anomalyH$value, anomalyL$value), date=c(anomalyH$date, anomalyL$date))
anomaly = anomaly[!is.na(anomaly$value), ]
points(x = anomaly$date, y=anomaly$value, col="#e15f3f")

correlation anomaly detected

Conclusion

Looking at the initial retention graph, I was able to spot 1 or 2 anomalies, when in reality I had 7. I actually remember back in mid-November when I received the automatic anomaly report. I started digging into the problem and found out that someone had shared my Twitter anomaly detection review on Hacker News.

Detecting Correlation Among Multiple Time Series

detect-correlation

To determine the level of correlation between various metrics we often use the normalized cross-correlation formula.

Definition: Normalized Cross-Correlation

Normalized cross-correlation is calculated using the formula:

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

We recommend first understanding normalized cross correlation before using it, but any statistical language, such as R, can easily compute it for you.
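
As a sketch of what the formula computes, here is a direct R implementation (the vectors below are illustrative only). Note that R’s built-in ccf function additionally centers the series around their means, so its lag-0 value is the Pearson correlation rather than this raw ratio.

# direct implementation of the normalized cross-correlation formula above
norm_corr = function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

x = c(1, 2, -2, 4, 2, 3, 1)
y = x + rnorm(length(x), sd = 0.3)
norm_corr(x, y)  # close to 1 for strongly correlated signals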

Correlations between 2 metrics

In the following graph, the two metrics show some correlation between each other.

correlation

# plot the graph in R
set.seed(15)
a = c(1,2,-2,4,2,3,1,0,3,4,2,3,1)
b = a + rnorm(length(a), sd = 0.4)
plot(ts(b), col="#f44e2e", lwd=3)
lines(a, col="#27ccc0", lwd=3)

Using R to compute the normalized cross-correlation is as easy as calling the function ccf (Cross-Correlation Function). By default, ccf plots the correlation between the two metrics at different time shifts. Time shifting simply moves one of the compared metrics forward or backward in time, which is useful for detecting when one metric precedes or succeeds another.

# compute using the R language
corr = ccf(a,b)
corr

correlation-level

Lag:   -8     -7     -6     -5     -4     -3     -2     -1      0      1      2      3      4      5      6      7      8
Corr:  -0.011 -0.146  0.061  0.266 -0.025 -0.246  0.018 -0.293  0.979 -0.291  0.098 -0.320 -0.043  0.288  0.037 -0.183  0.083

The last R command displays the correlation between the metrics at various time shift values. As expected, the metrics are highly correlated at time shift 0 (no time shift) with a value of 0.979.

Cluster Correlated Metrics Together

We can also use the ccf function to cluster similar metrics together based on how similar they are. To demonstrate this better, we will cluster metrics from a real data set of 45 graphs called “graphs45.csv”.

First, we need to compute the correlation level between every possible pair of graphs. This is what the “correlationTable” function does. (To reproduce this example you must download the data set graphs45.csv.)

correlationTable = function(graphs) {
  cross = matrix(nrow = length(graphs), ncol = length(graphs))
  for(graph1Id in 1:length(graphs)){
    graph1 = graphs[[graph1Id]]
    print(graph1Id)
    for(graph2Id in 1:length(graphs)) {
      graph2 = graphs[[graph2Id]]
      if(graph1Id == graph2Id){
        break;
      } else {
        correlation = ccf(graph1, graph2, lag.max = 0)
        cross[graph1Id, graph2Id] = correlation$acf[1]
      }
    }
  }
  cross
}

graphs = read.csv("graphs45.csv")
corr = correlationTable(graphs)

It took around 20 seconds to compute all the correlation possibilities between every pair of graphs. The array corr now contains the correlation table; for example, corr[4,3] gives a correlation level of 0.990 between graph4 and graph3. Such a high correlation level indicates a strong correlation between the graphs. To find metrics with sufficiently high correlation, we choose a minimum correlation level of 0.90.

Let’s find and plot all the metrics that strongly correlate with graph4:

findCorrelated = function(orig, highCorr){
 match = highCorr[highCorr[,1] == orig | highCorr[,2] == orig,]
 match = as.vector(match)
 match[match != orig]
}

highCorr = which(corr > 0.90 , arr.ind = TRUE)
match = findCorrelated(4, highCorr)
match # print 6 12 23 42 44 45  3

Success! Graph4 highly correlates with graphs 6, 12, 23, 42, 44, 45 and 3.

Let’s now plot all the graphs together:

correlate-graphs-cluster

bound = function(graphs, orign, match) {
  graphOrign = graphs[[orign]]
  graphMatch = graphs[match]
  allValue = c(graphOrign)
  for(m in graphMatch){
    allValue = c(allValue, m)
  }
  c(min(allValue), max(allValue))
}

plotSimilar = function(graphs, orign, match){
  lim = bound(graphs, orign, match)

  graphOrign = graphs[[orign]]
  plot(ts(graphOrign), ylim=lim, xlim=c(1,length(graphOrign)+25), lwd=3)
  title(paste("Similar to", orign, "(black bold)"))

  cols = c()
  names = c()
  for(i in 1:length(match)) {
    m = match[[i]]
    matchGraph = graphs[[m]]
    lines(x = 1:length(matchGraph), y=matchGraph, col=i)

    cols = c(cols, i)
    names = c(names, paste0(m))
  }
  legend("topright", names, col = cols, lty=c(1,1))
}

plotSimilar(graphs, 4, match)

Conclusion

Here at anomaly.io, finding cross-correlation is one of the first steps in detecting unusual patterns in your data. Subtracting two correlated metrics should result in an almost flat signal. If suddenly the flat signal (or the gap between the curves) hits a certain level, you can trigger an anomaly. Of course this is an oversimplification and the reality is much more complex, but it’s a good foundation to work from.
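
A minimal sketch of that idea in R (all data and names below are simulated for illustration): learn the normal gap between the two scaled metrics on known-good history, then flag points where the gap leaves a 3-sigma band.

# sketch: alert when the gap between two correlated metrics widens
set.seed(1)
metricA = 100 + rnorm(300)
metricB = 2 * metricA + rnorm(300)              # a metric correlated with metricA
ratio   = mean(metricA[1:200] / metricB[1:200]) # learned on "normal" history
gap     = metricA - metricB * ratio             # should stay almost flat

min_gap = mean(gap[1:200]) - 3 * sd(gap[1:200])
max_gap = mean(gap[1:200]) + 3 * sd(gap[1:200])
which(gap < min_gap | gap > max_gap)            # indices to flag as anomalous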

Change Point Detection with Seasonal Time Series

change-point-seasonality

Previously, we looked at using Twitter Breakout (EDM) to detect anomalies. As with the popular E-Divisive, EDM detects mean shifts and changes in distribution. Both algorithms work with seasonal time series, but perform even better without seasonality.

Change Point Doesn’t Work (Well) with Seasonality

To demonstrate the “weakness” of change point, let’s generate some fake seasonal time series. Then we will try to detect anomalies using two different change point detection algorithms: EDM and E-Divisive.

# install E-Divisive and EDM
install.packages("Rcpp")
install.packages("ecp")
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(Rcpp)
library(ecp)
library(BreakoutDetection)

createDay = function(noise=0) {
 point=seq(0, pi, by=0.02)
 connection=sin(point)
 noise = rnorm(length(point), sd = noise)
 return(connection+noise)
}

createDays = function(totalDays, noise=0) {
 allDays = c()
 for (day in 1:totalDays ) {
 allDays = c(allDays, createDay(noise))
 }
 return(allDays)
}

set.seed(1234)
p1 = createDays(3, 0.2)
anomaly = createDays(1, 0)*2 + rnorm(158, sd = 0.07)
days = c(p1, anomaly)
plot(as.ts(days))

# EDM - fail
res = breakout(days, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - fail
ediv = e.divisive(as.matrix(days), min.size=158, alpha=1)
plot(as.ts(days))
abline(v=ediv$estimates,col="blue")

fail-change-point-EDM fail-change-point-E-Divisive

As we can see, due to the seasonality of the time series, traditional change point detection doesn’t work very well.

Removing the Seasonality

To use change point detection effectively, we need to remove the seasonality from our time series. And to do that, we need to know the period of the seasonality. In this case, we know the seasonality to be 158 data points per day. If we don’t know, it’s possible to calculate the seasonality using a Fourier Transform. We also need to know if the time series is multiplicative or additive. In our example, I have an additive time series, but most of the time it is multiplicative. With this information, we can now decompose to remove the seasonality. Finally, we can run the change point detection again to get a successful result.

# remove the seasonality estimated from the full series
decomposed_days = decompose(ts(days, frequency = 158), "additive")
noSeasonal = days - decomposed_days$seasonal

# alternative: estimate the seasonality from the first three (normal) days only,
# then repeat that daily pattern over all four days
decomposed_days = decompose(ts(p1, frequency = 158), "additive")
noSeasonal = days - rep(decomposed_days$seasonal[0:158],4)

# EDM - success
res = breakout(noSeasonal, min.size=158, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

# E-Divisive - success
ediv = e.divisive(as.matrix(noSeasonal), min.size=158, alpha=1)
plot(as.ts(noSeasonal))
abline(v=ediv$estimates,col="blue")

success-EDM success-E-Divisive

With the seasonality removed, breakout EDM and E-Divisive work a little bit better.

Anomaly Detection with Twitter Breakout

AnomalyDetectionBreakout

We previously tested the Twitter Anomaly Detection package using the R language. Now let’s take a look at Twitter Breakout Detection.

What is Twitter Breakout Detection?

This Twitter package is intended to detect changes in time series. It is described as E-Divisive with Medians (EDM). It is supposed to:

  • Detect divergence (mean shift and ramp up)
  • Detect changes in distribution
  • Run about 3.5× faster than other breakout detection methods
  • Be robust in the presence of anomalies

To detect divergence the algorithm uses a clustering technique (mean shift clustering). As EDM is non-parametric, the data doesn’t need to follow any specific distribution; it can adapt to the current distribution. As a result, it can be used to detect when the distribution changes.

We previously explained how moving medians are robust to anomalies. The 3.5× speed improvement compared to E-Divisive is in part due to the use of interval trees to approximate the median very efficiently. However, this isn’t the only reason for the speed increase.
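
As a small R illustration of that robustness (not part of the original benchmark), a running median shrugs off an isolated spike while a moving average does not:

# running median vs moving average on a series with a single spike
library(forecast)                       # for ma()
x   = c(rep(10, 20), 100, rep(10, 20))  # one outlier in the middle
med = runmed(x, 5)                      # running median, window of 5
avg = ma(x, 5)                          # moving average, window of 5
max(med, na.rm = TRUE)                  # 10: the spike is ignored
max(avg, na.rm = TRUE)                  # ~28: the spike leaks into the average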

Let’s try it out ourselves.

Divergence Detection (Mean Shift, Ramp Up)

Mean Shift

A sudden jump in the time series is called a mean shift and represents the time series switching from one steady state to another. A good example (used by Twitter) is to imagine CPU utilization suddenly jumping from 40% to 60%. Let’s generate a fake time series to test the detection process.

# install BreakoutDetection
install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

set.seed(123)
p1 = rnorm(60, mean = 0, sd = .4)
p2 = rnorm(60, mean = 6, sd = .4)
p3 = rnorm(60, mean = 0, sd = .4)
p4 = rnorm(60, mean = 3, sd = .4)
p5 = rnorm(60, mean = 0, sd = .4)
p6 = rnorm(60, mean = 6, sd = .4)
p7 = rnorm(60, mean = 0, sd = .4)
all = c(p1,p2,p3,p4,p5,p6,p7)
plot(as.ts(all), col = "#27ccc0", lwd = 4)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

meanShift

Indeed, we were able to detect 7 mean divergences, also called breakouts.
A simple help(breakout) command tells us more about the package.

min.size: The minimum number of observations between change points.
method: Method must be one of either “amoc” (At Most One Change) or “multi” (Multiple Changes). For “amoc.” at most one change point location will be returned.
degree: The degree of the penalization polynomial. Degree can take the values 0, 1, and 2. The default value is 1.
beta: A real number constant used to further control the amount of penalization. This is the default form of penalization; it will be used if neither beta nor percent are supplied. The default value is beta=0.008.

Ramp Up

A ramp up is a slow transition from one steady state to another. An example would be CPU utilization slowly transitioning from 40% to 60% over time. Let’s detect this kind of divergence.

ramp = 40
set.seed(1234)
p1 = rnorm(100, mean = 0, sd = .4)
p2 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p3 = rnorm(100, mean = 6, sd = .4)
p4 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p5 = rnorm(100, mean = 0, sd = .4)
p6 = rnorm(ramp, mean = 6, sd = 1) * seq(0, 1, length.out = ramp)
p7 = rnorm(100, mean = 6, sd = .4)
p8 = rnorm(ramp, mean = 6, sd = 1) * seq(1, 0, length.out = ramp)
p9 = rnorm(100, mean = 0, sd = .4)
all = c(p1,p2,p3,p4, p5,p6,p7,p8,p9)
plot(as.ts(all), col = "#27ccc0", lwd = 2)

res = breakout(all, min.size=20, method='multi', beta=.001, degree=1, plot=TRUE)
res$plot

ramp-up

Even with the ramp up, breakout detection still finds 7 distinct states. It’s very powerful to be able to see the ramp up as a transition rather than as a breakout of its own.

Detect Changes in Distribution

EDM doesn’t make assumptions about the distribution of the time series. Instead, it learns the current distribution and uses it as a reference. When the distribution suddenly changes, EDM can detect the variation.

set.seed(123456) 
p1 = rnorm(1000, mean = 1)
p2 = rgamma(1000, shape = 1)
p3 = rpois(1000, lambda = 3)/2 - .5
p4 = runif(1000) + .5
p5 = rexp(1000)
p6 = rweibull(1000, shape = 1)

all = c(p1, p2, p3, p4, p5, p6)
plot(as.ts(all))
abline(v=seq(100,1000*6,100),col="blue")

res = breakout(all, min.size=200, method='multi', beta=0.00018, degree=1, plot=TRUE)
res$plot

distribution-change

In the example above we switch successively from the Normal distribution to the Gamma, Poisson, Uniform, Exponential and Weibull distributions. With some (difficult) fine tuning, EDM can detect the change from the Normal to the Gamma distribution and from the Uniform to the Exponential. Unfortunately, it doesn’t detect the change from the Poisson to the Uniform distribution or from the Exponential to the Weibull. We believe this is due to its robustness in the presence of anomalies: the running median removes outliers. Without anomalies, E-Divisive performs better. The blue line shows where E-Divisive reports a change in distribution.

#install package
install.packages("ecp")
library(ecp)

# ~ 30 min on 2.6Ghz CPU
ediv = e.divisive(as.matrix(all), min.size=200, alpha=1)
plot(as.ts(all))
abline(v=ediv$estimates,col="blue")

distribution-change-E-Divisive

3.5× Greater Speed Than Other Breakout Detection Methods

E-Divisive detects changes in distribution as soon as they occur, but is very slow compared to the EDM algorithm. One of the reasons why EDM is much faster is due to the use of interval trees to approximate the median.

Data points    E-Divisive    EDM
6000           29 min        24 s
600            7 s           1 s

Based on only two tests, we can see that EDM performs much faster than E-Divisive.

Robustness in the Presence of Anomalies

As explained earlier, EDM stands for E-Divisive with Medians. In a previous article, we showed how the running median is robust to anomalies. As a result, an anomaly in the time series isn’t detected as a mean shift.

set.seed(12345)
p = rnorm(20, mean = 1)
all = c(p, 25, p, 30, p, 10, p, 20, p, 70, p , 50, p) #Add Anomalies

res = breakout(all, min.size=40, method='multi', beta=0.001, degree=1, plot=TRUE)
res$plot

anomaly-not-detected

Anomalies aren’t detected as “breakouts” or mean shifts. EDM is, indeed, robust to anomalies.

Detecting Anomalies with Moving Median Decomposition

anomaly-decomposition-median

Time series decomposition splits a time series into seasonal, trend and random residual time series. The trend and the random time series can both be used to detect anomalies. But detecting anomalies in an already anomalous time series isn’t easy.

TL;DR

When working on an anomalous time series:

  • Don’t extract the trend with a moving average: the anomalies get averaged into the trend.
  • Use a moving median instead; it is robust to anomalies and gives a much better decomposition.

The Problem with Moving Averages

In the blog entry on time series decomposition in R, we learned that the algorithm uses a moving average to extract the trends of time series. This is perfectly fine in time series without anomalies, but in the presence of outliers, the moving average is seriously affected, because the trend embeds the anomalies.

In this demonstration, we will first detect anomalies using decomposition with a moving average. We’ll see that this doesn’t work well, and so will try detecting anomalies using decomposition with a moving median to get better results.

About the data: webTraffic.csv reports the number of page views per day over a period of 103 weeks (almost 2 years). To make the data more interesting, we added some (extra) anomalies to it. Looking at the time series, we clearly see a seasonality of 7 days; there is less traffic on weekends.

To decompose a seasonal time series, the seasonality time period is needed. In our example, we know the seasonality to be 7 days. If unknown, it is possible to determine the seasonality of a time series. Last but not least, we need to know if the time series is additive or multiplicative. Our web traffic is multiplicative.

To sum up our web traffic:

  • Seasonality of 7 days (over 103 weeks)
  • Multiplicative time series
  • Download here: webTraffic.csv

set.seed(4)
data <- read.csv("webTraffic.csv", sep = ",", header = T)
days = as.numeric(data$Visite)
for (i in 1:45 ) {
 pos = floor(runif(1, 1, 50))
 days[i*15+pos] = days[i*15+pos]^1.2
}
days[510+pos] = 0
plot(as.ts(days))

webTraffic-anomalies

Moving Average Decomposition (Bad Results)

Before going any further, make sure to import the data. Also, I recommend being sure that you understand how time series decomposition works. The stats package provides the handy decompose function in R.

1 – Decomposition

Because the time series contains anomalies, the trend extracted during decomposition becomes completely wrong: the anomalies are averaged into the trend.

# install required library
install.packages("FBN")
library(FBN)

decomposed_days = decompose(ts(days, frequency = 7), "multiplicative")
plot(decomposed_days)

moving-average-decomposition

2 – Using Normal Distribution to Find Minimum and Maximum

We can apply the normal distribution to the random noise to detect anomalies. Values under or over 4 times the standard deviation can be considered outliers.

random = decomposed_days$random
min = mean(random, na.rm = T) - 4*sd(random, na.rm = T)
max = mean(random, na.rm = T) + 4*sd(random, na.rm = T)

plot(as.ts(as.vector(random)), ylim = c(-0.5,2.5))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-average-limits

3 – Plotting Anomalies

Let’s find the values either over or under 4 standard deviations and plot them as anomalies. As expected, the calculation misses many anomalies, due to the use of the moving average by the decomposition algorithm.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
anomaly = anomaly[!is.na(anomaly$value), ]

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-average-anomalies

Moving Median Decomposition (Good Results)

In our second approach, we will do the decomposition using a moving median instead of a moving average. This requires understanding how the moving median is robust to anomalies and how time series decomposition works. Also, again make sure to import the data. The moving median removes the anomalies without altering the time series too much.

1 – Applying the Moving Median

The decomposed trend with the moving median (see the graphic below) is a bit “raw” but expresses the trend better.

#install library
install.packages("forecast")
library(forecast)
library(stats)

trend = runmed(days, 7)
plot(as.ts(trend))

moving-median-trends

2 – Using the Normal Distribution to Find the Minimum and Maximum

As the decomposition formula expresses, removing the trend and seasonality from the original time series leaves random noise. As it should be normally distributed, we can apply the normal distribution to detect anomalies. As we saw previously, values under or over 4 times the standard deviation can be considered outliers. As the random time series still contains the anomalies, we need to estimate the standard deviation without taking the anomalies into account. Once again, we will use the moving median to exclude the outliers.

detrend = days / as.vector(trend)
m = t(matrix(data = detrend, nrow = 7))
seasonal = colMeans(m, na.rm = T)
random = days / (trend * seasonal)
rm_random = runmed(random[!is.na(random)], 3)

min = mean(rm_random, na.rm = T) - 4*sd(rm_random, na.rm = T)
max = mean(rm_random, na.rm = T) + 4*sd(rm_random, na.rm = T)
plot(as.ts(random))
abline(h=max, col="#e15f3f", lwd=2)
abline(h=min, col="#e15f3f", lwd=2)

moving-median-limits

3 – Plotting Anomalies

As before, values over or under 4 times the standard deviation are plotted as anomalous. This works much better; we detect almost every anomaly! Once again, this is because the moving median is robust to anomalies.

#find anomaly
position = data.frame(id=seq(1, length(random)), value=random)
anomalyH = position[position$value > max, ]
anomalyH = anomalyH[!is.na(anomalyH$value), ]
anomalyL = position[position$value < min, ]
anomalyL = anomalyL[!is.na(anomalyL$value), ]
anomaly = data.frame(id=c(anomalyH$id, anomalyL$id),
 value=c(anomalyH$value, anomalyL$value))
points(x = anomaly$id, y =anomaly$value, col="#e15f3f")

plot(as.ts(days))
real = data.frame(id=seq(1, length(days)), value=days)
realAnomaly = real[anomaly$id, ]
points(x = realAnomaly$id, y =realAnomaly$value, col="#e15f3f")

moving-median-anomalies

Here is an alternative plot displaying the same results: values outside the blue area can be considered anomalous.

high = (trend * seasonal) * max
low = (trend * seasonal) * min
plot(as.ts(days), ylim = c(0,200))
lines(as.ts(high))
lines(as.ts(low))

x = c(seq(1, length(high)), seq(length(high), 1))
y = c(high, rev(low))
length(x)
length(y)
polygon(x,y, density = 50, col='skyblue')

moving-median-limits2

Detecting Seasonality Using Fourier Transforms in R

detect seasonality

Our brains are really fast at recognizing patterns and forms: we can often find the seasonality of a signal in under a second. It is also possible to do this with mathematics, using the Fourier transform.

First, we will explain what a Fourier transform is. Next, we will find the seasonality of a website from its Google Analytics pageview report using the R language.

Fourier Transforms Explained

The Fourier transform decomposes a signal into all the possible frequencies that comprise it. Fine, but how does it really work? (And keep it simple, please?)

 

1 – Pick a Frequency

First, the Fourier transform starts with the smallest possible frequency. For a signal made of 100 points, the smallest possible frequency is 1/100 = 0.01 Hz. Think of a circle turning at a speed of 0.01 Hz, i.e. advancing one hundredth of a turn per point if the points are recorded every second. Just like a clock.

herz

1 circle turn = 0.01 s

 

2 – “Draw” the Signal Value on the Circle

Using the previously selected frequency we “draw” (decompose) the entire signal on the circle.

decompose

1 circle turn = 0.01 s

Notice: When the signal measurement is high, the clock arm of the circle is high.
The above signal will draw something like this:

draw

Signal decomposition at 0.01 Hz

 

3 – Compute the Periodogram

All the measurements are now on the circle in two dimensions; we call these “vectors”. Summing the vectors together gives the final “power” of the frequency: when all the vectors line up and point in the same direction, the sum is large, which means a high “power” for that frequency.

4 – Repeat with Different Frequencies

To complete the Fourier transform we repeat the process with different frequencies: 0.02 Hz, 0.03 Hz … We continue until all the “power” of each possible frequency has been computed. The frequencies with the highest power represent the greatest periodicity!
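
Before turning to the TSA package below, here is a rough sketch of the same idea with R’s built-in FFT, using a simulated signal with a weekly pattern (everything here is illustrative):

# sketch: find the dominant frequency of a signal with the FFT
set.seed(7)
n      = 700
signal = sin(2 * pi * (1:n) / 7) + rnorm(n, sd = 0.3)  # weekly pattern + noise

power = Mod(fft(signal))^2        # "power" of each frequency
freq  = (0:(n - 1)) / n           # frequency in cycles per data point
keep  = 2:(n / 2)                 # skip frequency 0 and the mirrored half
best  = keep[which.max(power[keep])]
1 / freq[best]                    # period in data points: ~7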

Detecting Seasonality (Example)

We just saw that the frequency with the highest “power” represents the primary seasonality of our underlying metric. In this example, William Cox draws at high speed the signal value on the circle for many different frequencies.

Here is how the original signal value appears:
graph

To learn more, I recommend listening to William Cox talk about Fourier transforms:

In this example, a frequency of 2.1 gets a very high “power” value. This is where all the data line up and create a big vector.

From the frequency formula:
T = 1 / f
T = 1 / 2.1 Hz = 476 ms

The signal repeats every 476 ms. This is its seasonality!

Detecting Seasonality using R

My personal tech blog clearly shows some weekly trends: It receives much less traffic during the weekend. As a result, my Google Analytics report shows some sort of weekly periodicity. Let’s try to find the seasonality using the R language.

# Install and import TSA package
install.packages("TSA")
library(TSA)

# read the Google Analytics PageView report
raw = read.csv("20131120-20151110-google-analytics.csv")

# compute the Fourier Transform
p = periodogram(raw$Visite)

periodogram

The periodogram shows the “power” of each possible frequency, and we can clearly see a spike at around frequency 0.14 Hz.

dd = data.frame(freq=p$freq, spec=p$spec)
order = dd[order(-dd$spec),]
top2 = head(order, 2)

# display the 2 highest "power" frequencies
top2

freq spec
0.142661180 167143.1
0.002743484 109146.8

Frequencies of 0.142661180 Hz and 0.002743484 Hz show seasonality!

# convert frequency to time periods
time = 1/top2$f
time

> time
[1] 7.009615 364.500000

The main seasonality detected is 7 days. A secondary seasonality of 364.5 days was also found. Exactly what we expected! My blog has weekly seasonality as well as annual seasonality.

Detecting Anomalies with Skyline

install skyline anomaly detection

Skyline is free and open source anomaly detection software. Simply plug Graphite metrics into Skyline to detect anomalous behaviour automatically without any configuration.

Skyline Architecture

Skyline is usually set up to use Graphite metrics. This is done by placing a daemon called “carbon-relay” in front of the usual Graphite stack. Carbon-relay will repeat the metrics to multiple hosts: one of them will be the traditional Graphite stack, and the other will be the Skyline stack.

skyline with graphite

Skyline is composed of several parts:

  • Horizon – Responsible for collecting, cleaning, and formatting incoming metrics before pushing to a Redis database.
  • Analyzer – Fetches metrics from Redis and runs mathematical equations to detect anomalies
  • Skyline-webapp – A Django webapp to display an anomaly graph when it occurs

In the above schema, CollectD is used to monitor and push server metrics to carbon-cache. Read more about Graphite architecture.

Install Graphite

Graphite Required

Skyline is commonly used with Graphite. That’s why you need to Install Graphite on CentOS.
You don’t have to run the full Graphite stack; running carbon-relay is enough. But, as carbon-relay is part of Graphite, you still have to install Graphite.

Configure Carbon-relay

Set the list of hosts carbon-relay needs to forward its metrics to.

sudo vim /opt/graphite/conf/relay-rules.conf

# change the file
[...]
destinations = 127.0.0.1:2004, 127.0.0.1:2024
[...]

With this configuration, carbon-relay will forward metrics to Skyline on port 2024 and to carbon-cache (Graphite) on port 2004. Both services run locally in this tutorial. Also, edit the carbon configuration:

sudo vim /opt/graphite/conf/carbon.conf

# change the file
[...]
DESTINATIONS = 127.0.0.1:2004, 127.0.0.1:2024
[...]

Start carbon-relay

Start carbon-relay using systemd:

sudo systemctl restart carbon-relay

Install Skyline

Required

Open a terminal and install a few tools. This includes an Apache server (which is probably already installed), the Redis database, and Python with few mathematical libraries and some compile tools:

sudo yum -y install httpd redis git
sudo yum -y install gcc gcc-c++ git pycairo mod_wsgi
sudo yum -y install python-pip python-devel blas-devel lapack-devel libffi-devel

Download and install Skyline

cd /opt
sudo git clone https://github.com/etsy/skyline.git
cd /opt/skyline

sudo pip install -U six
sudo pip install -r requirements.txt
sudo pip install numpy
sudo pip install scipy
sudo pip install pandas
sudo pip install patsy
sudo pip install statsmodels
sudo pip install msgpack-python

Some of the Python packages might take very long to compile. Be patient, maybe grab a coffee…

Configure Skyline

Don’t forget to create the required directory and configuration:

sudo cp /opt/skyline/src/settings.py.example /opt/skyline/src/settings.py
sudo mkdir /var/log/skyline
sudo mkdir /var/run/skyline
sudo mkdir /var/log/redis
sudo mkdir /var/dump/

Skyline requires some settings, so edit the file:

sudo vim /opt/skyline/src/settings.py

and replace the following with your own values:

  • GRAPHITE_HOST = 'YOUR_GRAPHITE_HOST'
  • HORIZON_IP = '0.0.0.0'
  • WEBAPP_IP = 'YOUR_SKYLINE_HOST_IP'

In my case I replace them with the same IP, as Skyline and Graphite run locally on the same host:

  • GRAPHITE_HOST = '192.168.50.6'
  • HORIZON_IP = '0.0.0.0'
  • WEBAPP_IP = '192.168.50.6'

Start Skyline

The Skyline stack is made up of a Redis database and three Python processes. Start all services:

cd /opt/skyline/bin

sudo redis-server redis.conf
sudo ./horizon.d start
sudo ./analyzer.d start
sudo ./webapp.d start

Access Skyline-WebApp

Open your browser at http://localhost:1500/
Warning! To access a remote IP such as http://remote_ip:1500/ you need to set rules into the CentOS default firewall. Or simply disable the firewall:

#disable firewall
sudo systemctl disable firewalld
sudo systemctl stop firewalld

Skyline-WebApp should be empty as no anomaly will have been detected at first. Don’t worry if some anomalies are listed. This will stabilize with time.

skyline screenshot

Send Metrics to Skyline and Graphite

Carbon-relay now forwards its data to Skyline and carbon-cache. Any metrics sent to carbon-relay should be available in Skyline and Graphite.

Install CollectD

To collect and push metrics to carbon-relay we like to use CollectD. If you haven’t installed CollectD yet, follow these easy instructions:

Configure CollectD

Carbon-relay listens on a different port than carbon-cache. Make sure you edit /opt/collectd/etc/collectd.conf to send the data to port 2013.

sudo vim /opt/collectd/etc/collectd.conf

# edit the file
[...]

   Host "localhost"
   Port "2013"
   Prefix "collectd."
   Protocol "tcp"

[...]

Start CollectD

If you installed the systemd script simply run:

# start with systemd
sudo systemctl restart collectd.service

# or manually
sudo /opt/collectd/sbin/collectd

Test Skyline Integration

Let your Linux idle for at least 10 minutes. Skyline will train itself to recognize low CPU activity as being normal. To create an anomaly, suddenly create some high CPU activity:

# wait at least 10 minutes before creating the anomaly
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &
timeout 40s dd if=/dev/zero of=/dev/null &

Open Skyline-webApp in your browser on port 1500. It should detect the anomaly within 10 seconds.

skyline anomaly detected

Test Graphite

Graphite should still work the same as before, since it receives its metrics through carbon-relay. Wait a few seconds and open Graphite-WebApp. Same as previously, you should see some new metrics in the left panel. Open one of them to render it.
Warning! By default, Graphite-WebApp renders a 24-hour graph. To see some data points, zoom in or wait a bit longer.

graphite graph

Troubleshooting

If you followed the above steps exactly, this should work just fine. But in case you don’t see your data in Graphite, try the following.

Check Graphite Troubleshooting

Try the troubleshooting solutions provided in Graphite installation tutorial.

Check Skyline log files

less +F /var/log/skyline/horizon.log
less +F /var/log/skyline/analyzer.log
less +F /var/log/skyline/webapp.log

Check That Every Process is Running

ps aux | grep "carbon-ca\|httpd\|collectd\|(wsgi:graphite)\|horizon-agent.py\|analyzer-agent.py\|horizon-agent.py\|carbon-relay.py\|webapp.py"

# result should look something like:
# [...]
# sudo /opt/collectd/sbin/collectd -C /opt/collectd/etc/collectd.conf -f
# [...]
# python /opt/skyline/bin/../src/horizon/horizon-agent.py start
# [...]
# python /opt/skyline/bin/../src/analyzer/analyzer-agent.py start
# [...]
# python /opt/skyline/bin/../src/webapp/webapp.py restart
# /usr/sbin/httpd -DFOREGROUND
# [...]
# /bin/python bin/carbon-relay.py --instance=a start
# /bin/python bin/carbon-cache.py --instance=a start
# [...]

Conclusion

Skyline is the first of its kind: it detects anomalies and triggers alarms in real time. Skyline is definitely trying to build a better world, where DevOps don’t need to spend their time watching metrics!

But there are some drawbacks. It only detects very obvious anomalies. This isn’t too bad, as many anomalies are obvious, but it will still fail to detect complex anomalies. The idea was to build a solution you can extend with homemade detectors, but no third-party detectors have yet been released, so we are stuck with the basic functionality.

Of course I’m biased, but for easier monitoring and detection of many other anomaly types, I recommend using our product “Anomaly.io”. We believe it beats Skyline in every way.

Anomaly Detection Using K-Means Clustering

kmean anomaly

Monitored metrics very often exhibit regular patterns. When the values are correlated with the time of the day, it’s easier to spot anomalies, but it’s harder when they do not. In some cases, it is possible to use machine learning to differentiate the usual patterns from the unusual ones.

Regular Patterns

Many metrics follow regular patterns correlated to time. Websites, for example, commonly experience high activity during the day and low activity at night. This correlation with time makes it easy to spot anomalies, but it’s more difficult when that correlation is not present.

1. With Time Correlation

When the metric is correlated to time, the key point is to find its seasonality. When today’s pattern is the same as yesterday’s, the seasonality is daily. It is also common to find a seasonality of one week, because Saturday’s patterns often don’t follow Friday’s, but rather those of the Saturday of the previous week. Because many metrics are correlated to human activity, seasonality is often weekly. But it could also be measured in seconds, minutes, hours, months or an arbitrary period.

Subtracting the usual pattern (seasonality) from the overall signal should create a new “difference metric”. Due to randomness, the resulting signal should be pure noise with a mean of zero. Under such conditions, we can apply the normal distribution to detect anomalies. Any values that are too high or too low can be treated as anomalous.
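
A minimal sketch of that approach in R (the daily pattern and the data below are simulated; the heartbeat example later in this post uses Octave/Matlab instead):

# sketch: subtract the known seasonal pattern, then apply 3-sigma to the residual
set.seed(3)
period  = 24                                    # e.g. hourly data with a daily pattern
pattern = sin(2 * pi * (1:period) / period)     # the "usual" shape of one day
metric  = rep(pattern, 14) + rnorm(14 * period, sd = 0.1)
metric[200] = metric[200] + 2                   # inject one anomaly

diffm  = metric - rep(pattern, 14)              # the "difference metric": pure noise when normal
limits = mean(diffm) + c(-3, 3) * sd(diffm)
which(diffm < limits[1] | diffm > limits[2])    # should flag index 200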

substract seasonality

2. Without Time Correlation

A heartbeat has many recurring patterns. The heart beats on average every 0.8s, but this is an average period, not a fixed period of time. Moreover, the period and the value of the signal might change a lot due to physical activity, stress or other effects. So it isn’t possible to just use a period of 0.8 seconds.

So what can we do? Machine learning to the rescue!

Machine Learning Detection

We can group similar patterns into categories using machine learning. Then, we subtract each new beat from its closest category. If the original beat and the category beat are very similar, the result should be pure noise with a mean of zero. As above, it’s now possible to apply the normal distribution to detect anomalies. For an anomalous beat, even the closest category will still be very different. The anomaly will be easy to detect as it will create a peak in the “difference metric”.

This requires 4 steps:

  1. Sliding Window
  2. Clustering
  3. Noise Transform
  4. Detect Anomalies

1. Sliding Window

The first step is to “cut” the normal heartbeat signal into smaller chunks. To get better results later, we are going to use a sliding window with an overlap of 50%. For each window, we will apply a Hamming function to clean the signal.
In the following heartbeat implementation, we choose a sliding window of 60 values, represented by a red rectangle. So, we will move the window by 30 values. For each window, we multiply the heartbeat signal (in green) by the Hamming function (in red). The end result (in green) is saved to create categories in the next step.

hamming sliding window

2. Clustering

The next step is to group together similar patterns produced by the sliding window. We will use one machine learning technique known as k-means clustering using Matlab/Octave or Mahout. This will cluster our signal into a catalogue of 1000 categories. In the following schema, some categories are plotted.

kmean clusters2


3. Noise Transform

For each sliding window, we select the closest category from our catalogue of patterns created previously. Subtracting the sliding window from its category should produce a simple noisy signal. If the signal is repeating and enough clusters were chosen, the whole signal can be turned into a normally distributed noisy signal with a mean of zero. Because this noisy signal is normally distributed, we can apply the normal distribution to detect anomalies. We can plot its histogram, or use the 99.999% percentile, while highlighting the threshold limit to apply to the noisy signal.

substract window to cluster

normal distribution

kmean histogram

4. Detect Anomalies

For each new sliding window, we repeat the above process to turn it into a noise signal. Previously we defined minimum and maximum thresholds; the noisy sliding window should stay within those limits. If not, this is an anomaly! But that’s not all. We also defined the noise signal as being normally distributed with a mean of zero. If it suddenly changes into another distribution, something unusual or anomalous is happening.

heart anomaly animationMin / Max anomaly

graph normal distribution changeddistribution anomaly

normal distribution changeddistribution anomaly

Implementation

1. Requirement

First, you need to download and install Octave (recommended) or Matlab. A few packages will be used in this tutorial. Open an Octave or Matlab shell, then install and import these packages:

% Install package
pkg install -forge io
pkg install -forge statistics
pkg install -forge control
pkg install -forge signal

% Import package
pkg load signal
pkg load statistics

2. Sliding Window

You could use the WFDB Toolbox from physionet.org to download a heartbeat signal and convert it into Octave/Matlab format, but it’s easier to simply download the converted file “16773m.mat”.

Load the signal in Octave/Matlab, convert it into sliding windows, and apply the Hamming function:

% change with your own path
cd /path/to/16773m.mat

% load &amp; plot some heartbeat
signal = load("16773m.mat").val(1,:);
plot(signal(1:1000))

% convert to sliding windows
% http://www.mathworks.com/help/signal/ref/buffer.html
windows = buffer(signal, 60, 30);

% apply hamming function
% http://www.mathworks.com/help/dsp/ref/windowfunction.html
mutate = windows.*hamming(60);
points = mutate.';
cleanPoints = points(2:end-1, :);

% save for Mahout
dlmwrite('heatbeat.txt',cleanPoints,'delimiter',' ','precision','%.4f');

All sliding windows are saved in memory into the “cleanPoints” variable and also the “heatbeat.txt” text file. The variable will be used for clustering into Octave/Matlab; the text file will be used for clustering using Mahout.

3. Clustering

Let’s now find common patterns from the signal. We will use the k-means clustering technique, which is part of the machine learning field. It works by grouping together points based on their nearest mean.

using octave or mahout

3.1. Clustering with Octave or Matlab

When data can fit into RAM, Octave or Matlab is a good choice. When clustering a small quantity of data, such as this heartbeat signal, you should use Octave or Matlab.

Octave and Matlab come with a k-means implementation in the statistics package. As it’s quite slow, we will use a faster implementation called Fast K-means. Download fkmeans.m and save the file into the directory of your Octave/Matlab workspace. Use the “pwd” command to find your workspace path.

If you save fkmeans.m in your current workspace, you should be able to create 1000 clusters:

[~, clusters, ~] = fkmeans(cleanPoints, 1000);

It should take around 30 minutes, so let’s save the file, just in case.

save "1000_clusters_fkmeans.mat" clusters -ascii;

If you have any trouble creating this file, you can download 1000_clusters_fkmeans.mat and simply load it within Octave or Matlab:

clusters = load("1000_clusters_fkmeans.mat");

3.2. Clustering with Mahout

When the data you want to cluster is huge and won’t fit into memory, you should choose a tool designed for this use case, such as Mahout.

First, download and install Mahout. Previously we created a file called “heatbeat.txt” containing all the sliding windows. You have to copy this file into a directory called “testdata”. Then, run the Mahout clustering from your shell:

mkdir testdata
cp /path/to/heatbeat.txt testdata/
mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job --input testdata --output out_1000 --numClusters 1000 --t1 200 --t2 100 --maxIter 50

A few hours later, the clustering result will be written into the “out_1000” directory. Now export the Mahout result into Octave or Matlab: let’s use cluster2csv.jar to convert this Mahout file into a CSV file.

java -jar cluster2csv.jar out_1000/clusters-50-final/part-r-00000 > 1000_clusters_mahout.csv

Finally, let’s import 1000_clusters_mahout.csv into Octave or Matlab:

clusters = load("1000_clusters_fkmeans.mat");

4. Noise Transform

To transform the original signal into a noise signal, we need a few steps. First, we need a function that can find the closest cluster for a given signal window:

function clusterIndices = nearKmean(clusters, data)
numObservarations = size(data)(1);
K = size(clusters)(1);
D = zeros(numObservarations, K);
for k=1:K
 %d = sum((x-y).^2).^0.5
 D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
[minDists, clusterIndices] = min(D, [], 2);
end

We need to subtract the window from the closest cluster:

function diffWindows = diffKmean(clusterIndices, clusters, data)
diffWindows = [];
for i = 1:length(clusterIndices)
 entry = data(i,:);
 cluster = clusters(clusterIndices(i),:);
 diff = cluster-entry;
 diffWindows = [diffWindows; diff];
end
end

And finally, a function to reconstruct the “difference signal”, also called the “error signal”:

function reconstruct = asSignal(diffWindows)
% Recombine the overlapping difference windows into a single 1-D error signal
rDiffWindows = diffWindows.';
p1 = rDiffWindows(1:size(rDiffWindows,1)/2, 2:end);
p2 = rDiffWindows((size(rDiffWindows,1)/2)+1:size(rDiffWindows,1), 1:end-1);
reconstruct = reshape(p1 + p2, 1, []);
end

Let’s apply the functions above to a sample and draw the error signal:

samples = cleanPoints(20200:20800, :);
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta)
ylim ([-200 600])

graph error signal

5. Detect Anomaly

Let’s simulate the beginning of a heart attack and detect the anomaly.

attack = cleanPoints(10002, :);
attack(35:43) = [100,350,325,250,410,380,300,200,100];
clusterId = nearKmean(clusters, attack);
cluster = clusters(clusterId,:);

plot(attack, "linewidth",1.5, 'Color', [0.1484375,0.7578125,0.71484375]);
ylim([-200 500]);
hold on;
plot(cluster, "linewidth",1.5, 'Color', [0.875486381,0.37109375,0.24609375]);
hold off;

heart attack

Clearly, there is a huge difference between the heart attack and its closest pattern. This plot highlights the anomaly.

diff = attack-cluster;
plot(diff, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

heart attack diff

Previously we established more rules than a simple minimum and maximum: the noisy “error signal” should have a mean of zero and follow a specific normal distribution. Let’s simulate a dysfunction that breaks this rule.

p1 = cleanPoints(10001:10020, :);
p2 = cleanPoints(10020:10040, :);
p2 = p2 - (p2*1.2) + 15;
samples = [p1; p2];
clusterIndices = nearKmean(clusters, samples);
diffWindows = diffKmean(clusterIndices, clusters, samples);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
ylim([-200 500]);

not-signal-normal-distribution

The error signal no longer looks normally distributed: it clearly doesn’t follow its usual distribution. It’s an anomaly.

clusterIndices = nearKmean(clusters, p2);
diffWindows = diffKmean(clusterIndices, clusters, p2);
reconstructDelta = asSignal(diffWindows);
plot(reconstructDelta, "linewidth",1.5, 'Color', "black");
hist(reconstructDelta, 50);

not normal distribution

Conclusion

This technique is useful only under certain conditions: apply it when the signal doesn’t repeat itself over a fixed period of time. If it does, for example daily or weekly, it is preferable to simply model the seasonality to detect anomalies.

We have only scratched the surface here by treating the error signal as normally distributed. In reality, there are many more ways of detecting anomalies in a normally distributed signal.
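
For instance, one classic option is the three-sigma rule: learn the mean and standard deviation of the error signal on known-normal data, then flag any new value that falls outside mean ± 3 standard deviations. The sketch below is only an illustration of that idea (Java, invented names; the factor of 3 is an assumption, not something derived in this post):

public class ThreeSigmaCheck {
  // Learn mean and standard deviation of the error signal on known-normal data,
  // then flag a new error value that falls outside mean +/- 3 * stdDev.
  public static boolean isAnomalous(double[] normalErrors, double newError) {
    double mean = 0;
    for (double e : normalErrors) mean += e;
    mean /= normalErrors.length;
    double variance = 0;
    for (double e : normalErrors) variance += (e - mean) * (e - mean);
    double stdDev = Math.sqrt(variance / normalErrors.length);
    return Math.abs(newError - mean) > 3 * stdDev;
  }
}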

The post Anomaly Detection Using K-Means Clustering appeared first on Anomaly.

]]>
https://anomaly.io/anomaly-detection-clustering/feed/ 2
Delta Time (Δt) for Anomaly Detectionhttps://anomaly.io/anomaly-detection-delta-time/ https://anomaly.io/anomaly-detection-delta-time/#comments Wed, 17 Jun 2015 08:46:33 +0000 https://anomaly.io/?p=2609 The post Delta Time (Δt) for Anomaly Detection appeared first on Anomaly.

]]>

delta time anomaly

It is common to monitor the number of events that occur in a period of time. Unfortunately, this technique isn’t fast and can fail to detect some anomalies. The alternative is to turn the problem around and study the period of time between events instead.

Poisson Distribution Doesn’t Work

Traditional monitoring records the number of events in a period of time. There are two major problems with this approach:

  • Due to randomness, it is difficult to spot an anomaly
  • When the event rate is low, detection becomes very slow

Previously, we went into detail and solved both problems by studying the time between events, using the Poisson distribution to detect anomalies (please read that article, if you haven’t yet, before continuing). To apply the Poisson distribution, we need to check that we are monitoring a Poisson process. For this to be the case, two conditions must hold.

1. Events need to occur independently

In server monitoring, events don’t always occur independently, due to the underlying architecture. For example, when a user writes to a database table, other users can’t access the same table at the same time. They must wait in a queue, and when the write finishes, the queue is processed. This clearly shows a dependency between events, so the Poisson distribution can’t be applied.

2. Average rate does not change

It is very common to see seasonal patterns in monitoring. Many servers are very busy during the day and much less so at night. This changing rate violates the constant-rate assumption of a Poisson process. One way to deal with this problem is to use the average rate observed at the same time one week ago to predict today’s rate.

Using Percentile of Δt

Poisson anomaly detection was a perfect fit for analyzing the time between events. Unfortunately, it doesn’t work in our case because our data aren’t Poisson distributed. Should we stop studying event rates and go back to event counts? Wait! Don’t give up yet, because there is a simple trick we can use. We are actually only interested in extremely rare event rates, so it doesn’t matter whether the data is Poisson distributed or normally distributed. What really matters is that the distribution has a lower and/or an upper tail where values are extremely rare.

The main idea is to monitor the interval of time between the event at time t and the previous event at time t-1. The time difference between both events is known as Δt = t – t-1. We can also monitor the time difference over the n previous events, which is Δt = t – t-n. Using n=1 provides the fastest detection. Using n>1 allows detection of smaller rate variations, but introduces some delay, since it needs to wait for n events before detecting an anomaly.
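
As a small illustration (a sketch only; the class name and API are invented), Δt for any n can be computed by keeping the timestamps of the last n events:

import java.util.ArrayDeque;
import java.util.Deque;

public class DeltaT {
  private final Deque<Long> lastEvents = new ArrayDeque<Long>();
  private final int n;

  public DeltaT(int n) { this.n = n; }

  // Returns Δt = t - t-n in milliseconds, or -1 until n previous events are known.
  public long onEvent(long timestampMillis) {
    long delta = lastEvents.size() >= n ? timestampMillis - lastEvents.peekFirst() : -1;
    lastEvents.addLast(timestampMillis);
    if (lastEvents.size() > n) lastEvents.removeFirst();
    return delta;
  }
}

For example, new DeltaT(5) returns, for each incoming event, the time elapsed since the 5th previous event.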

delta time arrival

Download the data set recording event timestamps in milliseconds, together with the precomputed Δt for n=1, n=5 and n=10.

  • The n=1 histogram doesn’t show any tail. Based on the distribution alone, values beyond the observed minimum and maximum look impossible. But, as we don’t know the true nature of the data, we will still apply min and max thresholds as a precaution.
  • Interestingly, the n=5 and n=10 histograms do show a tail. They are neither Poisson distributed nor normally distributed, but extremely low or high values can be considered anomalies.

delta time histogram n=1

delta time histogram n=5

delta time histogram n=10

Using the 0.001 and 0.999 percentiles:

  • for n=5, the percentiles are min=332 and max=1496
  • for n=10, the percentiles are min=354 and max=1304

Please note that in this example data set the first 1500 events are normal, before an anomaly occurs.

v = read.csv('delta-record.csv', sep=';', stringsAsFactors=F);
time = c(as.numeric(v[, 'time']));
n1 = c(as.numeric(v[, 'n1']));
n5 = c(as.numeric(v[, 'n5']));
n10 = c(as.numeric(v[, 'n10']));

qn1 = quantile(head(n1,1500), probs=c(0.001, 0.999), na.rm=T)
qn5 = quantile(head(n5,1495), probs=c(0.001, 0.999), na.rm=T)
qn10 = quantile(head(n10,1490), probs=c(0.001, 0.999), na.rm=T)

Plot the Δt data and the thresholds for different values of n:

color = c("black", "#26c2b7", "#e05d38");
plot(x= tail(time, 150), y = tail(n1, 150), type = "l", ylim=c(0,2100), lwd=3, xlab = "time", ylab = "Δt");
lines(x= tail(time, 150), tail(n5, 150),col=color[2], lwd=3);
lines(x= tail(time, 150), tail(n10, 150),col=color[3], lwd=3);

abline(h=qn1, col=color[1], lwd=2);
abline(h=qn5, col=color[2], lwd=2);
abline(h=qn10, col=color[3], lwd=2);

legend("topright", col=color, lwd=2, legend=c("n=1","n=5","n=10"));

delta time graph

Note: The aggregated values were divided by n so that all curves appear on the same graph. The logic stays unchanged.

A small anomalous rate change occurs at the end of the graph:

  • As discussed before, n=1 is very reactive, but it fails to detect the anomaly because every value stays above its minimum threshold.
  • At the end of the graph, n=5 drops below its lower threshold. It needs more delay than n=1, but at least it detects the anomalous change in the event rate.
  • n=10 requires even more time to detect the anomaly. It isn’t useful in this example, because it is far too slow. However, it could have detected a change in rate that n=5 wouldn’t have found.

As already explained, n=1 provides the fastest detection, but fails to detect smaller unusual rate variations. The bigger n is, the more anomalies it will detect by smoothing out the variations, unfortunately at the price of speed.

Including Event Rate λ

Usually, some change in rate will be noticed during monitoring. The event rate might even change a lot between day and night. Previously, we ignored such seasonality when calculating the thresholds, but as it happens, it is easy to integrate the event rate λ and normalize the result. It is as simple as multiplying the event rate λ by the events’ time difference Δt. The result is σ = λ * Δt. The value σ is just a normalized version of Δt, so we can apply the percentile technique exactly as we did before.
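
For example, using the same rates as the prototype below: at night the predicted rate is λ = 1 event per second, a typical gap is Δt = 1000 ms, and σ = 1 × 1000 = 1000; during the day λ = 10 and a typical gap of 100 ms gives σ = 10 × 100 = 1000 as well. Because σ is normalized by the expected rate, a single pair of percentile thresholds can be used day and night.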

The main difficulty is to have a correct estimate of λ. Based on the metric history, it might be possible to pick the rate observed exactly one week earlier; this can work perfectly in some cases. In complex cases, picking λ might be hard, but that is beyond the scope of this article. A minimal sketch of such a lookup follows.
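
As a rough sketch only (the per-minute bucketing, the in-memory storage and the fallback value are assumptions, not part of this article’s prototype), a predictor returning last week’s rate could implement the RateHistory interface shown further down:

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

// Hypothetical predictor: returns the event rate observed at the same minute one week ago.
public class WeekAgoRate implements RateHistory {
  private final Map<Long, Integer> eventsPerMinute = new HashMap<Long, Integer>();

  // Call this for every incoming event to build up the per-minute history.
  public void record(long timestampMillis) {
    long minute = timestampMillis / 60000;
    eventsPerMinute.merge(minute, 1, Integer::sum);
  }

  @Override
  public int predictCurrent() {
    long weekAgoMinute = (System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7)) / 60000;
    // Fall back to a rate of 1 if no history is available for that minute.
    // Note: the unit must match whatever unit the percentile thresholds were computed with.
    return eventsPerMinute.getOrDefault(weekAgoMinute, 1);
  }
}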

As explained in the chart, for each incoming event t:

  1. Select the event rate λ as it was one week ago
  2. Calculate the time difference Δt between event t and event t-n
  3. Multiply Δt by λ to find σ
  4. If σ is lower than the 0.001 percentile or higher than the 0.999 percentile, the event is an anomaly!

delta time flow detection

Implementation

Previously we learned how Δt anomaly detection works. Let’s now build a simple prototype in Java 8 using Joda-Time and CircularFifoQueue from Apache Commons Collections 4. CircularFifoQueue is a “first in, first out” queue that retains only the last N elements inserted: very useful for storing our last n events.

As explained above, when an event occurs, we check if the rate stays between the min and max percentile. But how do we detect a total cessation, when no events occur at all? The trick is to check the queue every 10 ms by introducing a virtual event with the current date. This virtual event doesn’t get stored in the queue; it is only used to check every 10 ms if the rate goes too low. In the following implementation, this is the autoCheck.

The Detection class takes three arguments:

  • List<Rule>: the min and max percentiles for each value of n
  • RateHistory: predicts the current rate λ
  • autoCheck: when enabled, checks every 10 ms whether the rate has dropped too low

The heart of the code is the small detect function, which calculates σ from Δt and λ and checks it against the percentile thresholds.

//Detection.java
import java.util.List;
import java.util.Date;
import java.util.Timer;
import java.util.TimerTask;
import org.apache.commons.collections4.queue.CircularFifoQueue;
import org.joda.time.Interval;

public class Detection {
  private CircularFifoQueue<Date> queue;
  private List<Rule> rules;
  private RateHistory rateHistory;

  public Detection(List<Rule> rules, RateHistory rateHistory, boolean autoCheck){
    this.rules = rules;
    this.rateHistory = rateHistory;
    // The queue only needs to keep as many timestamps as the largest n
    // (rules are assumed to be sorted by increasing n)
    queue = new CircularFifoQueue<Date>(rules.get(rules.size()-1).n);
    if(autoCheck){
      autoCheckEveryMillis(10);
    }
  }

  private void autoCheckEveryMillis(int millis) {
    Detection me = this;
    new Timer().schedule(new TimerTask() {
      public void run()  {
        me.checkAllNoEvent();
      }
    }, 0, millis);
  }

  public void newEvent(){
    queue.add(new Date());
    checkAllEvent();
  }

  public void checkAllEvent(){
    for(Rule rule : this.rules){
      if(queue.size() >= rule.n){
        checkEvent(rule);
      }
    }
  }

  public void checkAllNoEvent(){
    for(Rule rule : this.rules){
      if(queue.size() >= rule.n){
        checkNoEvent(rule);
      }
    }
  }

  private void checkEvent(Rule rule){
    Date last = queue.get(queue.size()-1);
    Date previous = queue.get(queue.size()-rule.n);
    detect(rule, last, previous);
  }

  public void checkNoEvent(Rule rule){
    // "Virtual event": use the current time without storing it in the queue
    Date last = new Date();
    Date previous = queue.get(queue.size()-rule.n+1);
    detect(rule, last, previous);
  }

  private void detect(Rule rule, Date last, Date previous){
    float deltaT = delta(previous, last);
    // sigma = lambda * deltaT; a small sigma means events arrive faster than predicted
    float sigma = this.rateHistory.predictCurrent() * deltaT;
    if(rule.minPercentile > sigma){
      System.out.println("ANOMALY - rate too high - found by n="+rule.n);
    }else if(sigma > rule.maxPercentile){
      System.out.println("ANOMALY - rate too low - found by n="+rule.n);
    }
  }

  private long delta(Date from, Date to){
    Interval interval = new Interval(from.getTime(), to.getTime());
    return interval.toDurationMillis();
  }
}

//Rule.java
import java.util.Date;

public class Rule {
  public Date date = new Date();
  public int n;
  public float minPercentile;
  public float maxPercentile;

  public Rule(int n, float min, float max){
    this.n = n;
    this.minPercentile = min;
    this.maxPercentile = max;
  }
}

//FakeRate.java
public class FakeRate implements RateHistory {
  private boolean isNight = false;

  @Override
  public int predictCurrent() {
    // Predicted event rate: 10 events per second during the day, 1 at night
    if(isNight){
      return 1;
    }else{
      return 10;
    }
  }

  public void setInNightTime(){
    isNight = true;
  }

  public void setInDayTime(){
    isNight = false;
  }

}

//RateHistory.java
public interface RateHistory {
  public int predictCurrent();
}

Let’s try this code with two precalculated rules:

  1. n=5, with min and max percentiles of 3650 and 4750
  2. n=10, with min and max percentiles of 8100 and 9800

For this example, we created a FakeRate class. It predicts a rate of λ=10 events per second during daytime and λ=1 event per second during nighttime.

//Main.java
import java.util.ArrayList;
import java.util.List;

public class Main {
  private void sendEvent(Detection detection, int rate) 
      throws InterruptedException {
    for(int i=0; i<10; i++){
      Thread.sleep(rate);
      detection.newEvent();
    }
  }

  public void run() throws InterruptedException {
    List<Rule> rules = new ArrayList<Rule>();
    rules.add(new Rule(5, 3650, 4750));
    rules.add(new Rule(10, 8100, 9800));
    FakeRate rate = new FakeRate();
    Detection detect = new Detection(rules, rate, false);

    System.out.println("NIGHT TIME");
    rate.setInNightTime();
    sendEvent(detect, 1000);
    System.out.println("* set anomaly too fast...");
    sendEvent(detect, 850);
    System.out.println("* back to normal");
    sendEvent(detect, 1000);
    System.out.println("* set anomaly too slow...");
    sendEvent(detect, 1100);
    System.out.println("\n\rDAY TIME");
    rate.setInDayTime();
    System.out.println("* back to normal");
    sendEvent(detect, 100);
    System.out.println("* set anomaly too fast...");
    sendEvent(detect, 75);

    // Check detection when no events arrive at all
    System.out.println("\n\rWITH AUTOCHECK");
    Detection detect2 = new Detection(rules, rate, true);
    sendEvent(detect2, 1000);
    System.out.println("* set anomaly: events stop...");
    Thread.sleep(110000);
  }

  public static void main(String[] args) throws Exception {
    new Main().run();
  }
}

First anomaly:

  • n=5 detects the anomaly quite fast
  • n=10 requires 4 extra events to detect it

Second anomaly:

  • The change in rate is too small for n=5 to detect
  • n=10 detects the problem after 12 (small) anomalous events

When transitioning to “daytime,” many anomalies are detected. This is because the expected rate becomes 10x faster while the events already in the queue arrived at the night-time pace. In reality, the transition would be smoother, so it shouldn’t trigger any alert.

Third anomaly:

  • n=5 detects the anomaly quickly
  • n=10 requires 2 extra events to detect it

In the last part, we create an anomaly by not sending any events at all. Since autoCheck is enabled (as it always should be), it detects the abnormally low rate within 10 ms of σ crossing the threshold.

Conclusion

As discussed in detail when building an anomaly detection using the Poisson distribution, focusing on the event rate will detect anomalies very quickly. Under high rates, a few milliseconds can be enough to spot an anomaly.

Anomaly detection studies outliers: it doesn’t matter what the distribution is as long as outliers can be detected. When plotting the distribution of Δt, a tail or two might become visible. The upper tail (unusually large Δt) can be used to detect a rate that is too low, and the lower tail a rate that is too high. If no tail is present, choose higher values of n. Still no tail in sight? Maybe you should consider other anomaly detection techniques…

The key to the success of this technique is to choose the rate predictor λ very carefully. As every calculation depends on the λ prediction, picking the wrong value will result in many false positives.

The post Delta Time (Δt) for Anomaly Detection appeared first on Anomaly.

]]>
https://anomaly.io/anomaly-detection-delta-time/feed/ 3