Saturday, July 18, 2015

Anomaly Detection with Holt-Winters in Graphite

My final post in this series on anomaly detection in Graphite deals with the Holt-Winters functions. Graphite has a few functions based on Holt-Winters predictions. I will look at how some of them are used and finish by showing a simple way to alert on anomalies, similar to the timeShift() and coefficient of variation approaches from the earlier posts.

Before we get to Holt-Winters, it is probably a good idea to explain a concept called smoothing and how it can help when trying to understand data. When examining time-series data, a single data point or a small group of data points can show large variations that are not interesting from an analysis standpoint. The non-technical term for this is a "fluke". If you have an alerting system wired to do pure threshold-based monitoring (e.g. send an alert if a transaction's latency is greater than 500ms), you could get a large number of false positives due to the occasional fluke. Quirks in how the application records metrics, or even in your metric collection system itself, can contribute to this phenomenon.

Therefore, it can be beneficial to smooth the data before acting on it, for example before alerting. One simple way to do this is with window functions. On a time-series data set they take the last 'K' data points and apply a function to them. That function could be a sum, minimum, maximum, 90th percentile, etc. Graphite provides the summarize() function, which aggregates over fixed intervals, as well as sliding-window functions such as movingAverage(). Sticking with our latency example, you could plot the average latency over the past five minutes.
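As a rough illustration of the window idea outside of Graphite, here is a minimal Python sketch. The function name `rolling` and the sample latency values are made up for this example; it only shows how a single fluke (the 900ms point) affects a windowed mean versus a windowed median.

```python
from collections import deque
from statistics import mean, median


def rolling(values, k, func=mean):
    """Apply func to the last k points; emit None until the window is full."""
    window = deque(maxlen=k)  # old points fall off automatically
    out = []
    for v in values:
        window.append(v)
        out.append(func(window) if len(window) == k else None)
    return out


# Hypothetical latency samples in milliseconds, with one fluke at 900.
latency_ms = [120, 130, 125, 900, 128, 122, 131]

windowed_mean = rolling(latency_ms, k=5)
windowed_median = rolling(latency_ms, k=5, func=median)
```

The windowed mean is still pulled well above the typical latency by the fluke, while the windowed median stays close to it, which is one reason the choice of window function matters for alerting.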

Window functions are useful insofar as you can assume the most recent data is the most relevant. Variations on window functions even allow for assigning weights, so that more recent data is "counted more" and older data is "counted less". Exponential smoothing is one such weighted average: the further a data point is from the current time, the less it contributes to the smoothed value.
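The weighting described above can be sketched in a few lines of Python. This is simple (single) exponential smoothing; the name `exponential_smoothing` and the smoothing factor `alpha` are choices made for this example, not anything Graphite exposes.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing.

    alpha is between 0 and 1. Each new smoothed value is a blend of the
    newest observation and the previous smoothed value, so the weight of
    an observation shrinks by a factor of (1 - alpha) with every step.
    """
    smoothed = [values[0]]  # seed with the first observation
    for v in values[1:]:
        smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


result = exponential_smoothing([10, 20, 30], alpha=0.5)
```

A larger alpha makes the smoothed series follow recent data more closely; a smaller alpha gives older data more influence.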

The functionality discussed above doesn't account for data that trends over time or that is seasonal, meaning it follows a repeating pattern. In the last post I used the example where transaction volume increased during business hours. Double and triple exponential smoothing were invented to account for this: double exponential smoothing adds a trend component, and triple exponential smoothing adds a seasonal component on top of that. Holt-Winters is a triple exponential smoothing method. Check out the Wiki page for a breakdown of all the statistical equations.
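To make the level, trend and seasonal pieces concrete, here is a minimal sketch of the additive form of Holt-Winters in Python. It is not Graphite's implementation; the function name, the initialisation from the first two seasons, and the parameter names `alpha`, `beta` and `gamma` are assumptions made for this example.

```python
def holt_winters_additive(values, season_len, alpha, beta, gamma):
    """One-step-ahead additive Holt-Winters predictions.

    Returns one prediction for every point after the first season.
    alpha, beta and gamma smooth the level, trend and seasonal terms.
    """
    if len(values) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = values[:season_len]
    second = values[season_len:2 * season_len]

    # Assumed initialisation: level is the first season's mean, trend is
    # the average per-step change between the first two seasons.
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [v - level for v in first]

    predictions = []
    for t in range(season_len, len(values)):
        slot = t % season_len
        s = seasonal[slot]
        # Predict before looking at the actual value for this step.
        predictions.append(level + trend + s)

        prev_level = level
        level = alpha * (values[t] - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[slot] = gamma * (values[t] - level) + (1 - gamma) * s
    return predictions


# A perfectly repeating pattern with no trend is predicted exactly.
pattern = [10, 20, 30, 20] * 4
forecast = holt_winters_additive(pattern, season_len=4,
                                 alpha=0.5, beta=0.5, gamma=0.5)
```

Real data is noisier than this toy pattern, so the predictions will not match exactly; the gap between prediction and observation is what the anomaly detection below builds on.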

Graphite has four functions that can help plot Holt-Winters series. One of them is holtWintersConfidenceBands(). It plots an upper bound and a lower bound series, using the previous week of data to seed the prediction. Below is an example on some transaction data. The blue line is the lower bound, the green line is the upper bound and the red line is the actual data set. You can see that Holt-Winters did a pretty good job of predicting where the real data would fall.
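For intuition about how such bands can be derived, here is a simplified Python sketch: each band is the prediction plus or minus a multiple of an exponentially smoothed absolute error. The function name and the single running deviation are simplifications assumed for this example; a fuller implementation would typically track the deviation separately for each position in the season.

```python
def confidence_bands(actual, predicted, gamma=0.1, delta=3):
    """Return (lower, upper) band lists around the predictions.

    deviation is a smoothed absolute prediction error; delta controls
    how many deviations wide the bands are.
    """
    deviation = 0.0
    lower, upper = [], []
    for a, p in zip(actual, predicted):
        # Bands for this point use only the error seen on earlier points.
        lower.append(p - delta * deviation)
        upper.append(p + delta * deviation)
        deviation = gamma * abs(a - p) + (1 - gamma) * deviation
    return lower, upper


low, high = confidence_bands(actual=[12, 10, 10],
                             predicted=[10, 10, 10],
                             gamma=0.5, delta=2)
```

The bands widen right after a prediction miss and narrow again as the predictions recover, which is the behaviour visible in a confidence band plot.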



In addition to the confidence bands, Graphite has another useful function called holtWintersAberration(). It takes a series and plots how far the actual value falls outside the Holt-Winters confidence bands, and zero while the value stays within them. Similar to what we did in my last two posts, we can relate the aberration to the original metric to create a type of dimensionless heuristic using the following function.



The Holt-Winters aberration can be either positive or negative, depending on whether the original metric is above or below the prediction. To simplify alerting, I take the absolute value so that my alert in Seyren can check a single series against a single threshold. To "weaponize" this for production, pick a metric of interest and look at the historical values of the above function. Cross-reference those values with known trouble times and select an alerting threshold based on that comparison. I will mention again that I have wrapped much of what is covered in the past three posts in a simple shell script that is available in my GitHub repo.
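The alerting logic described above can be sketched in Python as follows. The function names, the band values and the threshold are all hypothetical, and the aberration is computed here as the distance outside a pair of confidence bands. In Graphite itself, one way such a ratio might be written as a target, assuming a hypothetical metric named app.latency, is absolute(divideSeries(holtWintersAberration(app.latency), app.latency)); that is an illustration rather than the exact expression used for this post.

```python
def aberration(actual, lower, upper):
    """Signed distance outside the band; 0.0 when inside it."""
    if actual > upper:
        return actual - upper
    if actual < lower:
        return actual - lower
    return 0.0


def aberration_ratio(actual, lower, upper):
    """Absolute aberration relative to the metric (dimensionless)."""
    if actual == 0:
        return None  # ratio is undefined for a zero metric
    return abs(aberration(actual, lower, upper) / actual)


def should_alert(ratio, threshold):
    """True when the ratio is defined and exceeds the chosen threshold."""
    return ratio is not None and ratio > threshold


# Hypothetical values: band of 80..120, threshold picked from history.
above = aberration_ratio(150, lower=80, upper=120)
below = aberration_ratio(50, lower=80, upper=120)
inside = aberration_ratio(100, lower=80, upper=120)
```

Because the ratio is always non-negative, a single "greater than" threshold covers both spikes and drops, and the threshold itself should come from comparing historical values against known trouble times.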



