Saturday, July 18, 2015

Anomaly Detection with Holt-Winters in Graphite

My final post in this series on anomaly detection in Graphite will deal with Holt-Winters functions. Graphite has a few functions here that are based off of Holt-Winters predictions. I will attempt to look at the use of some of them and end up showing a simple way for alerting on anomalies similar to timeShift() and coefficient of variation.

Before we get to Holt-Winters it is probably a good idea to explain a concept called smoothing and how it can be helpful when trying to understand data. For a variety of reasons when examining time-series data large variations can be seen in a single data point or small group of data points that are not interesting from an analysis standpoint. The non-technical term is called a "fluke". If you have an alerting system wired to do pure threshold based monitoring, i.e. if you see latency of a transaction greater than 500ms send an alert, you could get a large number of false positives due to the occasional fluke.  Quirks in how the application records metrics or even your metric collection system itself can contribute to this phenomenon.

Therefore, it can be beneficial to smooth the data out before performing any action on it, such as alerting. One simple way to do this is through the use of window functions. In a time-series data set they simply take the last 'K' number of data points and perform a function on them. That function could be a sum total, lowest, highest, 90th percentile, etc... Graphite provides the summarize() function to do just that. Sticking with our latency example you could plot the average latency from a data set over the past five minutes.

Window functions are useful in so far as you can assume the most recent data is the most relevant. Variations on window functions even allow for things such as assigning weights where more recent data is "counted more" and older data is "counted less". The exponential smoothing function is a type of weighted average. As you get further away from the current time the data points count less and less towards your smoothing operation.

The above discussed functionality doesn't really account for data being seasonal or to put another way that it can trend. In the last post I used the example where transaction volume increased during business hours. Techniques called double and triple exponential smoothing were invented to account for both the relevant timeliness of data and its seasonal nature; of which Holt-Winters is one. Check out the Wiki page for a breakdown of all the statistical equations.

Graphite has four functions that can help plot Holt-Winters series. One of those functions is the holtWintersConfidenceBands(). It plots an upper bound and a lower bound series based on data for the previous week to seed it. Below is an example on some transaction data. The blue line is the lower bound, the green line is the upper bound and the red line is the actual data set. You can see that Holt-Winters did a pretty good job predicting where the real data would fall.



In addition to the confidence bands, Graphite has another useful function called holtWintersAberration(). It takes a series and plots the delta between what Holt-Winters predicted and the actual value. Similar to what we did in my last two posts we can take this value and create a type of dimensionless heuristic using the following function by relating the Aberration to the original metric. 



The Holt-Winters Aberration can either be positive or negative depending on its relationship to the original metric. In order to simplify alerting, I take the absolute value so my alerting function in Seyren can be a single vector. In order to "weaponize" this for production pick a metric of interest and look at the historical values of the above function. Cross-reference those values with known trouble times and select an alerting threshold based on that comparison. I will mention again that I simplify a lot of what is covered in the past three posts in a simple shell script that is available in my Github repo




Saturday, July 11, 2015

Anomaly Detection with timeShift() in Graphite

In my last post I discussed combining Graphite, Seyren and a little math equation called the coefficient of variation to come up with a statistical way to detect anomalies in your time-series data. As promised in that post I will cover two more functional capabilities, native to Graphite to find anomalies. The first and simplest of these functions is called timeShift() which I will review here. It operates on a series of data and shifts it a user specified amount of time. For example the function below shifts the data target metric back seven days.


Viewing the timeShift() on a metric would be an interesting thing to spot check on a dashboard. For example what is the average number of 404's my site gets per hour on this day at this time. (Related there is another Graphite function called timeStack() that graphs the data for a specified interval that would likely be even more useful for a dashboard.)

However, our purpose in this blog post is not to simply draw a helpful dashboard but to do some basic anomaly detection by creating a data series that is a dimensionless heuristic similar to the coefficient of variation. In order to do this we can relate the timeShift() to the metric it is operating on. This can be accomplished by taking the original metric, subtracting the timeShift, dividing by said metric and then taking the absolute value (While taking the absolute value does result in some information loss it makes it easier to produce single dimension alerts). Represented in mathematical notation:


This function can be used on nearly any time-series metric to produce a meaningful calculation. In Graphite it is represented in the function below; where $METRIC is the interesting time-series data you care about and $SHIFT is the amount of time you want to shift back (i.e. 7 days). 

You now have a metric that can be monitored in Seyren. By plotting this historically and comparing its value to when there was system trouble it is possible to create meaningful warning and error alerts in Seyren.

In closing we should lay out the assumptions implicit in detecting anomalies with the coefficient of variation and the time-shift approach outlined above. Coefficient of variation calculations assume that recent data is most relevant to predicting anomalies. It essentially asks the question, is this data very different from a recently calculated mean? Alerting on it requires you to ask how much is too much for this data to change in this time window?  The time-shift assumes something entirely different. It assumes your data is seasonal, i.e. it has predictable patters such as business hours traffic. When you alert on this type of data you compare it to a previous time period and check if it varies within an acceptable level. Both coefficient of variation and time-shift checks have their place. However, what if you could have your cake and eat it to? That is essentially what Holt-Winters is all about and on my next post I will cover how to write Seyren checks based on those calculations.

Friday, July 3, 2015

Anomaly Detection with the Coefficient of Variation in Graphtie

Whether in operational or security metrics anomaly detection can be a tricky thing. Trying to nail down a reliable heuristic that can work across a varied set of time-series data is not easy. Static checks on the other hand can be much more straight forward; put a message in Slack when the disk usage exceeds 95% or send an alert to Pager Duty if average user CPU exceeds 90%. Given the simplicity of direct threshold checks why even attempt to come up with anomaly detection? Answer: automation.

At work we monitor multiple environments with a wide range of application level time-series transactional metrics. These metrics can vary greatly between environments. Instead of trying to predict transaction rates ahead of time it would be beneficial to automate the creation of a check that was independent of the environment yet high fidelity enough that it would be worth paying attention to. In our shop we send application level metrics to statsD which forwards to a Graphite cluster. We alert on this Graphite cluster using Seyren.

Graphite has some pretty cool functions built in to help with anomaly detection such as Holt-Winters aberrations. Perhaps in a later blog post I can comment on how to make use of it. For this post I want to concentrate on a common statistical equation called the coefficient of varation. From Wikipedia:
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean. 




The CV is valuable because it assumes that the standard deviation should always be related to the mean. It is especially useful here because we needed a generic calculation that would be independent of the transactional metric being checked. It is also important to note that the CV is a dimensionless metric. Therefore you get a type of heuristic after you make the calculation. 

Thankfully, Graphite has a standard deviation function built in as well as a divide series function. In Graphite you can build a function that looks something like this to calculate the CV:




In the above $METRIC is the Graphite metric you are trying to calculate. An example would be a namespace something like: server1.application_metrics.transactions_rate. The $CV_WINDOW is the length of time you want to go back to calculate. I think Seyren has trouble going back more then ten to fifteen minutes but I could be wrong.

To make it actionable you place the formula first into Graphite and see what historical data looks like. Cross reference any known trouble spots by the CV value and you can start to formulate an alerting threshold. Perhaps you want to warn when the heuristic gets to 3.0 and alert when it gets to 4.0. Input those values into the Seyren and you will get a graph that looks like below.


At the beginning of this post I said one of the driving reasons for going through all of this trouble was automation. Seyren has an API that can be used to set up alerts. I have hosted on my Github account a shell script that can automate CV checks as well as time-shift and Holt-Winters (I will try to cover my approach to these two in a later post):

https://github.com/ChrisRimondi/devops_security/blob/master/seyren_template.sh

The idea is that you can wire up the script to fire when a new environment or application server is provisioned.

AWS Glue, Fitbit and a "Health Data Lake" - part 1

A couple years ago I got a Charge HR Fitbit device. I have worn it off and on for the past couple years. It has been mildly entertaining to ...