Friday, July 3, 2015

Anomaly Detection with the Coefficient of Variation in Graphtie

Whether in operational or security metrics anomaly detection can be a tricky thing. Trying to nail down a reliable heuristic that can work across a varied set of time-series data is not easy. Static checks on the other hand can be much more straight forward; put a message in Slack when the disk usage exceeds 95% or send an alert to Pager Duty if average user CPU exceeds 90%. Given the simplicity of direct threshold checks why even attempt to come up with anomaly detection? Answer: automation.

At work we monitor multiple environments with a wide range of application level time-series transactional metrics. These metrics can vary greatly between environments. Instead of trying to predict transaction rates ahead of time it would be beneficial to automate the creation of a check that was independent of the environment yet high fidelity enough that it would be worth paying attention to. In our shop we send application level metrics to statsD which forwards to a Graphite cluster. We alert on this Graphite cluster using Seyren.

Graphite has some pretty cool functions built in to help with anomaly detection such as Holt-Winters aberrations. Perhaps in a later blog post I can comment on how to make use of it. For this post I want to concentrate on a common statistical equation called the coefficient of varation. From Wikipedia:
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean. 




The CV is valuable because it assumes that the standard deviation should always be related to the mean. It is especially useful here because we needed a generic calculation that would be independent of the transactional metric being checked. It is also important to note that the CV is a dimensionless metric. Therefore you get a type of heuristic after you make the calculation. 

Thankfully, Graphite has a standard deviation function built in as well as a divide series function. In Graphite you can build a function that looks something like this to calculate the CV:




In the above $METRIC is the Graphite metric you are trying to calculate. An example would be a namespace something like: server1.application_metrics.transactions_rate. The $CV_WINDOW is the length of time you want to go back to calculate. I think Seyren has trouble going back more then ten to fifteen minutes but I could be wrong.

To make it actionable you place the formula first into Graphite and see what historical data looks like. Cross reference any known trouble spots by the CV value and you can start to formulate an alerting threshold. Perhaps you want to warn when the heuristic gets to 3.0 and alert when it gets to 4.0. Input those values into the Seyren and you will get a graph that looks like below.


At the beginning of this post I said one of the driving reasons for going through all of this trouble was automation. Seyren has an API that can be used to set up alerts. I have hosted on my Github account a shell script that can automate CV checks as well as time-shift and Holt-Winters (I will try to cover my approach to these two in a later post):

https://github.com/ChrisRimondi/devops_security/blob/master/seyren_template.sh

The idea is that you can wire up the script to fire when a new environment or application server is provisioned.

3 comments:

AWS Glue, Fitbit and a "Health Data Lake" - part 1

A couple years ago I got a Charge HR Fitbit device. I have worn it off and on for the past couple years. It has been mildly entertaining to ...