Saturday, July 11, 2015

Anomaly Detection with timeShift() in Graphite

In my last post I discussed combining Graphite, Seyren and a little math equation called the coefficient of variation to come up with a statistical way to detect anomalies in your time-series data. As promised in that post, I will cover two more capabilities native to Graphite that can be used to find anomalies. The first and simplest of these is a function called timeShift(), which I will review here. It operates on a series of data and shifts it by a user-specified amount of time. For example, it can shift a target metric back seven days.
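A minimal sketch of such a call, assuming a hypothetical metric path `stats.web.errors_404` (the name is illustrative only). When no sign is given, Graphite shifts the series back in time, so "7d" and "-7d" behave the same way:

```text
timeShift(stats.web.errors_404, "7d")
```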


Viewing the timeShift() of a metric would be an interesting thing to spot check on a dashboard. For example, what is the average number of 404s my site gets per hour on this day at this time? (Relatedly, there is another Graphite function called timeStack() that graphs several time-shifted copies of a series over a specified interval, which would likely be even more useful for a dashboard.)

However, our purpose in this blog post is not simply to draw a helpful dashboard but to do some basic anomaly detection by creating a data series that is a dimensionless heuristic similar to the coefficient of variation. To do this we can relate the timeShift() to the metric it is operating on: take the original metric, subtract the time-shifted series, divide the result by the original metric, and then take the absolute value. (Taking the absolute value does lose some information, namely the direction of the change, but it makes it easier to produce single-dimension alerts.)
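In mathematical notation, the calculation described above can be written as follows. The symbols are my own shorthand rather than anything Graphite defines: x(t) is the value of the metric at time t, s is the shift interval, and h(t) is the resulting heuristic.

```latex
\[
  h(t) = \left| \frac{x(t) - x(t - s)}{x(t)} \right|
\]
```

A value of 0 means the metric matches its value one shift interval ago; a value of 0.4 means the gap between the two is 40% of the current value.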


This function can be used on nearly any time-series metric to produce a meaningful calculation. In Graphite it can be expressed with built-in functions, where $METRIC is the interesting time-series data you care about and $SHIFT is the amount of time you want to shift back (e.g. 7 days).
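One way to write this as a Graphite target, sketched with the standard diffSeries(), divideSeries(), absolute() and timeShift() functions. $METRIC and $SHIFT are placeholders to substitute yourself (for example "7d" for the shift), and because divideSeries() expects a single divisor series this assumes $METRIC resolves to exactly one series rather than a wildcard:

```text
absolute(divideSeries(diffSeries($METRIC, timeShift($METRIC, "$SHIFT")), $METRIC))
```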

You now have a metric that can be monitored in Seyren. By plotting it historically and comparing its values against periods of known system trouble, it is possible to choose meaningful warning and error thresholds for a Seyren check.
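To make the calculation and the thresholding concrete outside of Graphite, here is a small Python sketch. It is not how Graphite or Seyren implement anything; the function names, the sample counts and the 0.25 / 0.5 thresholds are all hypothetical and only illustrate the idea of comparing each point to the one a fixed shift earlier.

```python
def shifted_relative_change(values, shift):
    """Return |(x[t] - x[t - shift]) / x[t]| for each index t.

    Points with no shifted counterpart, or where x[t] is zero,
    are treated as undefined and reported as None.
    """
    result = []
    for t, current in enumerate(values):
        if t < shift or current == 0:
            result.append(None)
            continue
        previous = values[t - shift]
        result.append(abs((current - previous) / current))
    return result


def classify(value, warn=0.25, error=0.5):
    """Map a heuristic value to a hypothetical alert level."""
    if value is None:
        return "unknown"
    if value >= error:
        return "error"
    if value >= warn:
        return "warn"
    return "ok"


# Made-up 404 counts: two periods of three samples each, so shift = 3.
counts = [100, 120, 110, 100, 200, 110]
heuristic = shifted_relative_change(counts, shift=3)
levels = [classify(v) for v in heuristic]
```

In this made-up data the fifth sample jumps from 120 to 200 compared with one period earlier, giving a heuristic of 0.4, which lands between the hypothetical warning and error thresholds.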

In closing, we should lay out the assumptions implicit in detecting anomalies with the coefficient of variation and with the time-shift approach outlined above. Coefficient of variation calculations assume that recent data is most relevant to predicting anomalies. They essentially ask the question: is this data very different from a recently calculated mean? Alerting on it requires you to ask how much is too much for this data to change in this time window. The time-shift approach assumes something entirely different. It assumes your data is seasonal, i.e. it has predictable patterns such as business-hours traffic. When you alert on this type of data you compare it to a previous time period and check whether it varies within an acceptable level. Both coefficient of variation and time-shift checks have their place. However, what if you could have your cake and eat it too? That is essentially what Holt-Winters is all about, and in my next post I will cover how to write Seyren checks based on those calculations.
