Saturday, May 19, 2018

AWS Glue, Fitbit and a "Health Data Lake" - part 1

A couple years ago I got a Charge HR Fitbit device. I have worn it off and on for the past couple years. It has been mildly entertaining to observe my sleep patterns, track my steps, and benchmark my resting heart rate. As expected, consumers get the data from Fitbit in a formatted fashion, presented in tables, charts, and other visualizations. I began to wonder if the data that made up those visualizations was accessible to consumers in its raw format.

As it turns out, I was not the only person to wonder about this. After some googling I stumbled across two relevant articles by Stephen Hsu.

The first article explains how he collected the data and the second describes what he learned from the data. These two articles really kick-started my understanding of what the Fitbit API is capable of providing. Much thanks to Stephen for sharing these!

OAuth hack job

Fitbit provides an API where developers can query up to 30 days' worth of historical data. Unfortunately, there is no standard API key for machine-to-machine access; it uses OAuth 2.0 for authorization. My goal was, and is, to regularly pull data in an automated fashion. The lack of an API key option was going to make this a total hack job. Thankfully, I am in Operations. Duct tape and baling wire solutions are my specialty.

A custom Python Fitbit client repo was linked to in the first article: https://github.com/ChrisRimondi/python-fitbit. It uses 'cherrypy' to spin up a local web server that your browser is redirected to in order to cache OAuth credentials. Here is the sequence of steps I took to automate this:
  1. Create an EC2 t2.micro Ubuntu 16.04 instance
  2. Install xfce for the UI using the AWS guidelines here: https://aws.amazon.com/premiumsupport/knowledge-center/connect-to-ubuntu-1604-windows/
  3. Use VNC to log into the UI of the instance
  4. Install Firefox as the default browser
  5. Create a simple bash script that will execute my implementation of the Python client
    1. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/get_health.sh
    2. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/sample_env_vars.sh
  6. Run through the script once while logged into the UI. Log into the Fitbit API developer portal when prompted by the script. (I realize this isn't permanent, but it buys me some automation.)
  7. Create a cron job to run the script daily so it retrieves my heart and sleep data without me needing to be in front of the machine. 
    1. You will see in the sample env vars linked to above that I had to source the browser and display. 
The script downloads the data to a CSV file and uploads that data to an S3 bucket. Given the limitation of Fitbit's API that you can only grab the previous 30 days of data, this provides me with a way to keep the data in perpetuity. 
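The heavy lifting lives in the linked scripts, but the shape of the transformation is easy to sketch. Below is a minimal, hypothetical example of flattening Fitbit's intraday heart rate response into CSV rows; the response shape follows Fitbit's public intraday API, while `intraday_to_csv` and the sample values are made up for illustration and are not from the linked repo:

```python
import csv
import io

# Shape of Fitbit's intraday heart rate response (detail_level='1min');
# the sample values here are invented for illustration.
sample = {
    "activities-heart-intraday": {
        "dataset": [
            {"time": "00:00:00", "value": 64},
            {"time": "00:01:00", "value": 63},
        ]
    }
}

def intraday_to_csv(day, payload):
    """Flatten one day's intraday dataset into CSV text: date,time,value."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for point in payload["activities-heart-intraday"]["dataset"]:
        writer.writerow([day, point["time"], point["value"]])
    return buf.getvalue()

print(intraday_to_csv("2018-05-01", sample))
```

The resulting file can then be pushed to S3 with any S3 client.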

More Automation Frustrations

There is a setting in the Fitbit application called 'All-Day Sync'. My assumption was that this setting would gradually synchronize my Charge HR with my phone all day long over Bluetooth and not require me to log into the application for synchronization. This seems to be the case only for heart rate data. Sleep data does not synchronize unless you open the Fitbit application on your phone and let it do its thing. This process can take several minutes. The likelihood of me remembering to do this every day is slim. Therefore, I run my automation script over the past week's worth of data to make sure I don't miss any dates.  
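That weekly backfill amounts to generating the last seven dates and pulling each one. A sketch, using a hypothetical `backfill_dates` helper:

```python
from datetime import date, timedelta

def backfill_dates(today, days=7):
    """Return ISO date strings for the `days` days before `today`, oldest first."""
    return [(today - timedelta(days=i)).isoformat() for i in range(days, 0, -1)]

# Each returned date would then be fed to the Fitbit client as a base_date.
print(backfill_dates(date(2018, 5, 19)))
```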

AWS Glue

The second article on analyzing the data does local analysis using data frames in the Python pandas library. This is probably the most efficient way to analyze the data. Pandas data frames are excellent and very intuitive for performing analytical operations on data. However, my goal was not efficiency but learning. I often like to combine different concepts to maximize learning. Glue is one AWS service that I have not had much opportunity to use but have been interested in since it was announced at re:Invent a couple years ago. Therefore, while it is complete overkill, I decided to use Glue so that I could learn the service and build the foundation for other types of data, forming a kind of personal "Health Data Lake" that is available for me to run analytics on. 

After my heart rate and sleep data was uploaded to S3, I first needed to add the data schema to a catalog so that other operations could reference it. At the start, all of my data resides in an S3 bucket with three folders:
  • sleep
  • heartrate
  • working
The last folder I added manually. It holds a simple CSV file that tracks the date and whether I worked that day. I think it would be interesting to see how working on a given day impacted different biometrics.

Glue has a very simple way of performing this categorization through its 'crawler' functions. AWS defines these as: "A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog." The crawler click-through wizard was intuitive and it categorized my data correctly. I was then able to reference the data through the AWS Glue context. 

From here I was able to create an ETL job that I ran on demand to access the sleep data and do a simple sum on the minutes I slept that night. That script can be found here: https://github.com/ChrisRimondi/fitbit_scripts/blob/master/fitbit_etl_job.py. The script then uploads the data to S3 in another CSV formatted file. It is this file I will continue to build upon by joining disparate data sources and then performing analytical operations on them. 
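The linked job uses Glue dynamic frames, but stripped of the Glue plumbing the transformation is just a group-by-date sum. A plain-Python sketch with made-up rows (the tuple layout here is an assumption, not the exact schema of my CSV):

```python
from collections import defaultdict

# Hypothetical rows mirroring the sleep CSV: (date, minutes asleep in one segment)
rows = [
    ("2018-05-01", 300),
    ("2018-05-01", 95),
    ("2018-05-02", 410),
]

def total_sleep_minutes(records):
    """Sum sleep-segment minutes per date."""
    totals = defaultdict(int)
    for day, minutes in records:
        totals[day] += minutes
    return dict(totals)

print(total_sleep_minutes(rows))
```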

Note: The above script uses Glue dynamic frames and Spark data frames. Given the very low volume of data, this could easily have been accomplished with AWS Lambda. In the future, I might build some steps with Lambda. However, I chose the heavyweight Spark option over Lambda to learn something new. 

Next Steps

As mentioned above, I want to continue to use AWS Glue and build out a variety of ETL jobs to join different data sources. My initial columnar format will look something like this:

date | sleep minutes before | sleep minutes after | worked that day | average heart rate | step count

Past that, it will be interesting to add what I ate during the day and my exercise, as well as environmental data such as the weather in my location. I will also want to post-process the sleep and heart data to add information such as longest sleep cycle in minutes and highest sustained heart rate to the tuple. Once I have a sufficiently sized data set, I could use simple ML algorithms to look for relationships between the columns in the tuple.

This has been a fun exercise for both learning the Fitbit API and AWS Glue. It sets the foundation for further exploration in this space. Stay tuned for more blog posts on the subject.  

Saturday, March 12, 2016

Conflict in Engineering Organizations

My best days at work are the ones full of conflict. My worst days at work are the ones full of conflict. I am not a schizophrenic; I am referring to two completely distinct types of conflict. On the good side, I love cognitive conflict. It challenges me to wrestle with new ideas and think through different perspectives. Brainstorming, white-boarding and out-of-the-box problem solving are all examples of cognitive conflict thought processes. It is the type of conflict known to drive innovation. It is what we should strive for at work. 

I hate affective conflict. When the team’s energy turns from solving a problem to “solving a person” we have affective conflict. At work this type of conflict can be driven by situations of ambiguity in roles and responsibilities. It can also be driven by individuals who believe the corporate economy is a zero-sum game and look to consolidate power in strong-armed or manipulative fashions. Mental energy is no longer focused on the mission. We now have ‘Game of Thrones’ style corporate politics.

To break out of the cycle we should start with role clarity, knowing that it drives employee morale. There is nothing more important for productivity than employee morale. Most companies that are known for great employee morale and culture also have incredibly productive employees. Keep a casual tally of interesting open source projects or great technology blogs you read. Where do the authors work? Notice any patterns?

Furthermore, what fascinates me is how corporate culture impacts technology choices. The obvious example is Conway’s Law: "organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.” In my experience working with engineering organizations, simple and elegant systems are produced by teams who care more about the mission (cognitive conflict) and less about job security and pet projects (affective conflict). An engineer who solves a problem in a way that is “stupid simple” cares about the mission, as demonstrated by an easy-to-integrate system design. Meanwhile, building complex solutions is a great way to stake out corporate real estate, because few others will understand the Frankenstein system. 

Morale and culture are force multipliers. Done right, we have energized employees who innovate. Systems are produced that are elegant and simple. Their architecture is easily communicated to other teams and becomes leveraged in other projects. Done wrong, we have a cycle of protective technology decisions, or ones made with little input from other teams. These systems produce technical debt and are a burden to the organization. Therefore, maximize cognitive conflict at work and minimize affective conflict. Strive for clarity in roles and manage morale. Nothing is more important for engineering teams to succeed. 

Saturday, February 6, 2016

Top Five SRE Architecture Principles

When interacting with software developers at work, our site reliability engineering (SRE) team found common themes when discussing scalability issues surrounding the applications we support, independent of the service under consideration. Therefore, we discussed ways in which we could concisely communicate our expectations for application architecture when operating in a cloud environment. What we came up with is a distilled list of principles (only five) that we refer back to when consulting on new projects or evaluating technical specifications.

  1. Stateless - The state of a service should be determined by a shared database and not dependent on data local to the application. Storage should be treated as a service in itself, and the antiquated thinking of storage as a device should be avoided. 
  2. Scale linearly - An application should run as a single process on as small a footprint as possible. This enables the SRE team to scale services linearly in a granular fashion. Code logic should not necessitate a specific number of instances but be capable of scaling up or down as load changes. Discrete functionality is preferred such that there is a single and obvious metric to scale upon. 
  3. Minimal configuration - Services should require little to no configuration. We have found configuration management to be a ripe source of human error, therefore services should ship with sane defaults and infer as much as possible on startup from consideration of environment variables or service discovery. Thread and memory footprints should configure automatically maximizing the resource usage on an instance. A side benefit is that the less configuration options available on a service the less permutations are needed for testing. 
  4. Robust communication - A great quote from Release It! is, "Integration points are the number-one killer of systems. Every single one of those feeds presents a stability risk." Therefore, we ask that all integration points of a service be enumerated and have a proper harness to torture test data input and output. Communication should be asynchronous whenever possible (and it almost always is). Adequate controls should exist at integration points, including circuit breakers, time-outs, bulkheads, and protocol hand-shaking. 
  5. Application Visibility - At a minimum, all inbound and outbound transactions should have telemetry that provides visibility into the number/size of transactions, their type, and the time each transaction took to execute. Services should know if they are functioning correctly and make this health status available to the SRE team through some type of API (usually REST). Logging is also a critical component of application visibility. It should be obvious when reading the log files whether the service is working. 
There were three sources that we relied heavily on for inspiration. I want to explicitly call these out so we can give credit where credit is due and encourage people to look up these fantastic resources:

Finally, a word on the importance of principles. Modern systems are complex. Distributed, cloud systems are particularly complex because of the number of integration points. The allure of intuitively understanding systems with thousands of nodes is toxic. When designing these distributed systems we prefer the pattern/anti-pattern approach where principles are preferred over attempting to enumerate every possible failure scenario. You will hit failure edge cases you never thought of. Therefore, relying on principles instead of our own capability to exhaustively understand a system is a practical dose of humility. 

Saturday, July 18, 2015

Anomaly Detection with Holt-Winters in Graphite

My final post in this series on anomaly detection in Graphite will deal with Holt-Winters functions. Graphite has a few functions that are based on Holt-Winters predictions. I will look at the use of some of them and end up showing a simple way to alert on anomalies, similar to the timeShift() and coefficient of variation approaches.

Before we get to Holt-Winters, it is probably a good idea to explain a concept called smoothing and how it can be helpful when trying to understand data. For a variety of reasons, when examining time-series data, large variations can be seen in a single data point or a small group of data points that are not interesting from an analysis standpoint. The non-technical term for this is a "fluke". If you have an alerting system wired to do pure threshold-based monitoring, i.e. if the latency of a transaction exceeds 500ms, send an alert, you could get a large number of false positives due to the occasional fluke. Quirks in how the application records metrics, or even your metric collection system itself, can contribute to this phenomenon.

Therefore, it can be beneficial to smooth the data out before performing any action on it, such as alerting. One simple way to do this is through the use of window functions. In a time-series data set, they simply take the last K data points and perform a function on them. That function could be a sum, minimum, maximum, 90th percentile, etc. Graphite provides the summarize() function to do just that. Sticking with our latency example, you could plot the average latency from a data set over the past five minutes.
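For example, a summarize() target for that latency case might look like this (the metric name is made up for illustration):

```
summarize(app.checkout.latency, "5min", "avg")
```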

Window functions are useful insofar as you can assume the most recent data is the most relevant. Variations on window functions even allow for things such as assigning weights, where more recent data "counts more" and older data "counts less". The exponential smoothing function is a type of weighted average: as you get further away from the current time, the data points count less and less toward your smoothing operation.
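As a sketch of the idea (this is the textbook recurrence, not Graphite's internal implementation), single exponential smoothing can be written in a few lines of Python:

```python
def exponential_smoothing(points, alpha=0.3):
    """Single exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    alpha in (0, 1]; a higher alpha weights recent data more heavily.
    """
    smoothed = [points[0]]
    for x in points[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# A single spike at 100 is damped rather than passed straight through.
print(exponential_smoothing([10, 10, 10, 100, 10], alpha=0.5))
```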

The above functionality doesn't really account for data being seasonal or, to put it another way, for the fact that it can trend. In the last post I used the example where transaction volume increased during business hours. Techniques called double and triple exponential smoothing were invented to account for both the relative timeliness of data and its seasonal nature, of which Holt-Winters is one. Check out the Wikipedia page for a breakdown of all the statistical equations.

Graphite has four functions that can help plot Holt-Winters series. One of those is holtWintersConfidenceBands(). It plots upper and lower bound series, seeded with data from the previous week. Below is an example on some transaction data. The blue line is the lower bound, the green line is the upper bound, and the red line is the actual data set. You can see that Holt-Winters did a pretty good job predicting where the real data would fall.



In addition to the confidence bands, Graphite has another useful function called holtWintersAberration(). It takes a series and plots the delta between what Holt-Winters predicted and the actual value. Similar to what we did in my last two posts, we can take this value and create a type of dimensionless heuristic with the following function, by relating the aberration to the original metric. 
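A Graphite target consistent with that description would look something like this (the metric name is illustrative):

```
divideSeries(holtWintersAberration(app.transactions.rate), app.transactions.rate)
```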



The Holt-Winters aberration can be either positive or negative depending on its relationship to the original metric. To simplify alerting, I take the absolute value so my alerting function in Seyren can be a single vector. To "weaponize" this for production, pick a metric of interest and look at the historical values of the above function. Cross-reference those values with known trouble times and select an alerting threshold based on that comparison. I will mention again that I have wrapped much of what is covered in the past three posts into a simple shell script that is available in my Github repo.




Saturday, July 11, 2015

Anomaly Detection with timeShift() in Graphite

In my last post I discussed combining Graphite, Seyren and a little math equation called the coefficient of variation to come up with a statistical way to detect anomalies in your time-series data. As promised in that post, I will cover two more functional capabilities native to Graphite for finding anomalies. The first and simplest of these functions, which I will review here, is called timeShift(). It operates on a series of data and shifts it by a user-specified amount of time. For example, the function below shifts the target metric back seven days.
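A seven-day shift looks like this as a Graphite target (the metric name is made up for illustration):

```
timeShift(app.requests.count, "7d")
```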


Viewing the timeShift() of a metric would be an interesting thing to spot-check on a dashboard; for example, what is the average number of 404s my site gets per hour on this day at this time? (Relatedly, there is another Graphite function called timeStack() that graphs the data over a specified interval and would likely be even more useful for a dashboard.)

However, our purpose in this blog post is not simply to draw a helpful dashboard but to do some basic anomaly detection by creating a data series that is a dimensionless heuristic, similar to the coefficient of variation. To do this we can relate the timeShift() to the metric it operates on: take the original metric, subtract the timeShift, divide by said metric, and then take the absolute value. (While taking the absolute value does result in some information loss, it makes it easier to produce single-dimension alerts.) Represented in mathematical notation:
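Written out, the heuristic described above is:

```
heuristic = | (metric - timeShift(metric)) / metric |
```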


This function can be used on nearly any time-series metric to produce a meaningful calculation. In Graphite it is represented in the function below, where $METRIC is the interesting time-series data you care about and $SHIFT is the amount of time you want to shift back (i.e. 7 days). 
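One plausible form of that target, built from Graphite's diffSeries, divideSeries, timeShift and absolute functions, is:

```
absolute(divideSeries(diffSeries($METRIC, timeShift($METRIC, "$SHIFT")), $METRIC))
```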

You now have a metric that can be monitored in Seyren. By plotting it historically and comparing its value to times when there was system trouble, it is possible to create meaningful warning and error alerts in Seyren.

In closing, we should lay out the assumptions implicit in detecting anomalies with the coefficient of variation and the time-shift approach outlined above. Coefficient of variation calculations assume that recent data is most relevant to predicting anomalies. It essentially asks: is this data very different from a recently calculated mean? Alerting on it requires you to ask how much is too much for this data to change in this time window. The time-shift assumes something entirely different: that your data is seasonal, i.e. it has predictable patterns such as business-hours traffic. When you alert on this type of data, you compare it to a previous time period and check whether it varies within an acceptable level. Both coefficient of variation and time-shift checks have their place. However, what if you could have your cake and eat it too? That is essentially what Holt-Winters is all about, and in my next post I will cover how to write Seyren checks based on those calculations.

Friday, July 3, 2015

Anomaly Detection with the Coefficient of Variation in Graphite

Whether in operational or security metrics, anomaly detection can be a tricky thing. Trying to nail down a reliable heuristic that works across a varied set of time-series data is not easy. Static checks, on the other hand, are much more straightforward: put a message in Slack when disk usage exceeds 95%, or send an alert to PagerDuty if average user CPU exceeds 90%. Given the simplicity of direct threshold checks, why even attempt anomaly detection? Answer: automation.

At work we monitor multiple environments with a wide range of application-level time-series transactional metrics. These metrics can vary greatly between environments. Instead of trying to predict transaction rates ahead of time, it would be beneficial to automate the creation of a check that is independent of the environment yet high fidelity enough to be worth paying attention to. In our shop we send application-level metrics to statsD, which forwards to a Graphite cluster. We alert on this Graphite cluster using Seyren.

Graphite has some pretty cool functions built in to help with anomaly detection, such as Holt-Winters aberrations. Perhaps in a later blog post I can comment on how to make use of them. For this post I want to concentrate on a common statistical equation called the coefficient of variation. From Wikipedia:
The coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean. 
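In symbols, over the data points in the window:

```
CV = stddev / mean
```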




The CV is valuable because it relates the standard deviation to the mean. It is especially useful here because we needed a generic calculation that would be independent of the transactional metric being checked. It is also important to note that the CV is dimensionless; therefore, you get a type of heuristic once you make the calculation. 

Thankfully, Graphite has a standard deviation function built in, as well as a divide-series function. In Graphite you can build a function that looks something like this to calculate the CV:
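One plausible form of that target is below. (Note: in some Graphite versions, stdev's window argument is a number of datapoints rather than a time string, so treat this as a sketch rather than a copy-paste target.)

```
divideSeries(stdev($METRIC, $CV_WINDOW), movingAverage($METRIC, $CV_WINDOW))
```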




In the above, $METRIC is the Graphite metric you are calculating against. An example namespace would be something like server1.application_metrics.transactions_rate. The $CV_WINDOW is the length of time you want to go back for the calculation. I think Seyren has trouble going back more than ten to fifteen minutes, but I could be wrong.

To make it actionable, first place the formula into Graphite and see what the historical data looks like. Cross-reference any known trouble spots with the CV value and you can start to formulate an alerting threshold. Perhaps you want to warn when the heuristic reaches 3.0 and alert when it reaches 4.0. Input those values into Seyren and you will get a graph like the one below.


At the beginning of this post I said one of the driving reasons for going through all this trouble was automation. Seyren has an API that can be used to set up alerts. I have hosted on my Github account a shell script that can automate CV checks as well as time-shift and Holt-Winters checks (I will try to cover my approach to those two in a later post):

https://github.com/ChrisRimondi/devops_security/blob/master/seyren_template.sh

The idea is that you can wire up the script to fire when a new environment or application server is provisioned.

Saturday, May 31, 2014

Vulnerability Data into Elasticsearch

My day job has me focusing on Elasticsearch more these days. A while back I did a post on getting vulnerability data into ELSA. As a follow-up, I have been meaning to write a brief post on how to do the same with Elasticsearch. If you are not familiar with Elasticsearch, go check it out here. Their website classifies it as "distributed restful search and analytics". It is often combined with Logstash and Kibana, forming the "ELK" stack. The same reasons that having vulnerability data available alongside your event logs was a good idea in ELSA also apply if you are using the ELK stack. I modified my existing script to take an input file from one of several vulnerability scanners and index the results in Elasticsearch.

Before we begin, note that my Python script makes use of the Elasticsearch API client. I installed it via pip:

# pip install elasticsearch

I assume an index called vulns exists. You can create it by hitting the Elasticsearch API like this:
$ curl -XPUT http://localhost:9200/vulns
Different vulnerability scanners present time formats slightly differently. It is a good idea to format them appropriately. For more information, check the Elasticsearch docs here. This is a sample API call you could make:
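A mapping call along these lines would do it; note that the type name `vuln`, the field name `time`, and the exact format string are assumptions for illustration, and the format syntax follows Elasticsearch's date-format documentation:

```
curl -XPUT http://localhost:9200/vulns/_mapping/vuln -d '{
  "vuln": {
    "properties": {
      "time": {"type": "date", "format": "yyyy/MM/dd HH:mm:ss"}
    }
  }
}'
```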

After the indexes are created you can run the script with XML output from a vulnerability scanner as input.
python VulntoES.py -i nessus_report_test_home.nessus -e 192.168.1.183 -r nessus

I have created a very simple dashboard in Kibana to visualize some of the vulnerabilities.


The script and dashboard can be found at my Github page:
