Saturday, May 19, 2018

AWS Glue, Fitbit and a "Health Data Lake" - part 1

A couple of years ago I got a Fitbit Charge HR device, and I have worn it off and on since then. It has been mildly entertaining to observe my sleep patterns, track my steps, and benchmark my resting heart rate. As expected, consumers get their data from Fitbit in a formatted fashion, presented in tables, charts, and other visualizations. I began to wonder whether the data behind those visualizations was accessible to consumers in its raw form.

As it turns out, I was not the only person to wonder about this. After some googling I stumbled across two relevant articles by Stephen Hsu.

The first article explains how he collected the data and the second describes what he learned from the data. These two articles really kick-started my understanding of what the Fitbit API is capable of providing. Much thanks to Stephen for sharing these!

OAuth hack job

Fitbit provides an API where developers can query up to 30 days' worth of historical data. Unfortunately, there is no standard API key option for machine-to-machine access; it uses OAuth 2.0 for authorization. My goal was, and is, to pull data regularly in an automated fashion. The lack of an API key option was going to make this a total hack job. Thankfully, I work in Operations. Duct tape and baling wire solutions are my specialty.

A custom Python Fitbit client repo was linked in the first article: https://github.com/orcasgit/python-fitbit. It uses 'cherrypy' to spin up a local web server that your browser is redirected to in order to cache OAuth credentials. Here is the sequence of steps I took to automate this:
  1. Create an EC2 t2.micro Ubuntu 16.04 instance
  2. Install xfce for the UI using the AWS guidelines here: https://aws.amazon.com/premiumsupport/knowledge-center/connect-to-ubuntu-1604-windows/
  3. Use VNC to log into the UI of the instance
  4. Install Firefox as the default browser
  5. Create a simple bash script that will execute my implementation of the Python client
    1. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/get_health.sh
    2. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/sample_env_vars.sh
  6. Run through the script once while logged into the UI. Log into the Fitbit API developer portal when prompted by the script. (I realize this isn't permanent, but it buys me some automation.)
  7. Create a cron job to run the script daily so it retrieves my heart and sleep data without me needing to be in front of the machine. 
    1. You will see in the sample env vars linked above that I had to source the browser and display.
The script downloads the data to a CSV file and uploads it to an S3 bucket. Given the limitation of Fitbit's API that you can only grab the previous 30 days of data, this gives me a way to keep the data in perpetuity.
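The daily run can be sketched as a crontab entry along these lines. This is a hypothetical example, not my exact crontab: the paths and log file name are placeholders, but it shows the key trick of sourcing the env vars (browser and DISPLAY) before the client runs:

```
# Hypothetical crontab entry: pull Fitbit data every day at 06:00.
# sample_env_vars.sh exports DISPLAY and the browser so the cached
# OAuth flow works outside an interactive VNC session.
0 6 * * * . $HOME/fitbit_scripts/sample_env_vars.sh && $HOME/fitbit_scripts/get_health.sh >> $HOME/fitbit_cron.log 2>&1
```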

More Automation Frustrations

There is a setting in the Fitbit application called 'All-Day Sync'. My assumption was that this setting would gradually synchronize my Charge HR with my phone over Bluetooth throughout the day, without requiring me to open the application. This seems to be the case only for heart rate data. Sleep data does not synchronize unless you open the Fitbit application on your phone and let it do its thing, a process that can take several minutes. The likelihood of me remembering to do this every day is slim. Therefore, I run my automation script over the past week's worth of data to make sure I don't miss any dates.
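The weekly backfill amounts to re-requesting each of the last seven dates. A minimal sketch of that date loop, where `fetch_day` stands in for the actual python-fitbit call and is not a real function:

```python
from datetime import date, timedelta

def backfill_dates(days=7, today=None):
    """Return the last `days` dates (most recent first) to re-fetch,
    covering any day whose sleep data synced late."""
    today = today or date.today()
    return [today - timedelta(days=n) for n in range(1, days + 1)]

# Each date would then be passed to the Fitbit client, e.g.:
# for d in backfill_dates():
#     fetch_day(d)  # hypothetical wrapper around the python-fitbit client
```

Re-fetching a date that already synced is harmless here, since each run simply overwrites that day's CSV in S3.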

AWS Glue

The second article, on analyzing the data, performs local analysis using data frames from the Python pandas library. This is probably the most efficient way to analyze the data; pandas data frames are excellent and very intuitive for analytical operations. However, my goal was not efficiency but learning, and I often like to combine different concepts to maximize learning. Glue is one AWS service I have not had much opportunity to use, but I have been interested in it since it was announced at re:Invent a couple of years ago. Therefore, while it is complete overkill, I decided to use Glue so that I could learn the service and build the foundation for other types of data, forming a personal "Health Data Lake" that is available for me to run analytics on.

After my heart rate and sleep data were uploaded to S3, I first needed to add the data's schema to a catalog so that other operations could reference it. At first, all of my data resided in an S3 bucket with three folders:
  • sleep
  • heartrate
  • working
The last category I added manually. It is a simple CSV file that tracks the date and whether I worked on that day. I think it would be interesting to see how working on a given day impacted different biometrics.
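For reference, the working file is just a two-column CSV along these lines; the header names here are illustrative rather than the exact ones in my file:

```
date,worked
2018-05-14,1
2018-05-15,1
2018-05-16,0
```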

Glue has a very simple way of performing this categorization through its 'crawler' functions. AWS defines these as: "A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog." The crawler click-through wizard was intuitive and it categorized my data correctly. I was then able to reference the data through the AWS Glue context.
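For anyone who prefers the API to the wizard, what the wizard sets up corresponds roughly to a boto3 `create_crawler` call like the sketch below. The crawler name, IAM role, database, and bucket are all placeholders I made up, not my actual values:

```python
def crawler_request(bucket="my-fitbit-bucket"):
    """Build the parameters for glue.create_crawler(); a sketch of what
    the console wizard configures. All names here are placeholders."""
    return {
        "Name": "fitbit-crawler",
        "Role": "AWSGlueServiceRole-fitbit",  # IAM role with read access to the bucket
        "DatabaseName": "fitbit_health",      # target Glue Data Catalog database
        "Targets": {"S3Targets": [
            {"Path": f"s3://{bucket}/sleep"},
            {"Path": f"s3://{bucket}/heartrate"},
            {"Path": f"s3://{bucket}/working"},
        ]},
    }

# With boto3 installed and credentials configured, this would be:
# boto3.client("glue").create_crawler(**crawler_request())
```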

From here I was able to create an ETL job that I ran on demand to access the sleep data and do a simple sum on the minutes I slept that night. That script can be found here: https://github.com/ChrisRimondi/fitbit_scripts/blob/master/fitbit_etl_job.py. The script then uploads the data to S3 in another CSV formatted file. It is this file I will continue to build upon by joining disparate data sources and then performing analytical operations on them. 
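Stripped of the Glue machinery, the aggregation in that job is just a per-date sum over minute-level sleep records. A local sketch of the same grouping, assuming rows of (date, minutes) pairs — the actual column names in the Fitbit export may differ:

```python
from collections import defaultdict

def total_sleep_minutes(rows):
    """Sum sleep minutes per date. `rows` is an iterable of
    (date_string, minutes) pairs, mirroring the job's group-by-date sum."""
    totals = defaultdict(int)
    for day, minutes in rows:
        totals[day] += int(minutes)
    return dict(totals)

rows = [("2018-05-18", 90), ("2018-05-18", 330), ("2018-05-19", 410)]
# total_sleep_minutes(rows) -> {"2018-05-18": 420, "2018-05-19": 410}
```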

Note: The above script uses Glue dynamic frames and Spark data frames. Given the very low volume of data, this could easily have been accomplished with AWS Lambda, and in the future I might build some steps with Lambda. However, I chose the heavyweight Spark option over Lambda to learn something new.

Next Steps

As mentioned above, I want to continue to use AWS Glue and build out a variety of ETL jobs to join different data sources. My initial columnar format will look something like this:

date | sleep minutes before | sleep minutes after | worked that day | average heart rate | step count
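The join that produces rows of that shape can be sketched locally as dictionary lookups keyed by date; in Glue it would be a join of dynamic frames on the date column. The column names below are placeholders following the table above, and the per-source dicts stand in for the real data sets:

```python
def build_rows(dates, sleep_before, sleep_after, worked, heart, steps):
    """Join per-date dicts (date -> value) into the target columnar rows.
    Missing values become None, like an outer join on date."""
    return [
        {
            "date": d,
            "sleep_minutes_before": sleep_before.get(d),
            "sleep_minutes_after": sleep_after.get(d),
            "worked_that_day": worked.get(d),
            "average_heart_rate": heart.get(d),
            "step_count": steps.get(d),
        }
        for d in sorted(dates)
    ]
```

Adding a new source (food, exercise, weather) then just means one more dict and one more column.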

Past that, it will be interesting to add what I ate during the day and my exercise, as well as environmental data such as the weather in my location. I will also want to post-process the sleep and heart data to add information such as the longest sleep cycle in minutes and the highest sustained heart rate to the tuple. Once I have a sufficiently sized data set, I could use simple ML algorithms to look for relationships between the columns.

This has been a fun exercise for both learning the Fitbit API and AWS Glue. It sets the foundation for further exploration in this space. Stay tuned for more blog posts on the subject.  

