Saturday, May 19, 2018

AWS Glue, Fitbit and a "Health Data Lake" - part 1

A couple of years ago I got a Fitbit Charge HR device, and I have worn it off and on since then. It has been mildly entertaining to observe my sleep patterns, track my steps, and benchmark my resting heart rate. As expected, consumers get their data from Fitbit in a formatted fashion, presented in tables, charts, and other visualizations. I began to wonder whether the data behind those visualizations was accessible to consumers in its raw form.

As it turns out, I was not the only person to wonder about this. After some googling I stumbled across two relevant articles by Stephen Hsu.

The first article explains how he collected the data and the second describes what he learned from the data. These two articles really kick-started my understanding of what the Fitbit API is capable of providing. Much thanks to Stephen for sharing these!

OAuth hack job

Fitbit provides an API where developers can query up to 30 days' worth of historical data. Unfortunately, there is no standard API key option for machine-to-machine access; it uses OAuth 2.0 for authorization. My goal was, and is, to pull data regularly in an automated fashion. The lack of an API key option was going to make this a total hack job. Thankfully, I work in Operations. Duct tape and baling wire solutions are my specialty.

A custom Python Fitbit client repo was linked in the first article: https://github.com/orcasgit/python-fitbit. It uses 'cherrypy' to spin up a local web server that your browser is redirected to in order to cache OAuth credentials. Here is the sequence of steps I took to automate this:
  1. Create an EC2 t2.micro Ubuntu 16.04 instance
  2. Install xfce for the UI using the AWS guidelines here: https://aws.amazon.com/premiumsupport/knowledge-center/connect-to-ubuntu-1604-windows/
  3. Use VNC to log into the UI of the instance
  4. Install Firefox as the default browser
  5. Create a simple bash script that will execute my implementation of the Python client
    1. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/get_health.sh
    2. https://github.com/ChrisRimondi/fitbit_scripts/blob/master/sample_env_vars.sh
  6. Run through the script once while logged into the UI. Log into the Fitbit API developer portal when prompted by the script. (I realize this isn't permanent, but it buys me some automation.)
  7. Create a cron job to run the script daily so it retrieves my heart and sleep data without me needing to be in front of the machine. 
    1. You will see in the sample env vars linked above that I had to source the browser and display.
The script downloads the data to a CSV file and uploads it to an S3 bucket. Given the limitation of Fitbit's API that you can only grab the previous 30 days of data, this gives me a way to keep the data in perpetuity.
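The daily run can be sketched as a crontab entry along these lines. This is a hypothetical example, not my exact crontab: the paths and log file name are placeholders, but it shows the key trick of sourcing the env vars (browser and DISPLAY) before the client runs:

```
# Hypothetical crontab entry: pull Fitbit data every day at 06:00.
# sample_env_vars.sh exports DISPLAY and the browser so the cached
# OAuth flow works outside an interactive VNC session.
0 6 * * * . $HOME/fitbit_scripts/sample_env_vars.sh && $HOME/fitbit_scripts/get_health.sh >> $HOME/fitbit_cron.log 2>&1
```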

More Automation Frustrations

There is a setting in the Fitbit application called 'All-Day Sync'. My assumption was that this setting would gradually synchronize my Charge HR with my phone over Bluetooth throughout the day, without requiring me to open the application. This seems to be the case only for heart rate data. Sleep data does not synchronize unless you open the Fitbit application on your phone and let it do its thing, a process that can take several minutes. The likelihood of me remembering to do this every day is slim. Therefore, I run my automation script over the past week's worth of data to make sure I don't miss any dates.
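The weekly backfill amounts to re-requesting each of the last seven dates. A minimal sketch of that date loop, where `fetch_day` stands in for the actual python-fitbit call and is not a real function:

```python
from datetime import date, timedelta

def backfill_dates(days=7, today=None):
    """Return the last `days` dates (most recent first) to re-fetch,
    covering any day whose sleep data synced late."""
    today = today or date.today()
    return [today - timedelta(days=n) for n in range(1, days + 1)]

# Each date would then be passed to the Fitbit client, e.g.:
# for d in backfill_dates():
#     fetch_day(d)  # hypothetical wrapper around the python-fitbit client
```

Re-fetching a date that already synced is harmless here, since each run simply overwrites that day's CSV in S3.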

AWS Glue

The second article, on analyzing the data, performs local analysis using data frames from the Python pandas library. This is probably the most efficient way to analyze the data; pandas data frames are excellent and very intuitive for analytical operations. However, my goal was not efficiency but learning, and I often like to combine different concepts to maximize learning. Glue is one AWS service I have not had much opportunity to use, but I have been interested in it since it was announced at re:Invent a couple of years ago. Therefore, while it is complete overkill, I decided to use Glue so that I could learn the service and build the foundation for other types of data, forming a personal "Health Data Lake" that is available for me to run analytics on.

After my heart rate and sleep data were uploaded to S3, I first needed to add the data's schema to a catalog so that other operations could reference it. At first, all of my data resided in an S3 bucket with three folders:
  • sleep
  • heartrate
  • working
The last category I added manually. It is a simple CSV file that tracks the date and whether I worked on that day. I think it would be interesting to see how working on a given day impacted different biometrics.
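For reference, the working file is just a two-column CSV along these lines; the header names here are illustrative rather than the exact ones in my file:

```
date,worked
2018-05-14,1
2018-05-15,1
2018-05-16,0
```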

Glue has a very simple way of performing this categorization through its 'crawler' functions. AWS defines these as: "A crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your data catalog." The crawler click-through wizard was intuitive and it categorized my data correctly. I was then able to reference the data through the AWS Glue context.
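For anyone who prefers the API to the wizard, what the wizard sets up corresponds roughly to a boto3 `create_crawler` call like the sketch below. The crawler name, IAM role, database, and bucket are all placeholders I made up, not my actual values:

```python
def crawler_request(bucket="my-fitbit-bucket"):
    """Build the parameters for glue.create_crawler(); a sketch of what
    the console wizard configures. All names here are placeholders."""
    return {
        "Name": "fitbit-crawler",
        "Role": "AWSGlueServiceRole-fitbit",  # IAM role with read access to the bucket
        "DatabaseName": "fitbit_health",      # target Glue Data Catalog database
        "Targets": {"S3Targets": [
            {"Path": f"s3://{bucket}/sleep"},
            {"Path": f"s3://{bucket}/heartrate"},
            {"Path": f"s3://{bucket}/working"},
        ]},
    }

# With boto3 installed and credentials configured, this would be:
# boto3.client("glue").create_crawler(**crawler_request())
```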

From here I was able to create an ETL job that I ran on demand to access the sleep data and do a simple sum on the minutes I slept that night. That script can be found here: https://github.com/ChrisRimondi/fitbit_scripts/blob/master/fitbit_etl_job.py. The script then uploads the data to S3 in another CSV formatted file. It is this file I will continue to build upon by joining disparate data sources and then performing analytical operations on them. 
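Stripped of the Glue machinery, the aggregation in that job is just a per-date sum over minute-level sleep records. A local sketch of the same grouping, assuming rows of (date, minutes) pairs — the actual column names in the Fitbit export may differ:

```python
from collections import defaultdict

def total_sleep_minutes(rows):
    """Sum sleep minutes per date. `rows` is an iterable of
    (date_string, minutes) pairs, mirroring the job's group-by-date sum."""
    totals = defaultdict(int)
    for day, minutes in rows:
        totals[day] += int(minutes)
    return dict(totals)

rows = [("2018-05-18", 90), ("2018-05-18", 330), ("2018-05-19", 410)]
# total_sleep_minutes(rows) -> {"2018-05-18": 420, "2018-05-19": 410}
```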

Note: The above script uses Glue dynamic frames and Spark data frames. Given the very low volume of data, this could easily have been accomplished with AWS Lambda, and in the future I might build some steps with Lambda. However, I chose the heavyweight Spark option over Lambda to learn something new.

Next Steps

As mentioned above, I want to continue to use AWS Glue and build out a variety of ETL jobs to join different data sources. My initial columnar format will look something like this:

date | sleep minutes before | sleep minutes after | worked that day | average heart rate | step count
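The join that produces rows of that shape can be sketched locally as dictionary lookups keyed by date; in Glue it would be a join of dynamic frames on the date column. The column names below are placeholders following the table above, and the per-source dicts stand in for the real data sets:

```python
def build_rows(dates, sleep_before, sleep_after, worked, heart, steps):
    """Join per-date dicts (date -> value) into the target columnar rows.
    Missing values become None, like an outer join on date."""
    return [
        {
            "date": d,
            "sleep_minutes_before": sleep_before.get(d),
            "sleep_minutes_after": sleep_after.get(d),
            "worked_that_day": worked.get(d),
            "average_heart_rate": heart.get(d),
            "step_count": steps.get(d),
        }
        for d in sorted(dates)
    ]
```

Adding a new source (food, exercise, weather) then just means one more dict and one more column.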

Past that, it will be interesting to add what I ate during the day and my exercise, as well as environmental data such as the weather in my location. I will also want to post-process the sleep and heart data to add information such as the longest sleep cycle in minutes and the highest sustained heart rate to the tuple. Once I have a sufficiently sized data set, I could use simple ML algorithms to look for relationships between the columns.

This has been a fun exercise for both learning the Fitbit API and AWS Glue. It sets the foundation for further exploration in this space. Stay tuned for more blog posts on the subject.  

