R Studio: Doing data science on my health data – Part 1

Up until seven years ago, my doctor would nag me every half year that I should lose some weight. Nagging didn’t work on me that much. What did work, however, was competition. I wanted to become faster in a bike race (actually, the cycling part of a relay triathlon). When I noticed that I could lose weight myself instead of buying an expensive new bike, I set out to do the former and lost about 10 kilograms in the first year. I won’t go into too many details, because I have probably talked about it too often. I’m a bit too proud of this.

This summer I followed an R Studio course on Udemy, and when I finished it I wanted to do a project with R Studio. I had gathered a lot of health data over the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All of this was stored in an Excel sheet. It’s not exactly Big Data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.

How to start

I’ve used R Studio Desktop, which is free and can be downloaded here. As far as I can remember, installation was pretty easy. I can imagine you have better things to do than download my health data, but if you want to try this out, a simple version of it is available. I wouldn’t trust Google or Facebook with it, but I trust you with it.


Preparing the data

Making a CSV file from an Excel sheet was easy enough to do. But after playing around with it in R Studio, I decided to alter it a bit. For example: in my Excel sheet the weight I measured in the morning and the calories burned in the evening were in the same row. To be sure, there was no correlation between them. I looked for a way to shift the exercise data to a day later in R Studio, but could not find anything I could get to work. I had a similar problem with calculating the differences between days. How hard could that be? I resorted to creating extra columns in Excel for that.
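For reference, both steps can probably be done in base R as well. What follows is only a minimal sketch, not what was used for this post: it assumes one row per consecutive day, sorted by date, and it works on a tiny made-up data frame. The helper name shift_down and the kcal_burned values are invented for illustration.

```r
# Made-up example data: one row per consecutive day (an assumption of this sketch).
demo <- data.frame(
  weight      = c(90.3, 90.6, 89.6, 91.5),
  kcal_burned = c(500, 0, 800, 300)
)

# Hypothetical helper: move every value n rows down and pad the start with NA.
shift_down <- function(x, n = 1) c(rep(NA, n), head(x, -n))

# Yesterday's exercise now sits in the same row as this morning's weight.
demo$kcal_burned_prev <- shift_down(demo$kcal_burned)

# Day-to-day difference: diff() returns one value fewer than its input,
# so the first day gets NA.
demo$weight_delta <- c(NA, diff(demo$weight))

print(demo)
```

If days are missing from the data, shifting by row no longer means shifting by one day, so gaps would have to be filled with NA rows first.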


Loading the data in R Studio

Loading a CSV file is one of the basics I learned in the course. It’s as easy as running this command:

weight_data <- read.csv("healthdata_shifted.csv", header=TRUE)

I also learned that it’s a good idea to give names to the columns in R Studio. This makes your R code more readable.

# Name the columns in the order in which they appear in the CSV file
colnames(weight_data) <- c(
  "datumstr", "weight", "bodyfat_nocorr", "weight_delta",
  "BMI", "bodyfat", "bodyfat_delta", "TBW", "muscle", "energy",
  "bone", "walked_km", "biked_km",
  "kcal_burned", "training_load", "pushups",
  "sleep_start", "sleep_quality", "sleep_duration",
  "sleep_minutes", "steps", "stairs", "pulse", "bloodpress_low",
  "bloodpress_high", "commute_hours", "remarks", "food"
)

Now let’s tell R which columns are dates and which are numbers, because otherwise your graphs will look really weird and be unusable.

weight_data$date <- as.Date(weight_data$datumstr, "%d-%h-%y")
weight_data$weight <- as.numeric(as.character(weight_data$weight))
weight_data$weight_delta <- as.numeric(as.character(weight_data$weight_delta))
weight_data$bodyfat_nocorr <- as.numeric(as.character(weight_data$bodyfat_nocorr))
weight_data$bodyfat <- as.numeric(as.character(weight_data$bodyfat))
weight_data$bodyfat_delta <- as.numeric(as.character(weight_data$bodyfat_delta))
weight_data$kcalsburned <- as.numeric(as.character(weight_data$kcal_burned))
weight_data$steps <- as.numeric(as.character(weight_data$steps))
weight_data$sleep_quality <- as.numeric(as.character(weight_data$sleep_quality))
weight_data$sleep_minutes <- as.numeric(as.character(weight_data$sleep_minutes))
weight_data$sleep_hours <- weight_data$sleep_minutes / 60
weight_data$bike_distance <- as.numeric(as.character(weight_data$biked_km))
weight_data$walking_distance <- as.numeric(as.character(weight_data$walked_km))
weight_data$stairs_climbed <- as.numeric(as.character(weight_data$stairs))

There will be some messages on converting the sleep data:

Warning message:
NAs introduced by coercion

I haven’t used that sleep data much yet. I could not get a lot out of it anyway, so I’ve left that for now.
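The repeated conversions above could also be written as a loop over column names. This is a minimal sketch on a made-up data frame, not the code used for this post; the "n/a" and empty entries are invented stand-ins for whatever non-numeric text triggers the warning, and to_number is a hypothetical helper name.

```r
# Made-up stand-in for columns as read.csv might deliver them:
# text or factor columns, because of stray non-numeric entries.
demo <- data.frame(
  weight = c("90.3", "90.6", "n/a"),
  steps  = factor(c("8000", "", "12000")),
  stringsAsFactors = FALSE
)

num_cols <- c("weight", "steps")

# Convert via character first: as.numeric() on a factor would return
# the internal level codes instead of the values.
# suppressWarnings() silences "NAs introduced by coercion" here;
# the NAs are counted explicitly below instead.
to_number <- function(x) suppressWarnings(as.numeric(as.character(x)))
demo[num_cols] <- lapply(demo[num_cols], to_number)

# How many values per column could not be converted?
na_counts <- colSums(is.na(demo[num_cols]))
print(na_counts)
```

Counting the NAs per column this way shows which columns the warning actually comes from.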


Having a first look at the data

R has useful commands to quickly check the data you have loaded: str and summary. str stands for structure (not string).

str(weight_data)

'data.frame':	2141 obs. of  37 variables:
 $ datumstr        : Factor w/ 2136 levels "01-Apr-12","01-Apr-13",..: 1366 1512 1864 2005 2005 1 71 283 422 1053 ...
 $ weight          : num  90.3 90.6 89.6 91.5 90.4 90.3 91 NA 90.2 NA ...
 $ bodyfat_nocorr  : num  22.3 21.3 22.7 21.2 20.3 21.3 21.6 NA 23.8 NA ...
 $ weight_delta    : num  0.7 0.3 -1 1.9 -1.1 -0.1 0.7 NA -0.8 NA ...
 $ BMI             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ bodyfat         : num  22.3 21.3 22.7 21.2 20.3 21.3 21.6 NA 23.8 NA ...
 $ bodyfat_delta   : num  -0.1 -1 1.4 -1.5 -0.9 1 0.3 NA 2.2 NA ...
 $ TBW             : num  NA NA NA NA NA NA NA NA NA NA ...
 $ muscle          : num  NA NA NA NA NA NA NA NA NA NA ...

summary gives a nice statistical overview.

summary(weight_data)

      datumstr        weight      bodyfat_nocorr   weight_delta     
 02-Aug-14:   2   Min.   :80.10   Min.   :12.20   Min.   :-3.70000  
 02-Oct-14:   2   1st Qu.:86.10   1st Qu.:20.00   1st Qu.:-0.50000  
 03-Aug-14:   2   Median :87.00   Median :21.50   Median : 0.00000  
 22-Jan-15:   2   Mean   :86.71   Mean   :22.01   Mean   :-0.00429  
 29-Mar-12:   2   3rd Qu.:87.60   3rd Qu.:24.90   3rd Qu.: 0.50000  
 01-Apr-12:   1   Max.   :93.10   Max.   :28.00   Max.   : 2.30000  
 (Other)  :2130   NA's   :227     NA's   :240     NA's   :229       
      BMI           bodyfat      bodyfat_delta          TBW       
 Min.   :22.30   Min.   :12.20   Min.   :-9.1000   Min.   :46.80  
 1st Qu.:22.60   1st Qu.:19.20   1st Qu.:-0.7000   1st Qu.:48.40  
 Median :22.90   Median :20.30   Median : 0.0000   Median :49.00  
 Mean   :23.15   Mean   :19.95   Mean   :-0.0011   Mean   :50.28  
 3rd Qu.:23.70   3rd Qu.:21.00   3rd Qu.: 0.7000   3rd Qu.:50.70  
 Max.   :24.20   Max.   :23.80   Max.   : 7.9000   Max.   :60.60  
 NA's   :1971    NA's   :240     NA's   :240       NA's   :1179

For example, you can quickly see that my weight has been between 80.1 and 93.1 kilograms, but the 1st and 3rd quartiles show that half of my measurements fell between 86.1 and 87.6 kilograms.
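The quartiles that summary reports can also be requested directly. A minimal sketch, using only the first few weight values visible in the str output above rather than the full data set, so the numbers it prints will differ from the summary:

```r
# First few weight values as shown in the str() output (including the NAs).
weights <- c(90.3, 90.6, 89.6, 91.5, 90.4, 90.3, 91.0, NA, 90.2, NA)

# 1st and 3rd quartile; na.rm = TRUE skips the days without a measurement.
q <- quantile(weights, probs = c(0.25, 0.75), na.rm = TRUE)

# Width of the band between the two quartiles (the interquartile range).
iqr_width <- IQR(weights, na.rm = TRUE)

# Share of the measurements that lies between the two quartiles.
share_inside <- mean(weights >= q[1] & weights <= q[2], na.rm = TRUE)

print(q)
print(iqr_width)
print(share_inside)
```

The band between the quartiles is a more robust description of "typical" weight than the minimum and maximum, which are set by single outlying days.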


Let’s get graphical

For this I use ggplot2. If you haven’t installed this package already, use this command:

install.packages("ggplot2")

To activate it in your code, use this:

library(ggplot2)

So let’s make a graph of my weight data:

weightgraph <- ggplot(data=weight_data, aes(x=date, y=weight))
weightgraph + geom_point()

In the first line I tell ggplot which data to use and which columns go on the x and y axes. The interesting thing about ggplot2 is that graphs are objects and you can add stuff to them. After running the first line you see nothing. Only when we run the second line, which tells it to draw the graph with points, do we get to see it.

My weight data since 2012.

What you see is that my weight has been... ah... seasonal. I lose weight in summer, when I go cycling and am generally outside. Come winter, when the weather isn’t ideal for cycling and when holidays bring all kinds of food, I gain weight. Only this year I managed to stay “light” (for now). The weather was very favorable for cycling this year. I also found a way to stay warm on the racing bike, even when temperatures drop below 10 degrees Celsius. So I’ve been doing longer rides even in November.

Let’s zoom in on the last year of data and let’s draw a line instead of points.

weightgraph + geom_line() +
scale_x_date(limits = c(Sys.Date() - 365, NA))

The noise in my weight data is actually the normal day-to-day variation you can expect.

As you can see, there is a lot of noise. This isn’t the fault of my scale (the weight measuring device). I’ve done several measurements in a row and the measured weight is always the same. There’s just day-to-day variance, apparently.

It would be good to have a smoothed line to go with that. For this you use geom_smooth.

weightgraph + geom_line() +
scale_x_date(limits = c(Sys.Date() - 365, NA)) + geom_smooth(fill=NA)

Adding a smoothed line to my weight data.

Notice the significantly lower weight after the gap in July. That was my cycling holiday in the Pyrenees. I did about 900 km in two weeks. It’s a great way to give your weight a “hard reset”.
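To get a feel for what smoothing does to noisy daily measurements, here is a minimal base R sketch of a trailing moving average. Note that this is a simpler technique than the curve geom_smooth fits, and it is not the code behind the graph above. The function name rolling_mean, the window size and the daily values are all made up for illustration.

```r
# Hypothetical helper: mean of the current and the previous k - 1 values.
# The first k - 1 results are NA, and so is any window that contains an NA.
rolling_mean <- function(x, k = 7) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 1))
}

# Made-up daily weights with the kind of day-to-day noise described above.
daily <- c(87.2, 86.5, 87.0, 86.1, 86.8, 87.4, 86.3, 86.9, 86.0, 86.6)

# A 3-day window keeps this tiny example readable; a week would be
# a more natural choice on real data.
smoothed <- rolling_mean(daily, k = 3)

print(round(smoothed, 2))
```

The result could be stored as an extra column and drawn with a second geom_line, at the cost of losing the first few days of each stretch of data.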

Up to now I haven’t done much that can’t be done in Excel yet. In part 2 we’ll have a look at body fat percentages and the influence of different measurement devices.

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.