Up to seven years ago my doctor would nag me every half year that I should lose some weight. Nagging didn’t work on me that much. What did work, however, was competition. I wanted to become faster in a bike race (actually, the cycling part of a relay triathlon). When I realized that I could lose weight myself instead of buying an expensive new bike, I set out to do the former, and I lost about 10 kilograms in the first year. I won’t go into too many details, because I have probably talked about it too often. I’m a bit too proud of this.
This summer I followed an R Studio course on Udemy, and when I finished it I was thinking of doing a project with R Studio. I have gathered a lot of health data over the last 6 years: weight, body fat, data from my heart rate monitor, a step count from my iPhone and sleep data. All this was stored in an Excel sheet. It’s not exactly big data, but it has a couple of thousand rows now. Surely there’s a way to get some meaning out of it. I’ve already done a video about this, but it isn’t easy to copy commands from videos. And I’ve tried out some other stuff in this series.
How to start
I’ve used R Studio Desktop, which is free and can be downloaded here. As far as I can remember, installation was pretty easy. I can imagine you have better things to do than download my health data, but if you want to try this out, a simple version of it is available. I wouldn’t trust Google or Facebook with it, but I trust you with it.
Preparing the data
Making a csv file from an Excel sheet was easy enough to do. But after playing around with it in R Studio, I decided to alter it a bit. For example: in my Excel sheet the weight I measured in the morning and the calories burned in the evening were in the same row. To be sure, there was no correlation between them. I’ve looked for a way to shift the exercise data in R Studio to a day later, but could not find anything I could get to work. I had a similar problem with calculating differences between days. How hard could that be? I resorted to creating extra columns in Excel for that.
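For reference, both operations can be sketched in base R. This is only a minimal sketch, assuming one row per consecutive day in date order; the toy data frame and the kcal_prev_day column name are made up for illustration and are not part of the real sheet.

```r
# Toy data: one row per consecutive day (values are made up for illustration)
toy <- data.frame(
  weight      = c(90.3, 90.6, 89.6, 91.5),
  kcal_burned = c(500, 0, 800, 300)
)

# Shift the exercise data one day later: each row gets the previous day's value.
# head(x, -1) drops the last element; the leading NA fills the first day.
toy$kcal_prev_day <- c(NA, head(toy$kcal_burned, -1))

# Day-to-day difference; the first day has no predecessor, so it gets NA
toy$weight_delta <- c(NA, diff(toy$weight))

print(toy)
```

If days are missing from the data, the rows would first have to be completed to one per day, otherwise "previous row" and "previous day" are not the same thing.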
Loading the data in R Studio
Loading a csv file is one of the basics I’ve learned in the course. It’s as easy as running these two commands:
setwd("D:\\bestanden\\R")
weight_data <- read.csv("healthdata_shifted.csv", header=TRUE)
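As a variation, the file can be read by its full path instead of changing the working directory first. The sketch below is self-contained: it writes a tiny made-up csv to a temporary file (a stand-in for healthdata_shifted.csv) and reads it back. Passing stringsAsFactors = FALSE keeps text columns as plain character vectors; that is the default from R 4.0.0 onwards, while older versions turn text into factors.

```r
# Write a tiny made-up csv to a temporary location (stand-in for the real file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,weight",
             "01-Apr-12,90.3",
             "02-Apr-12,90.6"), csv_path)

# Read by full path, so no setwd() is needed
toy <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Text stays character, numbers are detected as numeric
str(toy)
```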
What I’ve also learned, was that it’s a good idea to give names to the columns in R Studio. This makes your R code more readable.
colnames(weight_data) <- c("datumstr", "weight", "bodyfat_nocorr", "weight_delta", "BMI", "bodyfat", "bodyfat_delta", "TBW", "muscle", "energy", "bone", "walked_km", "biked_km", "kcal_burned", "training_load", "pushups", "sleep_start", "sleep_quality", "sleep_duration", "sleep_minutes","steps", "stairs" , "pulse", "bloodpress_low", "bloodpress_high", "commute_hours", "remarks", "food")
Now let’s tell R which columns are dates and which are numbers, because otherwise your graphs will look really weird and be unusable.
weight_data$date <- as.Date(weight_data$datumstr, "%d-%h-%y")
weight_data$weight <- as.numeric(as.character(weight_data$weight))
weight_data$weight_delta <- as.numeric(as.character(weight_data$weight_delta))
weight_data$bodyfat_nocorr <- as.numeric(as.character(weight_data$bodyfat_nocorr))
weight_data$bodyfat <- as.numeric(as.character(weight_data$bodyfat))
weight_data$bodyfat_delta <- as.numeric(as.character(weight_data$bodyfat_delta))
weight_data$kcalsburned <- as.numeric(as.character(weight_data$kcal_burned))
weight_data$steps <- as.numeric(as.character(weight_data$steps))
weight_data$sleep_quality <- as.numeric(as.character(weight_data$sleep_quality))
weight_data$sleep_minutes <- as.numeric(as.character(weight_data$sleep_minutes))
weight_data$sleep_hours <- as.numeric(as.character(weight_data$sleep_minutes/60))
weight_data$bike_distance <- as.numeric(as.character(weight_data$biked_km))
weight_data$walking_distance <- as.numeric(as.character(weight_data$walked_km))
weight_data$stairs_climbed <- as.numeric(as.character(weight_data$stairs))
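The conversions above repeat the same pattern, so they could probably be shortened with a loop over column names. The following is a minimal sketch on a made-up three-row data frame, not the real data. It also spells out the date format: %d is the day, %b the abbreviated month name (%h means the same) and %y the two-digit year. Month names are matched in the current locale, so "Apr" assumes English month names.

```r
# Toy stand-in for the imported data; every column arrives as text (here: factors)
toy <- data.frame(
  datumstr = c("01-Apr-12", "02-Apr-12", "03-Apr-12"),
  weight   = c("90.3", "90.6", "89.6"),
  steps    = c("8000", "12000", "9500"),
  stringsAsFactors = TRUE
)

# Parse the date strings; the result is a proper Date column
toy$date <- as.Date(as.character(toy$datumstr), format = "%d-%b-%y")

# Convert several columns in one go. as.character() comes first because
# as.numeric() on a factor would return the internal level codes.
num_cols <- c("weight", "steps")
toy[num_cols] <- lapply(toy[num_cols],
                        function(col) as.numeric(as.character(col)))

str(toy)
```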
There will be some messages on converting the sleep data:
Warning message: NAs introduced by coercion
I haven’t used that sleep data much yet. I could not get a lot out of it anyway, so I’ve left that for now.
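The warning means that as.numeric met values it could not read as numbers and replaced them with NA. A small sketch of how the offending values could be tracked down; the raw values here are made up, and what the real sleep columns contain may well be different.

```r
# Made-up raw values; "7:30" is a time string, not a plain number
raw <- c("7:30", "450", "", "480")

# Convert, silencing the warning we are about to investigate
converted <- suppressWarnings(as.numeric(raw))

# Values that were not empty before but are NA afterwards caused the warning
bad <- raw[is.na(converted) & raw != ""]
print(bad)
```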
Having a first look at the data
R has useful commands to quickly check the data you have loaded: str and summary. str stands for structure (not string).
str(weight_data)
'data.frame': 2141 obs. of 37 variables:
 $ datumstr       : Factor w/ 2136 levels "01-Apr-12","01-Apr-13",..: 1366 1512 1864 2005 2005 1 71 283 422 1053 ...
 $ weight         : num 90.3 90.6 89.6 91.5 90.4 90.3 91 NA 90.2 NA ...
 $ bodyfat_nocorr : num 22.3 21.3 22.7 21.2 20.3 21.3 21.6 NA 23.8 NA ...
 $ weight_delta   : num 0.7 0.3 -1 1.9 -1.1 -0.1 0.7 NA -0.8 NA ...
 $ BMI            : num NA NA NA NA NA NA NA NA NA NA ...
 $ bodyfat        : num 22.3 21.3 22.7 21.2 20.3 21.3 21.6 NA 23.8 NA ...
 $ bodyfat_delta  : num -0.1 -1 1.4 -1.5 -0.9 1 0.3 NA 2.2 NA ...
 $ TBW            : num NA NA NA NA NA NA NA NA NA NA ...
 $ muscle         : num NA NA NA NA NA NA NA NA NA NA ...
summary gives a nice statistical overview.
summary(weight_data)
      datumstr        weight      bodyfat_nocorr   weight_delta
 02-Aug-14:   2   Min.   :80.10   Min.   :12.20   Min.   :-3.70000
 02-Oct-14:   2   1st Qu.:86.10   1st Qu.:20.00   1st Qu.:-0.50000
 03-Aug-14:   2   Median :87.00   Median :21.50   Median : 0.00000
 22-Jan-15:   2   Mean   :86.71   Mean   :22.01   Mean   :-0.00429
 29-Mar-12:   2   3rd Qu.:87.60   3rd Qu.:24.90   3rd Qu.: 0.50000
 01-Apr-12:   1   Max.   :93.10   Max.   :28.00   Max.   : 2.30000
 (Other)  :2130   NA's   :227     NA's   :240     NA's   :229
      BMI           bodyfat      bodyfat_delta         TBW
 Min.   :22.30   Min.   :12.20   Min.   :-9.1000   Min.   :46.80
 1st Qu.:22.60   1st Qu.:19.20   1st Qu.:-0.7000   1st Qu.:48.40
 Median :22.90   Median :20.30   Median : 0.0000   Median :49.00
 Mean   :23.15   Mean   :19.95   Mean   :-0.0011   Mean   :50.28
 3rd Qu.:23.70   3rd Qu.:21.00   3rd Qu.: 0.7000   3rd Qu.:50.70
 Max.   :24.20   Max.   :23.80   Max.   : 7.9000   Max.   :60.60
 NA's   :1971    NA's   :240     NA's   :240       NA's   :1179
For example, you can quickly see that my weight has been between 80.1 and 93.1 kilograms, but the 1st and 3rd quartiles show that half of my measurements fall between 86.1 and 87.6 kilograms.
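The same numbers can also be pulled out one at a time, which is handy when you want to use them in further calculations. A minimal sketch with a handful of made-up weights (not my real measurements); na.rm = TRUE tells R to skip the missing values.

```r
# Made-up weights in kilograms, including one missing measurement
w <- c(80.1, 86.1, 86.5, 87.0, 87.3, 87.6, 93.1, NA)

# Lowest and highest value
w_range <- range(w, na.rm = TRUE)

# 1st and 3rd quartile: the middle half of the values lies between them
w_quartiles <- quantile(w, probs = c(0.25, 0.75), na.rm = TRUE)

print(w_range)
print(w_quartiles)
```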
Let’s get graphical
For this I use ggplot2. If you haven’t installed this package already, use this command:
install.packages("ggplot2")
To activate it in your code, use this:
library(ggplot2)
So let’s make a graph of my weight data:
weightgraph <- ggplot(data=weight_data, aes(x=date, y=weight))
weightgraph + geom_point()
In the first line I tell ggplot what data to use and which columns go on the x and y axes. The interesting thing about ggplot2 is that graphs are objects and you can add stuff to them. After running the first line you see nothing. Only with the second line, where we tell it to draw a graph with points, do we get to see it.
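To make the "graphs are objects" idea a bit more concrete, here is a small sketch on made-up data. The names toy, base_plot, points_plot and both_plot are only for illustration.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  date   = as.Date("2017-01-01") + 0:4,
  weight = c(87.2, 86.9, 87.4, 86.8, 87.0)
)

# Assigning the plot to a variable draws nothing yet
base_plot <- ggplot(data = toy, aes(x = date, y = weight))

# Adding a layer returns a new plot object; base_plot itself is unchanged
points_plot <- base_plot + geom_point()
both_plot   <- points_plot + geom_line()

# Printing (explicitly, or implicitly at the console) renders the graph
print(both_plot)
```

Because each step is stored in its own variable, the same base can be reused for several different graphs.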
What you see is that my weight has been ... ah ... seasonal. I lose weight in summer, when I go cycling and am generally outside. Come winter, when the weather isn’t ideal for cycling and when holidays bring all kinds of food, I gain weight. Only this year I managed to stay “light” (for now). The weather was very favorable for cycling this year. I also found a way to stay warm on the racing bike, even when temperatures drop below 10 degrees Celsius. So I’ve been doing longer rides even in November.
Let’s zoom in on the last year of data and let’s draw a line instead of points.
weightgraph + geom_line() + scale_x_date(limits = c(Sys.Date() - 365, NA))
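One thing to keep in mind: Sys.Date() - 365 counts back from today, so if you run this long after the last measurement, the window may contain little or no data. A possible variation, sketched on made-up data, is to count back from the last date in the data instead. The names toy, start and zoomed are hypothetical.

```r
library(ggplot2)

# Two years of made-up daily weights
toy <- data.frame(
  date   = as.Date("2017-01-01") + 0:729,
  weight = 87 + sin((0:729) / 58)
)

# Start of the window: one year before the last measurement, not before today
start <- max(toy$date) - 365

# NA as upper limit means "keep the default", i.e. the end of the data.
# Rows outside the limits are dropped from the plot, usually with a warning.
zoomed <- ggplot(toy, aes(x = date, y = weight)) +
  geom_line() +
  scale_x_date(limits = c(start, NA))

print(zoomed)
```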
As you can see, there is a lot of noise. This isn’t the fault of my scale (weight measuring device). I’ve done several measurements in a row and the measured weight is always the same. There’s just day-to-day variance, apparently.
It would be good to have a smoothed line to go with that. For this you use geom_smooth.
weightgraph + geom_line() + scale_x_date(limits = c(Sys.Date() - 365, NA)) + geom_smooth(fill=NA)
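In the command above, fill=NA hides the grey confidence band around the smoothed line. A variation that should give a similar picture is se = FALSE, which does not compute the band at all. The sketch below uses made-up data and also sets the smoothing method and span explicitly. The span value of 0.3 is just a guess to play with: smaller values follow the data more closely, larger values give a smoother line.

```r
library(ggplot2)

# Made-up noisy daily weights
set.seed(1)
toy <- data.frame(
  date   = as.Date("2017-01-01") + 0:199,
  weight = 87 + cumsum(rnorm(200, sd = 0.3))
)

smoothed <- ggplot(toy, aes(x = date, y = weight)) +
  geom_line() +
  # loess smoother without confidence band; span controls the wiggliness
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, span = 0.3)

print(smoothed)
```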
Notice the significantly lower weight after the gap in July. That was my cycling holiday in the Pyrenees. I did about 900 km in two weeks. It’s a great way to give your weight a “hard reset”.
Up to now I haven’t done much that can’t be done in Excel yet. In part 2 we’ll have a look at body fat percentages and the influence of different measurement devices.