For a few years I’ve been gathering data on my workouts. In Excel. It’s not exactly state of the art data architecture, but it was fine for a while. But data alone doesn’t do much. I wanted some questions answered.
Lately I’ve been hearing a lot about Apache Superset. (Well, I’ve been hearing lately about lots of products actually. It’s hard to choose one product to spend a lot of time on.) Apache Superset is open source data visualization software. I decided to give it a try for this particular problem.
Apart from the installation I demo most things in this video:
Starting on your laptop
If there’s anything I’ve learned recently working with Docker Desktop, it’s that it is often very easy to get a working environment of most open source data products in one or more Docker containers. Usually they have an image or docker-compose on their site somewhere to get you started. Same for Superset.
All you need to get started is Git and Docker Desktop (available for Windows, Mac and Linux). I use Git for Windows. To build your Dockerized Superset environment, just follow the instructions on the Superset documentation site:
This will start 6 Docker containers. One of them (named superset_db) runs a PostgreSQL 10 database, which contains the sample data, but also can be used to upload your CSV data. Another, superset_init, will only be used for installation and won’t run anymore after that. That is fine.
After the installation is done, go to http://localhost:8088/login/ or, in Docker Desktop, click here:
You can log in here with username admin, password admin. Now the Superset welcome screen will show. You can have a look at the example charts and dashboards.
Uploading my own dataset
Superset allows you to upload a CSV file. It will do this in the example (PostgreSQL) database. But first you need to edit the settings of the example database to allow uploads. Go to Data, Databases and click on edit:
Go to Advanced, Security and check Allow data upload. All this is explained in above video.
How to get a proper datetime column
I also explain also all the settings you can do to upload the CSV file. But there was a problem. PostgreSQL didn’t recognise the date and time format. Despite the “Infer Datetime Format” setting I had enabled.
An example of my date and time data was this: 2021-10-21 17:52:00
Because PostgreSQL didn’t recognise this as a datetime, Superset didn’t allow more advanced time-related features. For example, when I used a Time-series Bar Chart v2 or Time-series Area Chart v2, and I chose a Time Grain: week or year, it would come up with an “Unexpected error”:
Error: function date_trunc(unknown, text) does not exist LINE 1: SELECT DATE_TRUNC('year', "Datum") AS __timestamp, ^ HINT: No function matches the given name and argument types. You might need to add explicit type casts.
To see the problem, go to your dataset, More dataset options and Edit dataset:
In my case my Datum column was of the type “TEXT”, not “DATETIME”. It doesn’t matter that I chose this to be temporal data. You will run into this Unexpected error.
This is more of a PostgreSQL problem. To solve it, let’s go into the superset_db container and change this datatype.
You need to solve this on the command prompt. In Windows I’ve used Powershell for this.
Log in in the superset_db container with this command:
docker exec -it superset_db bash
Connect to the PostgreSQL database:
psql -h 127.0.0.1 -p 5432 -U superset
Check if you can access your table from here. In my case the table was called Workouts2. Because the name is case sensitive, I had to use double quotes around the name.
select * from "Workouts2";
This is how you change the data type of your date column. In my case it was called “Datum”. Here I create a temporary column called temp_date with the DATETIME data type. I then load the data from the “Datum” column:
ALTER TABLE "Workouts2" ADD COLUMN temp_date TIMESTAMP without time zone NULL; UPDATE "Workouts2" SET temp_date = "Datum"::TIMESTAMP; ALTER TABLE "Workouts2" ALTER COLUMN "Datum" TYPE TIMESTAMP without time zone USING temp_date; ALTER TABLE "Workouts2" DROP COLUMN temp_date;
After this small operation, you will have the same old “Datum” column, but now with the DATETIME data type. And Superset will notice this too:
Now you can really start using the fun time related features in Superset, which I demoed in the video.
My Superset experience
Once I had my datetime column set up, things really took off. Superset became really fun to use. I really was able to gain insights quickly. Superset is indeed very powerful. If possible I certainly will use it in the future.
My Superset dashboard looks awesome, but for every update to my Excel sheet, I need to upload it in Superset. That is not something I’m looking forward doing in the future.
In fact, I don’t like to enter my health data in Excel at all. Here is how I would like my data architecture to look like:
- I want an app on my iPad or iPhone to enter my weight data. (Now I use Evernote and copy that to Excel). The app uploads this data directly in a central database in my home. Possibly on a Raspberry Pi.
- I have a pipeline that retrieves new workout data from Strava and Polar and enters this data in my central database.
- On this central database I run Superset, where my dashboard is always up to date.
This will require quite some work and time. I have no idea how to create an app that works on my iPad or iPhone. I hope it can be done in Python, but even creating an app in Python is new ground for me.
If I can create this, I will share it on this blog. But don’t hold your breath just now.