Diagram of heights of Olympic athletes. There's a big gap between short and tall athletes.

Profiling data with ydata in PySpark

When you’ve got a dataset to explore, there are several ways to do that in PySpark. You can do a describe or a summary. But if you want something a little more advanced that gives you a better view of what is in there, you might want to try data profiling.

Older documentation might point you to Pandas profiling, but this functionality is now part of the Python package ydata-profiling (which is imported as ydata_profiling).

I’ve been following this blog on starting with ydata-profiling:

https://www.databricks.com/blog/2023/04/03/pandas-profiling-now-supports-apache-spark.html
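As a rough sketch of how the built-in statistics and a profiling report sit next to each other: the function names and the `report_cls` parameter below are illustrative assumptions, not something from the blog. Only `describe()`, `summary()`, `ProfileReport` and `to_html()` are real PySpark and ydata-profiling calls.

```python
def quick_stats(spark_df):
    """Print PySpark's built-in statistics for a DataFrame."""
    spark_df.describe().show()  # count, mean, stddev, min, max
    spark_df.summary().show()   # the same, plus the 25%/50%/75% percentiles


def profile_to_html(spark_df, title="Data profile", report_cls=None):
    """Return a profiling report for a DataFrame as an HTML string.

    report_cls is a hypothetical hook for swapping in another report class;
    by default ydata_profiling.ProfileReport is imported when first needed.
    """
    if report_cls is None:
        # The package is installed as ydata-profiling, imported as ydata_profiling.
        from ydata_profiling import ProfileReport as report_cls
    report = report_cls(spark_df, title=title)
    return report.to_html()
```

In a notebook you would call `quick_stats(df)` first and then write or display the string that `profile_to_html(df)` returns.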

Getting ydata-profiling to work is not exactly a walk in the park. You’d think you can just feed it your messy dataset and it will show you what the data is like. But I encountered some problems:

  • I got errors about missing Python packages in some situations.
  • ydata doesn’t seem to like dataframes with only string columns.
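The string-only situation in the last bullet can at least be detected before profiling. This is a minimal sketch, assuming you pass in the list of `(column name, type name)` pairs that a Spark DataFrame exposes as `df.dtypes`; the helper name is made up for illustration.

```python
def only_string_columns(dtypes):
    """Return True when every (name, type) pair describes a string column.

    Expects the list that a Spark DataFrame exposes as df.dtypes,
    for example [("name", "string"), ("age", "int")].
    """
    return len(dtypes) > 0 and all(col_type == "string" for _, col_type in dtypes)
```

Calling `only_string_columns(df.dtypes)` before building the report tells you whether you are in that situation. Whether casting some columns to other types first is enough to make the profiler happy is an assumption worth testing.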
(more…)
A female data engineer frowning when looking at passed DQ checks

My experiences with Azure Purview

At my last customer I worked extensively with Ataccama, a data management product. It has a data catalog to store metadata on datasets, and it can do data quality checks. In Azure, Microsoft has a data management product too. It’s called Purview and I’ve used it in a PoC project. A very short intro into data management There’s more to data management than data catalogs and data quality, but I don’t want to rewrite Read more

Photorealistic image of a lakehouse

Things I learned about Azure Data Fabric

Currently I’m helping colleagues to read open data in Azure Data Fabric. Here are some of my experiences with it. I don’t want to do an extensive description of what Data Fabric is. In short, if you have an organisational Azure account, you can enable Data Fabric. You can then create Fabric workspaces and within workspaces you can create lakehouses for storage, pipelines and notebooks for automation. Lakehouses are like data lakes that act a Read more

A great time at PyCon Ireland 2024

I think it was last year when I announced that I wanted to go back to conferences again. Preferably as a speaker. But what conference is the best for data engineers? I couldn’t quite figure it out. Then the call for papers for PyCon Ireland 2024 came by on my socials and I thought “why not Python?” I do lots of it, even though it’s not always work related. And I’ve never been to Ireland. Read more

Using OCR to get data from my Robi scale

How it started For several years I kept track of my weight and fat with a Soehnle Body Balance, which I bought in 2018. That worked quite well until I saw more and more of these weird deviations. Take a look at the red line (fat percentage) in the graph below: I’ve been training harder in the last 2 years, but according to the fat measurements I gained more fat, not less. And also, after a Read more

An e-ink display showing an amount of 837 euros with a field of tulips as background.

Showing a gift total on a Raspberry Pi with an e-ink display – how hard could it be?

TL;DR:

These Python and Raspberry Pi projects, they are fun, aren’t they? And often they look deceptively simple. But you don’t see all the projects that failed, and usually not where they struggled. This project got stuck (and almost failed) at:

  • Not being able to scrape dynamic website content.
  • When I found out how to do that, I couldn’t run my working Python code on the Raspberry Pi.
  • That turned out to be because the scraping packages use a Chromium browser, and not one built for the ARM processor that the Raspberry Pi has.
  • And to top it all off, the Python package for the Inky Impression e-ink display had some kind of problem running numpy.
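One way to catch the ARM problem early is to check the processor architecture before starting a browser-based scraper. This is only a sketch with a hypothetical helper name; it uses the standard library's `platform.machine()` and does not depend on any particular scraping package.

```python
import platform


def is_arm(machine=None):
    """Return True when the machine name looks like an ARM processor.

    machine defaults to platform.machine(), which reports values such as
    "armv7l" or "aarch64" on a Raspberry Pi and "x86_64" on most laptops.
    """
    name = (machine or platform.machine()).lower()
    return name.startswith(("arm", "aarch64"))
```

A script could call `is_arm()` at startup and print a clear warning instead of failing somewhere deep inside the scraping package.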
(more…)

How I memorise my lines (and other things) with Anki

In my spare time I do stage acting. And there is almost no better feeling than having performed a play really well. But to do so, you need to learn your lines. Preferably you learn your text well in advance, so you also have time to play with it. Memorising things is not something I’m particularly good at. For remembering things I have apps: calendars, Evernote, mail. But that doesn’t work when you’re on stage of Read more

A computer generated image of pipelines at sunset. Because .. future.

Categorising text with ChatGPT. Results may be messy.

I have a hobby project I’m working on. It’s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. And I want to automate this, so that I have more time to actually read the news. What I want is that an article, based on its contents, will be tagged with a couple of keywords. And also that it is placed in one main category. For Read more

A Strava dashboard on a Raspberry Pi (Part 3): The Strava API

This is part 3 of a series of blogposts on how I created a Strava dashboard on an Inky Impression e-ink display with a Raspberry Pi.

OAuth2

This was the part that I expected to be the hard part: getting my data from Strava. Or, to be more precise: getting the connection right so the Strava API would allow me to get that data. It requires authentication via the OAuth2 protocol. I tried a similar thing a few years back with a Google API and I just didn’t get it. But now I do.

Strava API documentation

It requires a whole “dance” between your computer code and the Strava API where you exchange all kinds of tokens back and forth. Strava’s Getting Started with the Strava API document explains it quite well. And this blogpost by Graziano Fuccio helped me a lot with the Python code: http://www.grace-dev.com/python-apis/strava-api/.

Frustratingly, I still didn’t get it to work. The reason, I found out, is that the URL of the authentication endpoint has changed. From https://www.strava.com/oauth/token it became https://www.strava.com/api/v3/oauth/token. I found this elsewhere in the Strava API documentation, where the correct URL was shown. I’ve told Strava that their Getting Started documentation is outdated. They asked me to create a ticket and I’ve done so, but I don’t think they have changed their document yet. Graziano Fuccio did update his blogpost, though.
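To make the token exchange step concrete, here is a minimal sketch that builds the request against the newer URL using only the standard library. The function name is my own, and the form fields (`client_id`, `client_secret`, `code`, `grant_type`) are the ones I understand Strava's authentication documentation to describe, so check them against the current docs. The sketch builds the request but does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The newer endpoint mentioned above.
TOKEN_URL = "https://www.strava.com/api/v3/oauth/token"


def build_token_request(client_id, client_secret, code):
    """Build the POST request that swaps an authorization code for tokens."""
    payload = {
        "client_id": client_id,
        "client_secret": client_secret,
        "code": code,  # the code Strava appends to your redirect URL
        "grant_type": "authorization_code",
    }
    body = urlencode(payload).encode("utf-8")
    return Request(TOKEN_URL, data=body, method="POST")
```

Passing the result to `urllib.request.urlopen` should return a JSON response containing the access and refresh tokens, assuming the client ID, secret and code are valid.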

(more…)

A Strava dashboard on a Raspberry Pi (Part 2): Installing software

In the last blogpost we set up the Raspberry Pi, attached the Inky Impression display and got the Raspberry Pi ready for remote access.

Time to get the Inky Impression software installed and make the Inky Impression screen display something.

Your SSH connection of choice

For this we’re going to have to run some commands via remote SSH. There are multiple ways to log in remotely. You can use a tool like PuTTY or the terminal on macOS (I like iTerm2). That’s actually simpler.

But I chose to use Visual Studio Code because you can edit Python code remotely via SSH straight on the Raspberry Pi.

To do this you must install Visual Studio Code. Visual Studio Code has all kinds of extensions. Here we will install the Remote – SSH extension. And while you’re at it, maybe install the Python extension as well, because we will be writing some Python later.

Installing the Remote-SSH extension in Visual Studio Code
(more…)