A great time at PyCon Ireland 2024

I think it was last year when I announced that I wanted to go back to conferences again. Preferably as a speaker. But what conference is the best for data engineers? I couldn’t quite figure it out. Then the call for papers for PyCon Ireland 2024 came by on my socials and I thought “why not Python?” I do lots of it, even though it’s not always work related. And I’ve never been to Ireland. So I submitted two sessions. One got selected right away. I booked my flight and hotel and off I went last Friday (November 15, 2024).

Day 1

Let me first of all say that I found the quality of the presentations very good. They were interesting and I was able to follow the topics quite well.

Jaroslav Bezděk started off with a talk about pandas and DuckDB. I know pandas, but I wanted to know more about the second one. Jaroslav’s talk confirmed for me what I already suspected: it is not that hard to start with DuckDB. Nice to see some examples. Certainly a tool I want to try out. I found his slides on Github: https://github.com/jardabezdek/talk-zoology-101.

Then it was my turn to talk about e-ink displays. I’ve grown fond of these devices, combined with Raspberry Pi’s. So I discussed 4 ways that I’ve used them. The slides can be found on my Github page: https://github.com/Marcel-Jan/talk-eink-dashboards.

It was so much fun to be a speaker at a conference again and the audience was very welcoming. Many people talked to me after this and my other presentation and it really makes me want to do this more often.

Mihai Creveti talked about what AI agents can and cannot do. I’ve got a clearer picture now what agents are good for. He also gave examples of tools that are used for this currently. There’s so much in the AI landscape nowadays. It’s good to know where the field is going.

The talk about testable pipelines by Florian Stefan was of great interest to me. He showed how he uses dbt (data build tool) with dbt_expectations for testing pipelines. dbt_expectations goes beyond basic tests. It can also check if quantiles of column values fall within an expected range. So that goes further than just checking whether they have expected values.
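To give an idea of what such a quantile check does, here is a minimal sketch in plain Python. It is not dbt or dbt_expectations code and not what Florian showed; the function name, the sample values and the bounds are made up for illustration.

```python
import statistics


def quantile_within(values, quantile, low, high):
    """Return True if the given quantile (0.01-0.99) of values lies in [low, high]."""
    # With n=100, statistics.quantiles returns the 99 percentile cut points.
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    index = round(quantile * 100) - 1
    return low <= cut_points[index] <= high


# Hypothetical order amounts with one extreme outlier.
order_amounts = [12, 15, 14, 13, 16, 15, 14, 500]

# The median still looks plausible, so a median check passes.
print(quantile_within(order_amounts, 0.5, 10, 20))
# The 95th percentile is pulled up by the outlier, so this check fails.
print(quantile_within(order_amounts, 0.95, 10, 20))
```

A plain "are all values between 10 and 20" test would fail on a single odd row. A quantile check lets you say how much of the data has to behave.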

Mark Smith from MongoDB demoed an AI agent that can send real world text messages with excuses why he could not make it to the office. And by giving that agent a memory, you can make sure it won’t send the same excuses twice. Weird use case, but otherwise a clear application.

Like me, Cosmin Marian Paduraru has used Python to solve a personal use case. He wanted to know if he could use visual intelligence to identify new items for his collection of bottle caps and avoid duplicates. He showed what technology he used and what obstacles he encountered. And he showed you don’t always need the latest and largest algorithm for this kind of work. And not a bad try for his first presentation at a conference ever.

James Shields from Bank of America talked about how to get a culture of innovation at your company. I speak from experience that getting a culture of innovation can be hard. It’s hard to get the time, get everyone involved, including management. And even if everyone is willing to innovate, it doesn’t always happen. At Bank of America they use hackathons. And even that is not everyone’s cup of tea. But still they are making great progress.

The Github ecosystem does not only support DevOps but it can also support DevSecOps. That’s what the talk by Eoin and Tom Halpin was about. They showed with a down to earth example how they use Github Actions and Workflows to not only do automated testing, but also do vulnerability scans. You can find their repo here: https://github.com/genai-musings/chatting-with-ChatGPT. I finally understand what these badges are for on the Github page. Certainly something I want to try out BTW.

Next I went to Paul Minogue’s presentation about vector databases. He discussed what vector databases are good for and shared his research on this matter. There were some surprises for me. For example that OpenSearch can be applied as a good vector data store. He also shared the challenges he encountered. I already knew about embeddings, but I’ve learned a lot about the ways you can search through embeddings if there are a lot of them and performance is not good enough.

Florenz Hollebrandse discussed the modular approach they use at JPMorganChase to make sure that when choosing solutions they don’t paint themselves into a corner. They decouple business logic from platform/deployment concerns. This way they are able to reuse more generic software. JPMorganChase open sourced a solution for this, which you can find on Github: https://github.com/jpmorganchase/inference-server.

Then it was my turn again. I was asked if I could prepare my other submission as a backup presentation. So I finished my presentation on how I used Python to prepare for my astronomy podcast the evening before, in my hotel room. I had a lot to share about how I use ChatGPT to categorise astronomy news articles, how I use embeddings to find similar articles that don’t need categorising anymore (reduces the bill). And flattening the embeddings to 2D or 3D allows you to make nice graphs. My slides can be found here: https://github.com/Marcel-Jan/talk-python-astropodcast
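The idea of reusing a category when a new article is close enough to a known one can be sketched like this. This is a simplified illustration, not the code from my slides: the 3-dimensional vectors stand in for real embeddings, and the threshold of 0.9 is an assumption you would have to tune.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_similar(new_embedding, known, threshold=0.9):
    """Return the category of the most similar known article, or None."""
    best_category, best_score = None, threshold
    for embedding, category in known:
        score = cosine_similarity(new_embedding, embedding)
        if score >= best_score:
            best_category, best_score = category, score
    return best_category


# Toy "embeddings"; real ones have hundreds of dimensions or more.
known_articles = [
    ([0.9, 0.1, 0.0], "Mars"),
    ([0.0, 0.2, 0.9], "Exoplanets"),
]

print(find_similar([0.8, 0.2, 0.1], known_articles))  # close to the Mars article
print(find_similar([0.5, 0.5, 0.5], known_articles))  # nothing similar enough
```

Only when `find_similar` returns `None` does the article have to go to ChatGPT for categorising, which is where the savings come from.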

And then there were the lightning talks where people can quickly share a topic of interest. It doesn’t always have to be directly Python related. That’s why there was a talk about the Swedish vessel Vasa, which sank shortly after leaving the harbour (“I’ve been on projects like this before”). But there was also a daring demo of pre-commit, a tool that won’t let you commit unless your code adheres to certain standards. And a talk about creating a Telegram bot with Firestore.

The day ended with pizza and fries. And good conversations with people I didn’t know before. This is such a nice community with many people wanting to share their knowledge with everyone else. It’s heart warming.

Day 2

On day 2 you could follow all kinds of workshops. And yes, a lot of them were RAG and AI agent related. A nice chance to try that out.

So I learned to use a vector database (MongoDB) and RAG to create an agent. And an important concept I learned here was chunking, which is used to break up text to make it possible for the agent to work faster.
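A minimal sketch of chunking, assuming a simple word-based splitter with some overlap between chunks. The workshop may well have used a different splitter; the function name and the sizes here are my own choices for illustration.

```python
def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into word-based chunks that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# 100 dummy words, so the chunk boundaries are easy to inspect.
sample = " ".join(f"word{i}" for i in range(100))
chunks = chunk_text(sample, chunk_size=40, overlap=10)
print(len(chunks))
print(chunks[1].split()[0], "...", chunks[1].split()[-1])
```

The overlap is there so that a sentence that straddles a boundary still appears whole in at least one chunk.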

Then I thought about following a workshop on scraping, but the room was already very full. So why not do more RAG? You can never do enough RAG. So I followed Shekhar Koirala’s and Shushanta Pudasaini’s workshop, which was about multi-modal RAG. What that meant was that you feed your RAG software a PDF and it will separately get text, images and tables out of it. Which you then later can use in an agent.

And lastly I followed Cheuk Ting Ho’s workshop on Polars with Polars extensions. Polars is pandas’ faster sister. For this it has been programmed in Rust. So for the first time I’ve installed Rust on my laptop. I managed to follow the entire workshop. There’s still a lot I need to know before I can implement this on site. But I’ve got a bit of the hang of it.

And that was the end of PyCon Ireland 2024. I must say I’ve enjoyed it very much.

I stepped out of the Radisson Blu and found myself looking for a new purpose for the rest of the day. I decided to explore the area of Trinity College a little bit. Because tomorrow I want to visit the Old Library. And a couple of other sights in the area.

I’ve also very much enjoyed dinner. I went to The Winding Stair next to the river Liffey. What an excellent restaurant. I’ll likely visit them again later this week.


Using OCR to get data from my Robi scale

How it started

For several years I kept track of my weight and fat with a Soehnle Body Balance, which I bought in 2018. That worked quite well until I started seeing more and more weird deviations. Take a look at the red line (fat percentage) in the graph below:

I’ve been training harder in the last 2 years, but according to the fat measurements I gained more fat, not less. And also, after a day of a long bike ride, the fat percentage would peak the next day, instead of getting lower. In the last few months I would regularly get fat percentage measurements of 30+%. And it was not like I was eating burgers, fries and ice cream every day. It didn’t look like the fat measurements were very accurate anymore.

My new scale

I decided it was time for a new personal scale. After some deliberation I picked the Robi S11. It is a “Smart body composition scale” according to the brochure. It has a handheld device that measures your body fat (and a whole lot of other things) more accurately. It is similar to how my doctor measures my fat percentage during my half yearly checkup. And it was moderately priced.

Now this is one of those scales that has a Bluetooth connection. I’ve always had a healthy mistrust of sharing my health data with apps like these. Especially when the parent company is one Guangdong Icomon Technology. Who knows where your data goes to and how securely it is stored?

I decided to give their Fitdays app a try anyway. I filled in the limited amount of personal details (and not all of them entirely accurate). And of course I didn’t give the app any more access to iPhone data than absolutely necessary. For what it’s worth.

The device does an impressive amount of measurements. It measures not just weight, fat, water and muscle tissue. It can do so per arm and leg. And somehow it also can measure bone mass and protein mass in your body. Not sure how accurate and scientific all this is though.

The app shows all these results. And then came the little matter of me wanting to copy all that data. Luckily the app has a “share” option. I was able to Airdrop that data to my MacBook. So I was excited… until I got said data. Because it was in the form of a jpeg file.

Example of the data in jpeg form (only top part because the image is very long).

Your data, in jpeg form

You can’t copy the values. You can’t get the data in any other form. Good luck!

Good luck? Well we’ll see about that. I decided to summon the power of Python! Surely there must be some way to OCR the heck out of this jpg? And, as almost always, there is a Python solution. Quite quickly I learned there is a Python package called pytesseract that can do OCR.

Using pytesseract for OCR

For a first attempt the code is fairly simple:

import pytesseract
from PIL import Image

im = Image.open("IMG_69EC2B66C329-1.jpeg") # the ROBI image with data
text = pytesseract.image_to_string(im)
print(text) # show the recognised text


And sure enough, when you run it, you get this result:

83.2 kg 18.7 %

Gewicht Lichaamsvet

Indicator Waarde Standaard

Gewicht 83.2kg Standaard
BMI 23.0 Standaard
Lichaamsvet 18.7% Standaard

Vetmassa 15.6kg Standaard


lichaamsgewicht 878k

Spiermassa 63.1kg Standaard
Spiersnelheid 75.8% Standaard
Skeletspier 46.5% Standaard
Botmassa 4.5kg Standaard
Eiwitmassa 13.5kg Standaard
Eiwit 16.2% Standaard
Watergewicht 49.6kg Standaard
Lichaamswater 59.6% Standaard
Onderhuids vet 13.4% Standaard
Visceraal vet 5.0 Standaard
BMR 1830kcal

Lichaamsleeftijd 52 Uitstekend

WHR 0.90 Standaard

Now all I have to do is select the lines with the data that I want, write it to a cleaned up data output, and I have my data in consumable form.
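Selecting those lines could look something like the sketch below. This is not the code from my repository, just a minimal illustration of the idea: the regular expression, the function name and the choice of units are assumptions based on the output above.

```python
import re

# A few lines as they came out of pytesseract (copied from the output above).
ocr_text = """Gewicht 83.2kg Standaard
BMI 23.0 Standaard
Lichaamsvet 18.7% Standaard
Onderhuids vet 13.4% Standaard
BMR 1830kcal
Indicator Waarde Standaard
"""

# Indicator name, number, optional unit, optional rating.
LINE_PATTERN = re.compile(
    r"^(?P<name>[A-Za-z][A-Za-z ]*?)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>kg|%|kcal)?"
    r"(?:\s+(?P<rating>\w+))?$"
)


def parse_measurements(text):
    """Return {name: (value, unit)} for every line that looks like a measurement."""
    results = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            results[match["name"]] = (float(match["value"]), match["unit"] or "")
    return results


measurements = parse_measurements(ocr_text)
print(measurements["Gewicht"])
print(measurements["Onderhuids vet"])
```

Lines without a number, like the table header, simply don’t match and are skipped.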

I got a lot of data out of this. But not all. For example, on this multiline name it would get the text, but the value was wrong:

As you can see in the result here:


lichaamsgewicht 878k

It was probably confused by the value being in the middle of the multiline name?

Also it would not get the text from this part with the human image:

It would not get the numbers here (except the “Standard range”):

Segmentale vetanalyse

Standaardbereik: 80%-160%


Standaard \\ Standaard
l R l

Maybe that’s something to look into in a later phase.

In any case, I was pretty happy about how easy it was to get the first results. I got enough out of it to start with. Hiding my data in a jpeg is no match for some rudimentary Python skills anymore.

I’ve put my Python code in a Github repository: https://github.com/Marcel-Jan/extract_fitdays_data

Further research

I’ve been thinking how to improve the quality of the results from pytesseract. One approach is to cut parts of the image out, so it can “focus” on these.

But I also read you can do other forms of preprocessing of the image that can help. Like what I read in this post:


I also want to store my data in a .sqlite database in the future. Now it’s still an Excel sheet. But I could do more in SQL. Maybe make a data warehouse of my own personal data.
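A first idea of what that could look like with Python’s built-in sqlite3 module. This is only a sketch of something I haven’t built yet: the table layout, the function name and the date are assumptions.

```python
import sqlite3


def store_measurements(connection, measured_on, measurements):
    """Insert one row per indicator for a given date."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS measurement (
               measured_on TEXT NOT NULL,
               indicator   TEXT NOT NULL,
               value       REAL NOT NULL,
               unit        TEXT,
               PRIMARY KEY (measured_on, indicator)
           )"""
    )
    rows = [(measured_on, name, value, unit)
            for name, (value, unit) in measurements.items()]
    # INSERT OR REPLACE makes re-importing the same day's image harmless.
    connection.executemany(
        "INSERT OR REPLACE INTO measurement VALUES (?, ?, ?, ?)", rows
    )
    connection.commit()


# In-memory database for the example; a file name would make it persistent.
conn = sqlite3.connect(":memory:")
store_measurements(
    conn,
    "2024-01-01",  # hypothetical measurement date
    {"Gewicht": (83.2, "kg"), "Lichaamsvet": (18.7, "%")},
)

for row in conn.execute(
    "SELECT indicator, value, unit FROM measurement ORDER BY indicator"
):
    print(row)
```

One row per indicator per day keeps the table narrow, so a new indicator from the scale doesn’t require a schema change.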

To be continued.


Showing a gift total on a Raspberry Pi with an e-ink display – how hard could it be?


These Python and Raspberry Pi projects. They are fun aren’t they? And often they look deceptively simple. But you don’t see all the projects that failed and usually not where they struggled. This project got stuck (and almost failed) at:

  • Not being able to scrape dynamic website content.
  • When I found out how to do that, I couldn’t run my working Python code on the Raspberry Pi.
  • That turned out to be because the scraping packages use a chromium browser, but not for the ARM processor that the Raspberry Pi has.
  • And to top it all off, the Python package for the Inky Impression e-ink display had some kind of problem running numpy.

How I memorise my lines (and other things) with Anki

In my spare time I do stage acting. And there is almost no better feeling than having performed a play really well. But to do so, you need to learn your lines. Preferably you learn your text well in advance, so you also have time to play with it.

Memorizing things is not something I’m particularly good at. For remembering things I have apps: calendars, Evernote, mail. But that doesn’t work when you’re on stage of course.

Previously, to train myself, I would record the scenes I was in, without the parts I was supposed to say. And then I would replay that recording and try to fill in the gaps. But it took quite a number of turns to get that right.

Last production I tried something else: spaced repetition. The principle of this is that you space out review sessions where hard questions return earlier and easy questions reappear in your “deck” in later review sessions.

For this I’ve used open source software called Anki. In Anki you create flashcards and Anki will run study sessions where it will determine based on spaced repetition which cards should be reviewed.

In Anki you create a deck. In my case the deck has the name of the play/production I’m studying for.

Here is a flashcard for a hypothetical play called Project Hail Mary. (Unfortunately the Venn diagram between stage actors and sci-fi nerds doesn’t seem to intersect all that much, based on my past experiences. It is still my dream to bring sci-fi to the stage some day.).

I use the line of the actor before me as the question, and my line as the answer. If I have a lot of text, or even have a monologue, I use my previous lines as question and the next line as the answer. I try to keep the amount of text in the answer limited if possible.
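I type my cards by hand, but the question/answer scheme is simple enough to sketch in Python. The dialogue and names below are made up, and the function is hypothetical, not something Anki provides.

```python
def build_cards(script, my_character):
    """Turn a list of (speaker, line) tuples into (question, answer) cards."""
    cards = []
    for index, (speaker, line) in enumerate(script):
        # Skip other people's lines, and an opening line that has no cue.
        if speaker != my_character or index == 0:
            continue
        _, previous_line = script[index - 1]
        # The cue is whatever was said just before: another actor's line,
        # or my own previous line when I'm in a monologue.
        cards.append((previous_line, line))
    return cards


# Made-up dialogue, purely for illustration.
script = [
    ("ANNA", "Did you lock the door?"),
    ("BEN", "I thought you did."),
    ("BEN", "I always think you did."),
    ("ANNA", "That explains a lot."),
]

for question, answer in build_cards(script, "BEN"):
    print(f"Q: {question}\nA: {answer}\n")
```

Anki can import text files with separated fields, so pairs like these could be written out with the csv module and imported in one go.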

I use tags to add a card to a scene. It’s okay that Anki goes through these cards randomly.

When you have finished adding your cards, you can start a study session:

Anki will start showing you the question. You don’t have to answer it. You have to memorise what the answer is.

Then you click “Show answer” and you can select how hard or easy you thought that answering was. If you choose easy, it might take multiple days before the question returns. If you choose hard, it can come back in minutes.
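To illustrate the principle, here is a heavily simplified scheduler. This is not Anki’s actual algorithm; the multipliers and the ten minute minimum are numbers I made up for the sketch.

```python
from datetime import datetime, timedelta

# Interval multipliers per answer; made-up numbers, not Anki's.
MULTIPLIER = {"good": 2.0, "easy": 4.0}
MINIMUM_INTERVAL = timedelta(minutes=10)
FIRST_INTERVAL = timedelta(days=1)


def next_review(last_interval, rating, now):
    """Return (new_interval, due_time) for a card after it was rated."""
    if rating == "hard":
        # Hard cards come back almost immediately.
        new_interval = MINIMUM_INTERVAL
    else:
        # Easier cards are pushed further away each time.
        base = max(last_interval, FIRST_INTERVAL)
        new_interval = base * MULTIPLIER[rating]
    return new_interval, now + new_interval


now = datetime(2024, 1, 1, 20, 0)
print(next_review(timedelta(days=1), "easy", now))
print(next_review(timedelta(days=4), "hard", now))
```

The effect is that cards you know well drift off to once every few weeks, while the ones you keep getting wrong stay in front of you.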

When you have gone through the deck you get this message. And you’re done for the day.

It’s up to you how soon you want to check in with Anki again. Anki doesn’t have some kind of scheduling system. It’s completely up to you to fire it up when you feel the need to study again.

But if you fire up Anki a couple of times per week, you will find that you will get better at it. And it takes significantly less time than my previous method.

And of course this method can be used for many things you need to memorise. Like that Azure DP-203 exam I have on my todo list somewhere.


Categorising text with ChatGPT. Results may be messy.

I have a hobby project I’m working on. It’s an astronomy news feed reader. Long story short: I currently gather links to interesting articles about astronomy by hand. And I want to automate this, so that I have more time to actually read the news.

What I want is that an article, based on its contents, will be tagged with a couple of keywords. And also that it is placed in one main category. For example: an article on the discovery of volcanos on Venus can have tags like “Venus”, “vulcanism”, “Magellan” and the main category is “Venus”.

So how to do that in Python? I’ve looked it up and according to Stack Overflow I need to start reading books on data mining and Natural Language Text Processing. Hmm, no. Not for a hobby project. So I was wondering. Can’t ChatGPT do it?

Homer Simpson with his campaign for sanitation commissioner: “Can’t Someone Else Do It?” in episode “Trash of the Titans”

Can’t ChatGPT do it?

Short answer: it can. Here is how I did it. First I learned how to use ChatGPT with Python from Harrison Kinsley’s video on this topic:

YouTube video on how to run simple ChatGPT queries from Python.

As Harrison explains in his video: you are going to need an OpenAI account. And using ChatGPT from Python costs money. Check the pricing here: https://openai.com/pricing. So far I’ve spent a handful of dollar cents for about 50 ChatGPT queries. But if you’re going to experiment with this, don’t forget to set some usage limits, so you won’t get unpleasant surprises.

Creating tags for a text with ChatGPT

The idea is to send ChatGPT a question with the text that needs to be categorised in it. So I create a string “Categorise this text with one to five tags: <text of the article here>”.

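import openai  # legacy interface of the openai package (before version 1.0)

message_history = []

newstext = "As Canada celebrates its first astronaut to go to the moon, it is starting a new project that could eventually enable a Canadian to walk on the lunar surface. <more text here>"

user_input = f"Categorise this text with one to five tags:\n\n {newstext}"

message_history.append({"role": "user", "content": user_input})

# The model name is an assumption; this post only mentions ChatGPT 3.5.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_history,
)

reply_content = completion.choices[0].message.content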

And sure enough, it works! Here are the tags it creates for this article on the Ariane 5 rocket for ESA’s JUICE mission to Jupiter and its moons: https://www.spacedaily.com/reports/Ariane_5_flight_VA260_Juice_fully_integrated_and_ready_for_rollout_999.html:

space mission, Ariane 5, Juice mission, Jupiter exploration, European Spaceport.

Yes, that looks like a good result.

Here is another article and what tags ChatGPT makes for it: https://www.newscientist.com/article/2367734-tonight-is-your-best-chance-to-see-mercury-in-the-night-sky/

Mercury, solar system, astronomical observations, space viewing, celestial events. 

Yup, those are pretty good tags.

And then this happened for this article: https://www.spacewar.com/reports/Thule_Air_Base_Gets_New_Name_999.html

- U.S. Space Force
- Greenland
- Department of Defense
- Pituffik Space Base
- Cultural heritage

Wait, what? Why the dashes all of a sudden? And this time no period at the end.

So ChatGPT 3.5’s results can be as inconsistent as when humans do this thing. Results get better when you tell ChatGPT to deliver the tags as comma delimited:

user_input = f"Categorize this text with one to five tags:\n\n {item['summary_detail']}." \
             f" Print the tags separated by commas."

But that’s the lesson here: you have to be very clear and specific about what you want.
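Even with a clearer prompt it doesn’t hurt to clean up the reply defensively. A minimal sketch, assuming the only variations are the ones seen above (commas, dashes on separate lines, an optional period at the end); the function name is my own.

```python
import re


def parse_tags(reply):
    """Normalise a tag reply into a clean list, whatever separator was used."""
    # Split on commas and line breaks, so both styles end up as separate items.
    parts = re.split(r"[,\n]", reply)
    tags = []
    for part in parts:
        # Drop list dashes or bullets at the start and periods at the end.
        tag = part.strip().lstrip("-*• ").rstrip(". ").strip()
        if tag:
            tags.append(tag)
    return tags


comma_reply = "space mission, Ariane 5, Juice mission, Jupiter exploration, European Spaceport."
dash_reply = (
    "- U.S. Space Force\n- Greenland\n- Department of Defense\n"
    "- Pituffik Space Base\n- Cultural heritage"
)

print(parse_tags(comma_reply))
print(parse_tags(dash_reply))
```

One known weakness: a tag that really ends in a period (like “U.S.”) would lose it.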

One main category from a list

Now I want ChatGPT to pick one main category from a list I have picked. This is the list:

astro_categories = "Mercury, Venus, Moon, Earth, Mars, " \
                   "Jupiter, Saturn, Uranus, Neptune, " \
                   "Pluto and the Kuiper Belt, Comets, " \
                   "Exoplanets, Formation of the Solar System, " \
                   "Telescopes, Meteorites, " \
                   "Artificial Intelligence, Miscellaneous"

Because you can have a message history in ChatGPT, I can ask follow-up questions, based on the text I gave it earlier.

user_followup = f"Also categorize this text in one of the following categories:\n\n {astro_categories}"
message_history.append({"role": "user", "content": user_followup})

# The model name is an assumption; this post only mentions ChatGPT 3.5.
completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message_history,
)

reply_content = completion.choices[0].message.content

Let’s see what ChatGPT makes of it.

For the article about the Ariane 5 rocket for the JUICE mission, it picks:


That’s good. I didn’t ask for the period at the end, but I can work with that.

Now let’s look at the Thule Air Base article. ChatGPT chose this as the main category:

Earth, Miscellaneous.

Wait, what? You are only supposed to pick one!

Or take this article about the Orion spaceship (which BTW seems to be an old article) https://www.spacedaily.com/reports/Orion_stretches_its_wings_ahead_of_first_crewed_Artemis_mission_999.html . What category does ChatGPT pick?

Kennedy Space Center and Orion spacecraft fall under "Miscellaneous".

Just answer the question please.

And sometimes ChatGPT gave me some extensive feedback. Take this article: https://www.spacedaily.com/reports/The_worlds_first_3D_printed_closed_afterburning_cycle_liquid_rocket_engine_successfully_flew_999.html. ChatGPT came up with these tags:

space technology, rocketry, China, Tianbing Technology, TH-11V engine.

But the response on picking the main category was:

The text should be categorized under "space technology" and "rocketry" as these are the most relevant categories. It doesn't fit neatly into any of the specific celestial bodies or topics listed in the second prompt, nor does it relate to artificial intelligence or meteorites.

Sounds reasonable. But you definitely need to be aware of such possible responses before you start relying on ChatGPT’s results.

So then I told ChatGPT to only pick one category. And if you don’t know what to do, pick Miscellaneous.

user_followup = f"Also categorize this text in only one of the following categories:\n\n {astro_categories}." \
                f" If it doesn't fit in any of these categories, categorize it under Miscellaneous."

Well, results were still mixed. Take for example this article from Space.com: https://www.space.com/best-free-star-trek-tng-and-picard-3d-prints . The tags ChatGPT chose are quite correct:

pop culture, television, 3D printing, Star Trek, Picard

What main category does ChatGPT pick now?


Telescopes? Really? The word “telescope” does not appear once in the article. (Also, this time no period at the end!)


First of all: I’m really happy that ChatGPT can come up with relevant tags. That works quite well and I intend to use it.

Second: if you want to use ChatGPT (version 3.5) in your data pipelines, you’d better be prepared for some very rigorous testing. Because it can sometimes throw some weird curveballs that can mess up the data quality just as well as humans can.
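Part of that testing can be automated by checking every reply against the category list before it gets anywhere near your data. A minimal sketch, with a hypothetical function name; it returns None for anything that isn’t exactly one known category, so the caller can decide what to do with it.

```python
ASTRO_CATEGORIES = [
    "Mercury", "Venus", "Moon", "Earth", "Mars",
    "Jupiter", "Saturn", "Uranus", "Neptune",
    "Pluto and the Kuiper Belt", "Comets",
    "Exoplanets", "Formation of the Solar System",
    "Telescopes", "Meteorites",
    "Artificial Intelligence", "Miscellaneous",
]


def validate_category(reply):
    """Return the matching category, or None if the reply is not exactly one."""
    # Tolerate surrounding whitespace and a period at the end.
    cleaned = reply.strip().rstrip(".").strip()
    for category in ASTRO_CATEGORIES:
        if cleaned.lower() == category.lower():
            return category
    return None


print(validate_category("Telescopes"))
print(validate_category("Earth, Miscellaneous."))
print(validate_category(
    'Kennedy Space Center and Orion spacecraft fall under "Miscellaneous".'
))
```

This catches the double categories and the chatty sentences. It cannot catch a valid but wrong answer like Telescopes for a Star Trek article; for that you still need to look at the results yourself.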


A Strava dashboard on a Raspberry Pi (Part 3): The Strava API

This is part 3 of a series of blogposts on how I created a Strava dashboard on an Inky Impression e-ink display with a Raspberry Pi.


This was the part that I expected to be the hard part: getting my data from Strava. Or, to be more precise: getting the connection right so the Strava API would allow me to get that data. Because it requires authentication via the OAuth2 protocol and I’ve tried a similar thing a few years back with a Google API and I just didn’t get it. But now I do.

Strava API documentation

It requires a whole “dance” between your computer code and the Strava API where you exchange all kinds of tokens back and forth. Strava’s Getting Started with the Strava API document explains it quite well. And this blogpost by Graziano Fuccio helped me a lot with the Python code: http://www.grace-dev.com/python-apis/strava-api/.

Frustratingly I still didn’t get it to work though. The reason, I found out, is that the URL for authentication has changed. From https://www.strava.com/oauth/token it became https://www.strava.com/api/v3/oauth/token. I found this elsewhere in the Strava API documentation, where the correct URL was shown. I’ve told Strava that their Getting Started documentation is outdated. They asked me to create a ticket and I’ve done so, but I don’t think they have changed their document yet. Graziano Fuccio did update his blogpost though.
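To show where that URL goes, here is a minimal sketch of the request that exchanges the authorization code for tokens. It only builds the request and doesn’t send it. The credentials are placeholders, the function name is my own, and the code in the rest of this post may differ from this.

```python
import urllib.parse
import urllib.request

# The token URL that worked for me (see above).
TOKEN_URL = "https://www.strava.com/api/v3/oauth/token"


def build_token_request(client_id, client_secret, authorization_code):
    """Build (but do not send) the POST request that exchanges a code for tokens."""
    form_fields = {
        "client_id": client_id,
        "client_secret": client_secret,
        "code": authorization_code,
        "grant_type": "authorization_code",
    }
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    return urllib.request.Request(TOKEN_URL, data=body, method="POST")


# Placeholder values; the real ones come from your Strava API application
# settings and from the redirect after you have authorised the application.
request = build_token_request("12345", "my-client-secret", "code-from-redirect")
print(request.full_url)
print(request.get_method())
```

Sending it with `urllib.request.urlopen(request)` should return JSON with, among other things, an access token and a refresh token.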


A Strava dashboard on a Raspberry Pi (Part 2): Installing software

In the last blogpost we set up the Raspberry Pi, attached the Inky Impression display and got the Raspberry Pi ready for remote access.

Time to get the Inky Impression software installed and make the Inky Impression screen display something.

Your SSH connection of choice

For this we’re going to have to run some commands via remote SSH. There are multiple ways to log in remotely. You can use a tool like PuTTY or the terminal on macOS (I like iTerm2). That’s actually simpler.

But I chose to use Visual Studio Code because you can edit Python code remotely via SSH straight on the Raspberry Pi.

To do this you must install Visual Studio Code. Visual Studio Code has all kinds of extensions. Here we will install the Remote – SSH extension. And while you’re at it, maybe install the Python extension as well, because we will be writing some Python later.

Installing the Remote-SSH extension in Visual Studio Code

A Strava dashboard on a Raspberry Pi (Part 1): Setting up the Raspberry Pi

This is the list of hardware I’ve used:

  • An Inky Impression 5.7 inch e-ink display.
  • (The Inky Impression comes with a 40-pin female header included to boost height for full-size Pis and standoffs included to securely attach to your Pi)
  • A Raspberry Pi 3 model B+ (I had lying around) + power supply
  • A micro SD card with 8 GB storage or more.
  • Initially: keyboard, mouse and monitor (but if you configure the WiFi on the Raspberry Pi and configure it to allow remote SSH, you can connect to it via WiFi from the convenience of your regular computer)

For those who don’t know a Raspberry Pi: this is a very small and quite cheap computer. The Raspberry Pi 3B+ I’ve used for example is about 40 euros. But you can spend even less, because my Strava dashboard doesn’t exactly require a lot of computing power.

So you could instead use a Raspberry Pi Zero 2 W (15-25 euros), which takes up less space also. But I believe this will require soldering to attach the GPIO. And it seems to be out of stock on a lot of sites.


Building a Strava dashboard on a Raspberry Pi with an e-ink display

Let’s face it: my purchase of the Pimoroni Inky Impression 5.7 inch display was a solution looking for a problem. I saw a video about it and I was sold on the idea of having an e-ink display on one of my Raspberry Pi’s.

The Pimoroni Inky Impression on a Raspberry Pi 3B

While having a 7-colour e-ink display is cool and all, I had to come up with a good plan to utilize one. So it wouldn’t end up in a drawer after a short experiment.

You can use it to display images, but it is 7-colour. So you have to “dither” full colour images to have it display well. Actually comic book style images are displayed much better than the average dithered photo. The resolution is quite low (600×448) and the refresh rate is quite slow (10-20 seconds). But for some applications this is just fine.
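For those wondering what dithering actually does, here is a small pure-Python sketch of Floyd-Steinberg error diffusion onto a seven colour palette. It is only meant to show the principle: the RGB values are my approximation and not the display’s real palette, and in practice you would let an imaging library do this.

```python
# Approximate RGB values for seven colours; the display's real palette may differ.
PALETTE = [
    (0, 0, 0),        # black
    (255, 255, 255),  # white
    (0, 255, 0),      # green
    (0, 0, 255),      # blue
    (255, 0, 0),      # red
    (255, 255, 0),    # yellow
    (255, 140, 0),    # orange
]


def nearest_colour(pixel):
    """Return the palette colour with the smallest squared RGB distance."""
    return min(PALETTE, key=lambda c: sum((p - q) ** 2 for p, q in zip(pixel, c)))


def dither(image):
    """Floyd-Steinberg dithering on a list of rows of (r, g, b) tuples."""
    height, width = len(image), len(image[0])
    work = [[list(pixel) for pixel in row] for row in image]
    # (dx, dy, weight) of the neighbours that receive the quantisation error.
    spread = [(1, 0, 7 / 16), (-1, 1, 3 / 16), (0, 1, 5 / 16), (1, 1, 1 / 16)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = nearest_colour(old)
            error = [o - n for o, n in zip(old, new)]
            work[y][x] = list(new)
            for dx, dy, weight in spread:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    for channel in range(3):
                        work[ny][nx][channel] += error[channel] * weight
    return [[tuple(pixel) for pixel in row] for row in work]


# A small mid-grey image: every output pixel is one of the seven palette colours.
grey = [[(128, 128, 128)] * 4 for _ in range(4)]
for row in dither(grey):
    print(row)
```

Each pixel is replaced by the closest available colour, and the difference is pushed to the neighbours that still have to be processed. That is why a flat colour turns into a pattern of different dots.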


Using Stable Diffusion to create images for a presentation

Have you heard about text-to-image models like DALL-E 2, Stable Diffusion and MidJourney? These are AI algorithms that take in text (the “prompt”) that describes what kind of picture you want as input and as output the algorithm creates that picture, based on billions of images.

An example could be “an astronaut on a bicycle on the moon by Van Gogh”. And this would be one of the results:

{"prompt": {"software": "imaginairy", "prompts": [[1, "an astronaut on a bicycle on the moon in the style of Van Gogh"]], "prompt_strength": 7.5, "init_image": "None", "init_image_strength": 0.6, "seed": 938321671, "steps": 40, "height": 512, "width": 512, "upscale": false, "fix_faces": false, "sampler_type": "plms"}}

I got access to DALL-E 2 in July this year. DALL-E 2 is a closed source algorithm made by OpenAI. You can sign up to request access to DALL-E 2. Once you get access you can use it for free for a limited number of runs. After that you have to pay to use it more.
