Data engineering in the European cloud – Part 2: Scaleway

This is Part 2 in a series where I try to create a data engineering environment in the European cloud. In Part 1 I described my plan for creating a data lakehouse in the European cloud. Now it’s time to get our hands dirty. We’re going to do this in the Scaleway cloud.

The architecture

To get this data lakehouse running we will create a Kubernetes cluster and object storage for our data. In Kubernetes we can run the containerised applications that make up the lakehouse. I consulted ChatGPT for this architecture; it suggested a better and more modern solution than the one I originally had in mind.
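Just to make the storage part concrete, here is a minimal sketch of talking to Scaleway's S3-compatible Object Storage from Python with boto3. The bucket name and credential placeholders are my own assumptions for illustration, and the fr-par endpoint should be swapped for whatever region you actually use:

```python
# Minimal sketch: talk to Scaleway Object Storage through its S3-compatible API.
# Bucket name and credential placeholders are assumptions for illustration.
import boto3

s3 = boto3.client(
    "s3",
    region_name="fr-par",
    endpoint_url="https://s3.fr-par.scw.cloud",  # Scaleway's S3-compatible endpoint
    aws_access_key_id="SCW_ACCESS_KEY",          # your Scaleway API access key
    aws_secret_access_key="SCW_SECRET_KEY",      # your Scaleway API secret key
)

s3.create_bucket(Bucket="my-lakehouse-data")     # hypothetical bucket name
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```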

We’re going to use the Apache Iceberg open table format. This will allow us to create database-like tables on top of Parquet-formatted files. Nessie will be the Iceberg data catalog (Hive Metastore was another option). The catalog lets our data tools find the Iceberg tables and the underlying Parquet files.
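To give an idea of what the catalog does for us, here is a rough sketch with PyIceberg, assuming Nessie is reachable through its Iceberg REST endpoint and a warehouse bucket has been configured. The URI, warehouse path and table name are placeholders, not a tested setup:

```python
# Rough sketch: register an Iceberg table via a REST catalog (Nessie exposes one).
# The URI, warehouse and S3 settings below are placeholders, not a tested config.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType, TimestampType

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg",            # Nessie's Iceberg REST endpoint
        "warehouse": "s3://my-lakehouse-data/warehouse",  # bucket from the storage step
        "s3.endpoint": "https://s3.fr-par.scw.cloud",
    },
)

schema = Schema(
    NestedField(field_id=1, name="event_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="event_type", field_type=StringType(), required=False),
    NestedField(field_id=3, name="created_at", field_type=TimestampType(), required=False),
)

catalog.create_namespace("demo")
table = catalog.create_table("demo.events", schema=schema)
print(table.location())  # points at the Parquet data files in object storage
```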

Trino will be the query engine. That will be the fastest way to get our first queries going.
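Once Trino is up, a first query from Python could look something like this minimal sketch; the host, user, catalog, schema and table names are placeholders, assuming a Trino catalog named iceberg that points at the Nessie catalog:

```python
# Minimal sketch: run a first query against an Iceberg table through Trino.
# Host, user, catalog, schema and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.my-cluster.internal",  # wherever the Trino coordinator is exposed
    port=8080,
    user="lakehouse",
    catalog="iceberg",                 # Trino catalog backed by the Nessie catalog
    schema="demo",
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())
```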


Data engineering in the European cloud – Part 1: the plan

We all know how dependent Europe has become on US cloud providers, and we know the risks this poses in the current political climate. And yet we keep using more and more US cloud services. Read Bert Hubert’s writings on the European cloud situation.

And to be honest, when customers ask for advice on starting a new data engineering ecosystem, Azure Fabric and Databricks are at the top of my list.

But while it might be hard to switch from Office 365 to open source solutions (especially moving all your users to unfamiliar platforms), the data engineering landscape offers plenty of widely adopted open source solutions, solutions that end users rarely need to deal with directly. Couldn’t we run these products somewhere else? So I went on an investigation.


How to use data to find the best spot for a sponsor event

As you might know, I’m currently doing sponsor events for Tour for Life, collecting funds for the Daniel den Hoed Foundation for cancer research.

Aniel, me and Transfer Solutions CTO Albert Leenders at a sponsor event last Saturday in Ede.

Aniel and I have been doing this for the third year now. And we have noticed quite big differences in proceeds per location. You’d think large crowds (like on Dam Square in Amsterdam) would guarantee large amounts of donations. Not so. A more humble spot like my home town Gouda outdid them by a factor of 9 in the same year!


I started vlogging about data mesh (and other things)

Last June I made a short video while walking in the park next to the DIKW Intelligence office. And I posted it on LinkedIn. To my surprise it did very well. So I thought: why not make more of these short videos on data topics? And why not make them somewhere in nature?

I’m on my bike almost every day this time of year. Surely I could make a short stop and do a little talk? I started making them in Dutch and then also in English.

Photo locations, marker icons and displaying photos on my map

When I finished creating my video location map in Python last week, I thought “shame I can’t plot photo locations”. That’s because my Fuji X-T30 camera doesn’t store GPS info. When I bought the camera I assumed every modern camera had GPS tagging, so I didn’t even check for that feature. Too bad. But I also took some photos during my vacations with my humble iPhone 8, and it does have GPS tags. So let’s plot some photo locations.
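The gist of it, as a minimal sketch: read the GPS tags from the photo’s EXIF data and drop a marker on the map. The file names are made up, and the Pillow/folium combination is an assumption about the tooling, not necessarily exactly what I used:

```python
# Minimal sketch: read GPS coordinates from a photo's EXIF data and put a
# marker on a folium map. File names and library choice are illustrative.
from PIL import Image
from PIL.ExifTags import GPSTAGS
import folium

def gps_from_exif(path):
    """Return (latitude, longitude) in decimal degrees, or None if missing."""
    exif = Image.open(path).getexif()
    gps_ifd = exif.get_ifd(0x8825)  # 0x8825 = GPSInfo IFD
    gps = {GPSTAGS.get(tag, tag): value for tag, value in gps_ifd.items()}
    if "GPSLatitude" not in gps or "GPSLongitude" not in gps:
        return None

    def to_degrees(dms, ref):
        # dms is a (degrees, minutes, seconds) tuple of rationals
        degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -degrees if ref in ("S", "W") else degrees

    lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return lat, lon

m = folium.Map(location=[52.0, 5.0], zoom_start=7)   # roughly the Netherlands
for photo in ["IMG_0001.jpg", "IMG_0002.jpg"]:        # hypothetical file names
    coords = gps_from_exif(photo)
    if coords:
        folium.Marker(coords, popup=photo).add_to(m)
m.save("photo_map.html")
```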


My GitHub repo got 50 stars

I never imagined myself as the maintainer of a data engineering related open source project. Yet here we are. When I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera sandbox, but some of my colleagues tried it and complained that it took way too much time to start and way too many resources on their laptops. It was a good bet that our students would run into the same problem.

Long story short: I found that Big Data Europe already had a simple Dockerized Hadoop setup. They actually did all the hard work. But I wanted it to have Hive and Spark too. I started playing with docker-compose yml files and learned a lot from that, by the way. And after some initial frustrations it finally worked.