Data engineering in the European cloud – Part 2: Scaleway

This is Part 2 in a series where I try to create a data engineering environment in the European cloud. In Part 1 I described my plan for creating a data lakehouse in the European cloud. Now it’s time to get our hands dirty. We’re going to do this in the Scaleway cloud.

The architecture

To get this data lakehouse running we will create a Kubernetes cluster and object storage for our data. In Kubernetes we can run the containerised applications that make up the lakehouse. I consulted ChatGPT for this architecture; it suggested a better and more modern solution than the one I originally had in mind.
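Just to make the storage part concrete, here is a minimal sketch of talking to Scaleway's S3-compatible Object Storage from Python with boto3. The bucket name and credential placeholders are my own assumptions for illustration, and the fr-par endpoint should be swapped for whatever region you actually use:

```python
# Minimal sketch: talk to Scaleway Object Storage through its S3-compatible API.
# Bucket name and credential placeholders are assumptions for illustration.
import boto3

s3 = boto3.client(
    "s3",
    region_name="fr-par",
    endpoint_url="https://s3.fr-par.scw.cloud",  # Scaleway's S3-compatible endpoint
    aws_access_key_id="SCW_ACCESS_KEY",          # your Scaleway API access key
    aws_secret_access_key="SCW_SECRET_KEY",      # your Scaleway API secret key
)

s3.create_bucket(Bucket="my-lakehouse-data")     # hypothetical bucket name
print([b["Name"] for b in s3.list_buckets()["Buckets"]])
```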

We’re going to use the Apache Iceberg open table format. This will allow us to create database-like tables on top of Parquet-formatted files. Nessie will be the Iceberg data catalog (Hive Metastore was another option). The catalog lets our data tools find the Iceberg tables and the underlying Parquet files.
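To give an idea of what the catalog does for us, here is a rough sketch with PyIceberg, assuming Nessie is reachable through its Iceberg REST endpoint and a warehouse bucket has been configured. The URI, warehouse path and table name are placeholders, not a tested setup:

```python
# Rough sketch: register an Iceberg table via a REST catalog (Nessie exposes one).
# The URI, warehouse and S3 settings below are placeholders, not a tested config.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, LongType, StringType, TimestampType

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://nessie:19120/iceberg",            # Nessie's Iceberg REST endpoint
        "warehouse": "s3://my-lakehouse-data/warehouse",  # bucket from the storage step
        "s3.endpoint": "https://s3.fr-par.scw.cloud",
    },
)

schema = Schema(
    NestedField(field_id=1, name="event_id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="event_type", field_type=StringType(), required=False),
    NestedField(field_id=3, name="created_at", field_type=TimestampType(), required=False),
)

catalog.create_namespace("demo")
table = catalog.create_table("demo.events", schema=schema)
print(table.location())  # points at the Parquet data files in object storage
```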

Trino will be the query engine. That will be the fastest way to get our first queries going.
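Once Trino is up, a first query from Python could look something like this minimal sketch; the host, user, catalog, schema and table names are placeholders, assuming a Trino catalog named iceberg that points at the Nessie catalog:

```python
# Minimal sketch: run a first query against an Iceberg table through Trino.
# Host, user, catalog, schema and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.my-cluster.internal",  # wherever the Trino coordinator is exposed
    port=8080,
    user="lakehouse",
    catalog="iceberg",                 # Trino catalog backed by the Nessie catalog
    schema="demo",
)

cur = conn.cursor()
cur.execute("SELECT count(*) FROM events")
print(cur.fetchall())
```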


Data engineering in the European cloud – Part 1: the plan

We all know how dependent Europe has become on US cloud providers, and we know the risks this poses in the current political climate. And yet we keep using more and more US cloud services. Read Bert Hubert’s writings on the European cloud situation.

And to be honest, when customers ask for advice on starting a new data engineering ecosystem, Azure Fabric and Databricks are at the top of my list.

But while it might be hard to switch from Office 365 to open source solutions (especially moving all your users to unfamiliar platforms), the data engineering landscape offers plenty of widely adopted open source solutions, solutions that end users rarely need to deal with directly. Couldn’t we run these products somewhere else? So I went on an investigation.


How to use data to find the best spot for a sponsor event

As you might know, I’m currently doing sponsor events for Tour for Life, collecting funds for the Daniel den Hoed Foundation for cancer research.

Aniel, me and Transfer Solutions CTO Albert Leenders at a sponsor event last Saturday in Ede.

Aniel and I have been doing this for the third year now. And we have noticed quite big differences in proceeds per location. You’d think large crowds (like on Dam Square in Amsterdam) would guarantee large amounts of donations. Not so. A more humble spot like my home town Gouda outdid them by a factor of 9 in the same year!


I started vlogging about data mesh (and other things)

Last June I made a short video while walking in the park next to the DIKW Intelligence office. And I posted it on LinkedIn. To my surprise it did very well. So I thought: why not make more of these short videos on data topics? And why not make them somewhere in nature?

I’m on my bike almost every day this time of year. Surely I could make a short stop and do a little talk? I started making them in Dutch and then also in English.

Photo locations, marker icons and displaying photos on my map

When I finished creating my video location map in Python last week, I thought “shame I can’t plot photo locations”. That’s because my Fuji X-T30 camera doesn’t store GPS info. When I bought the camera I assumed every modern camera had GPS tagging, so I didn’t even check for that feature. Too bad. But I also took some photos during my vacations with my humble iPhone 8, and it does have GPS tags. So let’s plot some photo locations.
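The gist of it, as a minimal sketch: read the GPS tags from the photo’s EXIF data and drop a marker on the map. The file names are made up, and the Pillow/folium combination is an assumption about the tooling, not necessarily exactly what I used:

```python
# Minimal sketch: read GPS coordinates from a photo's EXIF data and put a
# marker on a folium map. File names and library choice are illustrative.
from PIL import Image
from PIL.ExifTags import GPSTAGS
import folium

def gps_from_exif(path):
    """Return (latitude, longitude) in decimal degrees, or None if missing."""
    exif = Image.open(path).getexif()
    gps_ifd = exif.get_ifd(0x8825)  # 0x8825 = GPSInfo IFD
    gps = {GPSTAGS.get(tag, tag): value for tag, value in gps_ifd.items()}
    if "GPSLatitude" not in gps or "GPSLongitude" not in gps:
        return None

    def to_degrees(dms, ref):
        # dms is a (degrees, minutes, seconds) tuple of rationals
        degrees = float(dms[0]) + float(dms[1]) / 60 + float(dms[2]) / 3600
        return -degrees if ref in ("S", "W") else degrees

    lat = to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    return lat, lon

m = folium.Map(location=[52.0, 5.0], zoom_start=7)   # roughly the Netherlands
for photo in ["IMG_0001.jpg", "IMG_0002.jpg"]:        # hypothetical file names
    coords = gps_from_exif(photo)
    if coords:
        folium.Marker(coords, popup=photo).add_to(m)
m.save("photo_map.html")
```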


My GitHub repo got 50 stars

I never imagined myself as the maintainer of a data engineering related open source project. Yet here we are. When I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera sandbox, but some of my colleagues tried it and complained that it took way too much time to start and way too many resources on their laptops. It was a good bet that our students would run into the same problem.

Long story short: I found that Big Data Europe already had a simple Dockerized Hadoop setup. They actually did all the hard work. But I wanted it to have Hive and Spark too. I started playing with docker-compose yml files and learned a lot from that, by the way. And after some initial frustrations it finally worked.