Data engineering – Page 2 – Expedition Data

My Github repo got 50 stars

I never imagined myself as the maintainer of a data engineering-related open source project. Yet here I am. When I was working on our data engineering course, I needed some kind of data lake software. At first I used the Cloudera sandbox, but some of my colleagues tried it and complained that it took way too much time to start and way too many of their laptop's resources. It was a good bet that our students would run into that problem too.

Long story short: I found that Big Data Europe already had a simple Dockerized Hadoop. They actually did all the hard work. But I wanted it to have Hive and Spark too. I started playing with docker-compose YAML files, and learned a lot from that, by the way. After some initial frustrations it finally worked.

By Marcel-Jan Krijgsman, 4 years ago

Active Learning

Five years of data engineering

Five years ago I made the switch from Oracle database administration to data engineering. It has been quite a ride. I made a video about this to celebrate.

By Marcel-Jan Krijgsman, 4 years ago

Active Learning

What a year 2021 has been

So at the end of 2021 I found myself in the waiting room of an emergency dentist. An infection above my front teeth had become unbearable. Fortunately, antibiotics have made my life much better now. Let that event not colour my view of 2021. For me 2021 was a great year, despite …

By Marcel-Jan Krijgsman, January 2, 2022 (4 years ago)

Data engineering

What I think data engineering is (revisited)

I’ve been working as a data engineer for four years now. And when I started writing about how to enter this field (because people sometimes ask me), I found out it’s better to start by writing about what data engineering actually is. Because my view on that has changed. And actually, data engineering has changed as well.

Back in 2017, when I made the jump from Oracle database administration, I thought, or was hoping, that a data engineer more or less was a data administrator in Big Data. Sure, it took a bit more programming skills and DevOps and all that, but I thought my experience in operations would largely pay off.

On the other hand, weren’t data engineers supposed to support data scientists, so the data would be prepped for them and they could iterate over this data faster? I found out data engineers exist without data scientists just as well. They provide data to the whole organization, so it can be data driven. Or management at least hopes it will be.

By Marcel-Jan Krijgsman, May 19, 2021 (5 years ago)

Data engineering

Don’t do data management just because you have to

Lately more and more organizations are doing data management. Suddenly there are data owners, data stewards and metadata repositories (in whatever form) everywhere. We all seem to do this mainly because we have to. Because of the GDPR or the California Consumer Privacy Act (CCPA). Or because other institutions demand we can explain where our data comes from.

But in my opinion there is one important reason that is mostly overlooked. One that has an important positive impact on business results, yet doesn’t seem to end up in the KPIs. And that is how much time it takes to find the right data when building data products.

By Marcel-Jan Krijgsman, February 3, 2021 (5 years ago)

Data engineering

Tech dossier: pandas

I’m keeping tech dossiers in Evernote on open source products I want to keep track of. And I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas. A short description – in English …

By Marcel-Jan Krijgsman, September 26, 2019 (7 years ago)

Data engineering

Things I’ve learned about metadata for a data lake

I’ve been thinking of writing a blogpost about Apache Atlas. For one and a half years I’ve gained a unique experience with this product that I would like to share with the world.

But first we need to talk about metadata. That is one of the important uses of Apache Atlas. Meaningful metadata won’t get in there by accident. Maybe you are just starting your journey into metadata. I’m here to say that it’s going to take work. Not just by you, but everyone in your organization who has a stake in data. So in this blogpost I will be talking more about the organizational side of metadata and not so much on the technical side.

What do I mean by metadata?

Metadata can mean many things. Search it and you’ll find that there’s metadata used to “get to know you better” by companies, or in other words: for ad targeting. There also is metadata used by intelligence agencies to find out if you plan to do anything bad. But the metadata I’m talking about is the kind of information that you can use to find data in an organization.

By Marcel-Jan Krijgsman, 7 years ago

Data engineering

Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not a lot that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data engineering-centered Spark book. So I decided to buy the ebook (also because, as a Patreon supporter of the Roaring Elephant podcast, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.

By Marcel-Jan Krijgsman, September 1, 2019 (7 years ago)