Five years of data engineering
Five years ago I made the switch from Oracle database administration to data engineering. It has been quite a ride. I made a video about this to celebrate.
So at the end of 2021 I found myself in the waiting room of an emergency dentist. An infection above my front teeth had become unbearable. Fortunately, antibiotics have made my life much better now. But let that event not colour my view of 2021. For me 2021 was a great year, despite …
I’ve been working as a data engineer for four years now. And when I started writing about how to enter this field (because people sometimes ask me), I found out it’s better to start by writing about what data engineering actually is. Because my view on that has changed. And actually, data engineering has changed as well.
Back in 2017, when I made the jump from Oracle database administration, I thought, or was hoping, that a data engineer was more or less a data administrator for Big Data. Sure, it took a bit more programming skill, DevOps and all that, but I thought my experience in operations would largely pay off.
On the other hand, weren’t data engineers supposed to support data scientists, so the data would be prepped for them and they could iterate over it faster? I found out that data engineers exist just as well without data scientists. They provide data to the whole organization, so it can be data-driven. Or management at least hopes it will be.
Lately more and more organizations are doing data management. Suddenly there are data owners, data stewards and metadata repositories (in whatever form) everywhere. We all seem to do this mainly because we have to. Because of the GDPR or the California Consumer Privacy Act (CCPA). Or because other institutions demand we can explain where our data comes from.
But in my opinion there is one important reason that is mostly overlooked. One that has an important positive impact on business results, yet doesn’t seem to end up in the KPIs. And that is how much time it takes to find the right data when building data products.
I keep tech dossiers in Evernote on open source products I want to keep track of. And I decided to put them on my blog. My previous ones were on Kubernetes and Elasticsearch. This one is on the Python data management library pandas. A short description – in English …
I’ve been thinking of writing a blogpost about Apache Atlas. Over the past year and a half I’ve gained unique experience with this product that I would like to share with the world.
But first we need to talk about metadata. That is one of the important uses of Apache Atlas. Meaningful metadata won’t get in there by accident. Maybe you are just starting your journey into metadata. I’m here to say that it’s going to take work. Not just by you, but by everyone in your organization who has a stake in data. So in this blogpost I will be talking more about the organizational side of metadata and not so much about the technical side.
Metadata can mean many things. Search for it and you’ll find that there’s metadata used by companies to “get to know you better”, or in other words: for ad targeting. There is also metadata used by intelligence agencies to find out if you plan to do anything bad. But the metadata I’m talking about is the kind of information that you can use to find data in an organization.
There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.
On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a patron of the Roaring Elephant podcast, I have a discount key at Manning Publications).
Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.