What I think data engineering is (revisited)

I’ve been working as a data engineer for four years now. When I started writing about how to enter this field (because people sometimes ask me), I found it’s better to start by writing about what data engineering actually is, because my view on that has changed. And actually, data engineering has changed as well.

Back in 2017, when I made the jump from Oracle database administration, I thought, or was hoping, that a data engineer was more or less a data administrator in Big Data. Sure, it took a bit more programming skill and DevOps and all that, but I thought my experience in operations would largely pay off.

On the other hand, weren’t data engineers supposed to support data scientists, so the data would be prepped for them and they could iterate over it faster? I found out that data engineers exist just as well without data scientists. They provide data to the whole organization, so it can be data-driven. Or management at least hopes it will be.


Book review: Spark in Action, 2nd edition

There are lots of books on Spark, but not many that are aimed at the data engineer. Data engineers use Spark to ingest and transform data, which is different from what data scientists use it for.

On the Roaring Elephant podcast I heard an interview with Jean-Georges Perrin, author of Spark in Action, 2nd Edition, and it was clear that this would be a very data-engineering-centered Spark book. So I decided to buy the ebook (also because, as a patron of the Roaring Elephant podcast on Patreon, I have a discount key at Manning Publications).

Spark in Action, 2nd Edition, is not yet finished. It’s a so-called MEAP (Manning Early Access Program) book, which means the author is still writing parts of it. But he has already written chapters 1 to 15 and many appendices, so he seems pretty far along. I’ve read all the regular chapters and I can honestly say that I did a little proofreading.
