Four years now I’ve been working as a data engineer. And when I started writing about how to enter this field (because people sometimes ask me), I found out it’s better to start by writing about what data engineering actually is. Because my view on that has changed. And actually, data engineering has changed as well.
Back in 2017, when I made the jump from Oracle database administration, I thought, or was hoping, that a data engineer was more or less a data administrator in Big Data. Sure, it took a bit more programming skill and DevOps and all that, but I thought my experience in operations would largely pay off.
On the other hand, weren’t data engineers supposed to support data scientists, so the data would be prepped for them and they could iterate over this data faster? I found out data engineers exist without data scientists just as well. They provide data to the whole organization, so it can be data driven. Or management at least hopes it will be.
Over time I read and heard a lot of opinions on what a data engineer should be. They range from a data store operations specialist to an almost pure Java programmer, and anything in between. I’ve read books and blogs that swear you need to know Java... or maybe Python is just fine... or only SQL.
Anyway, data engineering turned out to be a much wider field than I had expected. It really can be totally different things for different organizations and different people.
For one thing, I did not realize that people working with good old data warehouses and business intelligence were calling themselves data engineers as well. When I started working at my current employer, DIKW, two years ago, I learned how important data modelling is in the field of data engineering. Because, sure, you can throw a bunch of data together, but if it is modelled right it can be used much more efficiently by the data analysts and data scientists, or in any data driven organization as a whole.
But when I met a former colleague on the train to work once and told her I was on a BI team at my client, she said “BI? Does that still exist? We’ve done away with BI in our entire organization”. I didn’t ask what replaced it.
Another thing I read a lot is that data engineering is all about the pipelines. It was initially not clear to me what pipelines were supposed to be. Were they specific ways of working, like containerization? Was it something where you needed to create REST API endpoints or something? Or was this simply getting data from A to B (possibly after transformation) with whatever technology you liked to use? My take is that it’s either the last one, or it refers to pipelines in CI/CD. In the case of CI/CD it’s actually your code that goes through a pipeline rather than your data.
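To make the “data from A to B” reading a bit more concrete, here is a minimal sketch in Python, using only the standard library. Everything in it is made up for illustration: the inlined CSV, the `customer_totals` table and the function names are assumptions, not a reference to any particular tool. Real pipelines add scheduling, error handling and monitoring on top, but the shape is usually the same: extract, optionally transform, load.

```python
import csv
import io
import sqlite3

# Hypothetical source data ("A"): a small CSV export, inlined so the
# example is self-contained.
SOURCE_CSV = """customer,amount
alice,10.50
bob,3.25
alice,4.50
"""


def extract(text):
    """Read the raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Sum the amounts per customer (the optional transformation step)."""
    totals = {}
    for row in rows:
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0.0) + float(row["amount"])
    return totals


def load(totals, conn):
    """Write the result to the target ("B"), here a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_totals (customer, total) VALUES (?, ?)",
        totals.items(),
    )
    conn.commit()


def run_pipeline(conn):
    """Chain the three steps: extract, transform, load."""
    load(transform(extract(SOURCE_CSV)), conn)


conn = sqlite3.connect(":memory:")
run_pipeline(conn)
result = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(result)
```

Swapping the CSV for an API call, or SQLite for a data warehouse, changes the technology but not the idea, which is why “with whatever technology you like” seems a fair description.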
Too large a field
By now I’m convinced that data engineering as a field is so wide, so vast, it’s simply impossible to learn it all. Even more so if you try to learn it on your own, with only MOOCs and books. You’ll find this especially hard if you have to do this entirely outside your daily job.
It takes many years to become the Java/Python programmer who is comfortable with four types of databases/data stores and can create a REST API at will, preferably in Azure/AWS/GCP, containerized on Kubernetes, while keeping the code version controlled in a CI/CD pipeline, and also making this data available, pleasantly modelled, the way mamma used to do. No wonder organizations have a hard time finding them.
So relax: every data engineer has his or her strengths and weaknesses.
Though I did not start writing this blogpost to advertise our Certified Data Engineering Professional course, we certainly did try to give you a big push into the data engineering field with it. But to prove a point: we created this course with four teachers, each with their own expertise in one part of data engineering, and four other colleagues helped with additional modules. Because even we didn’t have all this expertise combined in one teacher. And this resulted in a course of 12 days (one course day per week).
So what is data engineering?
I think you’ll find it’s a combination of multiple areas of expertise. Not all of them will be 100% required in every organization. Some will not even exist everywhere. And data engineers will not have all of them either.
I’d say these areas (roughly) are:
- Databases and data stores. Think RDBMS-es, data lakes, document stores, graph databases, etc.
- Data modelling. Think Data Vault, etc.
- Development. Choose your programming language, be able to transform data and build REST APIs. I chose to place CI/CD here as well.
- Containerization. Think Docker and Kubernetes.
- Machine Learning support. Bringing ML into production and maintaining it.
- And of course the cloud.
Besides that, there are other disciplines you need to know about, like data quality, data governance, security, privacy and ethics.
It’s fun though
Whatever variation of data engineering you end up in, I can say I’ve had a lot of fun learning all about Hadoop, MongoDB, Elasticsearch, graph databases, Docker, Kubernetes, NiFi, Python and pandas, Spark, etc. I mean: it’s a lot to take in. A lot to learn. But I must say that every time I learn a new type of software or concept, I encounter the brilliant new ideas that produced it. And that often is just energizing. In that respect, these four years as a data engineer have never been dull.
How do you become a data engineer?
Originally it was this question I set out to answer in this blogpost. This is an interesting question. I regularly get asked about this by former colleagues and I’m happy to chat about it. I will come back to this question in a later post. I promise the answer won’t just be “buy my course”.