It is almost nine years since I became a data engineer, and I can honestly say my job is vastly different from what I thought it would be at the start. I still love what I do, but the reality is that data-product building is far more than just handling big data systems. Nine years ago, though, I thought I would be kind of a Big Data DBA. I hadn’t given much thought to the data modelling involved.
Gradually I learned that building data products takes quite a bit of forethought, and that you need to know your source data quite well.
Assume stuff and break things
When I was quite green in the field of data engineering, I was asked to build a report on loan collateral. I dug into the data warehouse, and there was a table called ISACBA.COLLATERAL. Great! My quest for data analysis was at an end. Let’s build the report.
When I showed the report to the end user, he told me the numbers were all wrong. Luckily he was understanding, as I was new at this site, and I got a second chance. One of my team members told me what went wrong. There were three tables with COLLATERAL in the name. The one I picked contained mortgages in addition to collateral for lease contracts. The company had actually stopped doing mortgages about 10 years ago (but some of those contracts were still active).
My mistake was to pick a table with “collateral” in the name and assume that it would cover my data needs. By the way: if you ask AI / a Large Language Model to do this kind of data analysis and you don’t provide it with a data catalog or other metadata, expect similar results. Here the risk that AI is driven to provide an answer, any answer, is substantial.
More than meets the eye
But even if you have some information about what data can be found in a table (maybe from a data catalog), your analysis often isn’t finished. Take for example the work I did for a customer who asked our team to recreate an annual account from the source data. One part of the source data we needed was client transactions.
I managed to find the table that contained their transactions. After getting a grasp of what this data covered, I managed to model it for the gold layer in our data warehouse in such a way that it could be linked to invoices and that the numbers added up correctly.
But there was more: there were tables called rejected_transactions and marked_transactions, and I needed to process these as well. First question: what are rejected and marked transactions? There was something that you could, with some imagination, call a data catalog. And the description there didn’t help, because it said that the rejected_transactions table contained transactions that were rejected. The description of the marked_transactions table was much the same. The business glossary was blank on these terms. This was a good moment to ask their maintainers to add them.
I got on a call with one of the developers of the application that produced the source data. He told us that rejected transactions were on account of the company and marked transactions were on account of the supplier. A later conversation with a business subject matter expert clarified that rejected transactions were in fact on account of the supplier, and that marked transactions were manually processed specific issues that might be on account of the company, depending on the case.
I tried to explain this to my team members. We all got confused. A couple of times. To get everyone’s facts straight, we wrote a document describing how we thought this was supposed to work and asked the business subject matter expert to fact-check it. The writing process helps you clarify your thoughts on the matter, and the fact-check helps you verify you’re on the right track.
Digging deeper
Even when I started building the data product, this turned out not to be the complete story. I found that I got about 50,000 rows of marked transactions from the last three years. But by far most of them were for something described as Bonus ABC. When I asked about that, I learned that these actually weren’t marked transactions at all. They were transactions that didn’t fit in the regular transaction table because they couldn’t be linked to an invoice. Storing them in the marked transaction table was a quick fix, and it has been the de facto solution ever since.
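This kind of surprise is exactly what simple data profiling surfaces. As a minimal sketch (the table and column names here are hypothetical, not the real schema), a frequency count on a description column immediately shows one category dominating, which is the cue to go ask what it really represents:

```python
import pandas as pd

# Hypothetical sample of a marked_transactions table; the column names
# are illustrative only.
marked = pd.DataFrame({
    "transaction_id": range(6),
    "description": ["Bonus ABC", "Bonus ABC", "Bonus ABC",
                    "Bonus ABC", "Dispute", "Correction"],
})

# A relative frequency profile: if one value accounts for most rows,
# it deserves a conversation with the source system's maintainers.
profile = marked["description"].value_counts(normalize=True)
print(profile)
```

On the real data, the same one-liner made it obvious that Bonus ABC dwarfed everything else in the table.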
And even that was not the complete story. Because when I looked a bit closer, I saw that for every marked and rejected transaction there were two rows. It turned out that the first row was the initial registration and the second row was for the moment the transaction was credited.
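To model a fact table from such a two-rows-per-transaction layout, you can pivot the two events onto one row per transaction. The sketch below assumes an event_type column distinguishes the rows and that the credit offsets the registration; both are assumptions for illustration, not the real schema:

```python
import pandas as pd

# Hypothetical shape of the rejected/marked transaction data: two rows
# per transaction_id, one for the registration and one for the credit.
rows = pd.DataFrame({
    "transaction_id": [101, 101, 102, 102],
    "event_type": ["registered", "credited", "registered", "credited"],
    "amount": [250.0, -250.0, 80.0, -80.0],
})

# Pivot the two events onto a single row per transaction, so the
# registration and its later credit can be modelled as one fact.
wide = rows.pivot(index="transaction_id",
                  columns="event_type",
                  values="amount").reset_index()

# Sanity check: in this toy data the credit exactly offsets the
# registration, so the net per transaction should be zero.
wide["net"] = wide["registered"] + wide["credited"]
print(wide)
```

Whether you want one fact row or two depends on the reporting question; the point is that the grain of the source table is something you have to discover, not assume.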
Only then was I able to model the rejected and marked transactions. By now we have incorporated them into the annual account data and, happily, all the numbers add up. Which is no small feat for our team: a lot more data was involved.
My method for understanding source data
During this project I honed my method for understanding source data. I think you can summarize it as follows (not always in exactly this order):
- Discuss the request with the end user. What actions will the data (product) allow them to take? If separate, talk with the business subject matter expert to understand the matter more deeply.
- Write down how you think this matter works. Have it checked by the subject matter expert.
- Talk to the developers/maintainers of the source data. Where can this data be found?
- Explore the source data for this particular case. Use data profiling to get more insight into the data.
- When you understand the data more, try to follow specific scenarios in the data. Follow a rejected transaction. Does this sale end up in regular transactions table as well as rejected transactions table? What about a specific marked transaction? Try to follow the whole process through the data.
- When you’re confident enough you understand the data, build the data product.
- Have a team member check your work.
- Validate the results with a subject matter expert.
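The scenario-following step in particular is easy to automate. As a sketch (with toy tables and hypothetical column names, standing in for the real transaction and rejected_transactions tables), a simple set comparison answers "does a rejected transaction also show up in the regular transactions table?":

```python
import pandas as pd

# Toy versions of the assumed tables; real names and columns will differ.
transactions = pd.DataFrame({"transaction_id": [1, 2, 3]})
rejected = pd.DataFrame({"transaction_id": [3, 4]})

# Which rejected transactions also appear in the regular table, and
# which exist only as rejections? Both answers tell you something
# about how the source process actually works.
overlap = set(transactions["transaction_id"]) & set(rejected["transaction_id"])
only_rejected = set(rejected["transaction_id"]) - set(transactions["transaction_id"])

print(sorted(overlap))        # ids present in both tables
print(sorted(only_rejected))  # ids that exist only as rejections
```

Either outcome is informative: an overlap suggests rejections are copies of regular transactions, while rejection-only ids suggest a separate flow that never reaches the main table.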
Humans only?
I’m not saying this to argue that I can never be replaced by AI. But a lot of the work of a data engineer is talking to people, understanding their processes, and diving into the data. And even then you sometimes find that the data follows unwritten rules. I can see AI helping more with exploration in the near future, but it will be really hard for it to beat humans on a thorough understanding of the source data.
This blogpost was written by me, not AI. But I did ask a local gpt-oss-20b model to proofread this text for me and I have adjusted some things accordingly. The image was generated by ChatGPT.