Lately more and more organizations are doing data management. Suddenly there are data owners, data stewards and metadata repositories (in whatever form) everywhere. We all seem to do this mainly because we have to: because of the GDPR or the California Consumer Privacy Act (CCPA), or because other institutions demand that we can explain where our data comes from.
But in my opinion there is one important reason that is mostly overlooked. One that has an important positive impact on business results, yet doesn’t seem to end up in the KPIs. And that is how much time it takes to find the right data when building data products.
Think here about the time it takes for a data engineer to build a pipeline, or for a data scientist to build a model. They need data for that. And looking for that data usually starts with asking around (if they are even aware of that data’s existence). Usually that data engineer or data scientist has to ask a colleague, keeping that colleague from her or his own work. Sometimes that takes only 15 minutes, but sometimes it takes weeks to explain all the definitions and limitations of the necessary fields or attributes. Some colleagues rarely get to their own work, because there are so many data specialists who need to ask them questions.
If that data engineer or data scientist doesn’t find the right data, they may end up building on the wrong data. It might become clear only after many months that a data product was built on quicksand. “That table you used was created specially for department X, and it doesn’t take Y or Z into account”, is what you then get to hear. So the search for data starts all over again.
Benefits under the radar
A lot of this lost time stays under the radar. But not at every company. There are organizations that have KPIs to keep track of how quickly new employees (of data teams) are able to work at full productivity. If it generally takes too long for new employees to be fully productive, they say, something is amiss.
Now some people might say “surely you want new employees to be productive as quickly as possible?” You might think so, but apparently not everyone does. I’ve worked at an organization where they thought it better for new employees to keep looking for stuff themselves. The philosophy being that they would get to know the organization bit by bit that way. But this searching (for data) isn’t done after a few months. It might go on for years.
Metadata
So how do data-driven organizations make sure that employees find data quickly and that new team members are productive early? For this you need easily available information on the data of the organization: metadata. In many cases this metadata is still kept in Excel sheets. But there is a new kind of product on the rise: data discovery tools.
I myself was able to experience one of those tools: Apache Atlas. In Atlas you can tag data with metadata. This is very useful. You can find out who the data owner of a dataset is, what a field or column means, maybe even follow links to the data catalog. Atlas can even keep data lineage… as long as it happens in the Hadoop data lake. And that is an important downside of Atlas: it mainly keeps metadata on the data lake only, while almost no organization stores all of its data in a data lake.
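To make the idea of tagging datasets with metadata a bit more concrete, here is a minimal sketch in Python. It is not the Atlas API; the class `DatasetMetadata`, the in-memory `catalog` and the example dataset names are all hypothetical, chosen only to show what kind of information (owner, column meanings, tags) such a tool keeps and how you might search it.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record for one dataset (not an Atlas type)."""
    name: str
    owner: str
    columns: dict[str, str] = field(default_factory=dict)  # column -> meaning
    tags: set[str] = field(default_factory=set)


# A toy in-memory catalog, keyed by dataset name.
catalog: dict[str, DatasetMetadata] = {}


def register(entry: DatasetMetadata) -> None:
    """Add or overwrite the metadata for a dataset."""
    catalog[entry.name] = entry


def find_by_tag(tag: str) -> list[str]:
    """Return the names of all datasets carrying the given tag."""
    return sorted(name for name, entry in catalog.items() if tag in entry.tags)


register(DatasetMetadata(
    name="sales.customers",
    owner="jane.doe@example.com",
    columns={"customer_id": "Internal customer key", "email": "Contact address"},
    tags={"pii", "sales"},
))
register(DatasetMetadata(
    name="sales.orders",
    owner="john.smith@example.com",
    columns={"order_id": "Order key", "amount": "Order total in EUR"},
    tags={"sales"},
))

pii_datasets = find_by_tag("pii")
customers_owner = catalog["sales.customers"].owner
```

With something like this in place, the question “who owns this table and what does this column mean?” becomes a lookup instead of a walk to a colleague’s desk.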
New data discovery tools
I remember looking for a product that was able to do more. But unfortunately I had to conclude that such a product simply was not available. This is now changing. Several new data discovery tools have recently been released. I was able to have a look at Amundsen from Lyft (a ride-sharing company). Amundsen can index RDBMSs, NoSQL databases and several data lake components, on premise and in the cloud. When Amundsen finds tables, files or indexes, it also shows a summary of the number of rows, distinct values and NULL values, and it shows a couple of rows of sample data (when you have access, I presume).
And it isn’t only a technical thing. You can add data owners to files or tables, with their contact info. If something about the metadata is unclear or incorrect, you can open a Jira ticket. (So initially that means work for that colleague we always keep asking questions about data, but soon enough that should be needed far less often.)
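The kind of summary described above (row count, distinct values, NULL values and a few sample rows) can be sketched in a few lines of Python. This is only an illustration of the idea, not how Amundsen computes it; the function `profile_table` and the example `rides` data are made up for this sketch.

```python
from typing import Any


def profile_table(rows: list[dict[str, Any]], sample_size: int = 3) -> dict[str, Any]:
    """Summarize a table given as a list of row dicts (hypothetical helper)."""
    # Collect every column name that occurs in at least one row.
    columns = sorted({name for row in rows for name in row})
    column_stats = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        non_null = [v for v in values if v is not None]
        column_stats[name] = {
            "null_count": len(values) - len(non_null),
            "distinct_count": len(set(non_null)),  # NULLs are not counted
        }
    return {
        "row_count": len(rows),
        "columns": column_stats,
        "sample_rows": rows[:sample_size],
    }


# Made-up example data.
rides = [
    {"ride_id": 1, "city": "Amsterdam", "tip": 2.5},
    {"ride_id": 2, "city": "Utrecht", "tip": None},
    {"ride_id": 3, "city": "Amsterdam", "tip": None},
    {"ride_id": 4, "city": None, "tip": 1.0},
]

summary = profile_table(rides, sample_size=2)
```

Even a summary this simple tells a data engineer a lot before the first question is asked: whether a column is mostly empty, whether an ID is really unique, and what the values roughly look like.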
Amundsen isn’t the only data discovery tool. LinkedIn open-sourced DataHub last year. Another project I’ve found is OpenLineage, an open standard for metadata and data lineage. It all still feels a bit new, but if this is the way we are going, it looks great.
The future?
In a few years, it will become the normal thing for new employees to be given a laptop, an account and a link to the data discovery platform when they arrive. Which means new employees will quickly be up and running with their teams. The organizations they work in will have that important advantage over their competitors, because they can build data products much faster.
I can’t wait to encounter data discovery tools everywhere. In our Certified Data Engineering Professional course we discuss them in the modules about data management and data lakes.