I’ve been thinking of writing a blogpost about Apache Atlas. Over one and a half years I’ve gained unique experience with this product that I would like to share with the world.
But first we need to talk about metadata. That is one of the important uses of Apache Atlas. Meaningful metadata won’t get in there by accident. Maybe you are just starting your journey into metadata. I’m here to say that it’s going to take work. Not just by you, but by everyone in your organization who has a stake in data. So in this blogpost I will be talking more about the organizational side of metadata and not so much about the technical side.
What do I mean by metadata?
Metadata can mean many things. Search for it and you’ll find that there’s metadata used by companies to “get to know you better”, or in other words: for ad targeting. There is also metadata used by intelligence agencies to find out if you plan to do anything bad. But the metadata I’m talking about is the kind of information that you can use to find data in an organization.
Why do we need metadata?
Do any of these scenarios sound familiar to you?
- Your data scientists and data analysts have a hard time finding the data they want. Sometimes they just give up. And then two weeks later a co-worker from a different department says: “oh, but haven’t you checked system XYZ?” and they did not even know that system existed or how to get data from there.
- Your organization can’t keep pace with the competition in developing new applications and features. For some reason development always takes longer at your organization. Quite possibly this is because the data architecture has become very complex. No one knows exactly what data is where.
- One of your colleagues, Carl, is the go-to guy who combines domain knowledge with knowledge of the data systems. But he is constantly in long project meetings. When he can finally work on his tasks, his co-workers constantly interrupt him to ask where they can find certain data and what the data means. Carl never gets his user stories done on time. They say he has time management issues. They say he should say “No” more often. And when he does, his colleagues’ user stories come to a standstill. (Carl’s manager asks him to help them, but still thinks Carl has time management issues. He is sent to a training course for that.)
If only there was some kind of system where people could look things up instead of bothering Carl all the time. This is where metadata could help, especially if you have some kind of metadata search engine where data scientists, data analysts and Carl’s co-workers can look up information on data for themselves. Apache Atlas could be that search engine. But more on that in later blogposts.
I even think that metadata, done properly, can give organizations just the edge they need to develop faster. When data is visible, data scientists, analysts and developers can grab the data they need right away and start making data products.
And there are more reasons for metadata. There is the European GDPR law that came into effect last year. This law says, amongst other things, that customers whose records you keep in your databases can request to see these records and can also require you to delete them. And you have to comply with this within a month. The thing is: where are their records? Here metadata can help as a means of administration on your datasets, especially data lineage. Data lineage tells you how data “flows” through your data architecture. So when a customer asks you to remove their records, you know exactly where to delete them.
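To make the lineage idea a bit more concrete, here is a minimal Python sketch. It does not use Atlas at all: the dataset names and the `LINEAGE` mapping are made up for illustration. It treats lineage as a directed graph and walks it to list every dataset that directly or indirectly receives data from a given source, which is roughly the list of places you would have to check for a deletion request.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets derived from it.
LINEAGE = {
    "crm.customers": ["lake.customers_raw"],
    "lake.customers_raw": ["lake.customers_clean", "lake.marketing_export"],
    "lake.customers_clean": ["reports.churn"],
    "lake.marketing_export": [],
    "reports.churn": [],
}


def downstream(source, lineage):
    """Return all datasets reachable from `source`, in breadth-first order."""
    seen = {source}
    order = []
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:  # guard against cycles and duplicates
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream("crm.customers", LINEAGE)
print(affected)
```

In a real setup the graph would come from your metadata tool rather than a hand-written dictionary, and it is only as complete as the lineage that was actually recorded.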
Metadata – what needs to be in it?
How are we going to find that dataset we’re looking for? Probably we want to look up certain keywords and then find the associated datasets. It’s hard to say what metadata we have to put in so that users can find the data they want. We’re breaking new ground here.
Here are the metadata topics we came up with at a governmental organization. This was the metadata I added for datasets in the data lake:
Source is about where the data comes from.
- Did our organization produce it, or did we download or buy it from an external source?
- Who can we contact about this source? This might come in handy when an external party decides to change the format of the data, like field names or data types. We might need to change our software then.
- What do we have to consider when working with the dataset (law, copyright)? Is there a limitation on usage, for example that you’re not allowed to distribute it to third parties?
- Who is ultimately responsible for the data (in our organization) and who can we contact about it?
- A useful feature for users might be information on how they can get access to this dataset. In our case they needed to ask to be added to a specific Active Directory group.
This was metadata to help the data engineers.
- Description: I used a text field in Apache Atlas to add a short description of the dataset. Atlas had a maximum of 500 characters here, so there was a clear limit.
- Link to the data catalog: does your organization have a central data catalog (preferably with permanent URLs) with terms and definitions? That would be great to link to, especially with column data.
- Link to documents: if you have documentation about this dataset where users can find more (for example on a wiki or SharePoint), this is where you could put the link.
We also had a special tag in Atlas to indicate whether the data was sensitive.
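If it helps to see these topics side by side, here is how they could be written down as a single record per dataset. This is only a sketch in plain Python, not the Atlas type system or its API. All field names, the example values and the `matches` helper are my own inventions for illustration; the only number taken from the list is the 500-character limit on the description.

```python
from dataclasses import dataclass, field

# Limit taken from the description bullet above; adjust to your own tool.
MAX_DESCRIPTION_LENGTH = 500


@dataclass
class DatasetMetadata:
    """One record per dataset. All field names are illustrative."""

    name: str
    origin: str               # e.g. "internal", "downloaded", "bought"
    source_contact: str       # who to contact about the source
    usage_restrictions: str   # law, copyright, redistribution limits
    owner: str                # who is ultimately responsible internally
    access_instructions: str  # how a user can get access
    description: str          # short text, limited in length
    catalog_url: str = ""     # link to the central data catalog, if any
    document_urls: list = field(default_factory=list)  # wiki pages etc.
    sensitive: bool = False   # the "sensitive data" tag

    def __post_init__(self):
        # Reject descriptions that would not fit in the text field.
        if len(self.description) > MAX_DESCRIPTION_LENGTH:
            raise ValueError(
                f"description is {len(self.description)} characters, "
                f"maximum is {MAX_DESCRIPTION_LENGTH}"
            )

    def matches(self, keyword):
        """Case-insensitive keyword lookup in the name and description."""
        needle = keyword.lower()
        return needle in self.name.lower() or needle in self.description.lower()


# Made-up example record.
record = DatasetMetadata(
    name="customer_orders",
    origin="internal",
    source_contact="orders-team@example.org",
    usage_restrictions="Not to be distributed to third parties",
    owner="Head of Sales Operations",
    access_instructions="Ask to be added to the AD group 'dl-orders-read'",
    description="All customer orders, one row per order line.",
    sensitive=True,
)
print(record.matches("orders"))
```

Writing the topics down like this also makes it obvious which fields nobody can fill in yet, which is a useful conversation starter with the teams that own the data.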
How deep do we need to go?
What do you need to put metadata on? Whole datasets or individual columns/fields/attributes? Let’s say we want metadata for datasets of a port authority (I’m completely making this up). What would our metadata look like?
Metadata on a whole dataset might be: “This dataset contains all the radar data in our port up to x kilometers out to sea for seagoing ships and inland shipping”.
Metadata on a column might be: “the shipslength column contains the Length Over All of a ship, measured from bow to stern”. You can imagine how much technical detail you could store here. But where to get that?
Where to get the metadata?
Here we get to a crucial point. Metadata will not get there by magic. Software like Apache Atlas has interesting data lineage features that can help you tell where data comes from. But most other things you have to put in yourself. And don’t forget: it needs to be kept up to date.
So don’t assign one employee to come up with all the metadata, especially if they are not a subject matter expert. I’ve played that role and I can honestly say: it’s not the way to go. Instead, make all teams with a stake in the datasets jointly responsible for coming up with the metadata.
And for this you need a metadata system where teams can very easily add and search metadata. Is Atlas that metadata system? Let me show you how Atlas works in a future blogpost and you can decide for yourself.
Sometimes a table is sufficient
Let me share one more piece of experience with metadata. I put a lot of work into adding metadata to Apache Atlas for all the datasets. But after a while it was clear the solution wasn’t popular.
And then, a few days before I left this job, I was asked if I could come up with a list of datasets. I thought Atlas was that list. All metadata on our datasets was in there. In the end I created a simple table of datasets in Confluence. It listed dataset names, short descriptions, locations in our production environment and sizes. The feedback was very positive. Everyone said that was what they had been looking for all along.
You might want to check if a table of datasets is all your organization is looking for when they say they want metadata. Just to be sure.
If this blogpost has been worthwhile for you in some way, let me know in the comments. I would love to know.