At my last client I worked extensively with Ataccama, a data management product. It has a data catalog to store metadata on datasets, and it can run data quality checks. Microsoft has a data management product in Azure too: it's called Purview, and I've used it in a PoC project.
A very short intro to data management
There's more to data management than data catalogs and data quality, but I don't want to rewrite the DAMA-DMBOK. I'll keep this intro pretty short.
A data catalog holds metadata about your data. In it you can find what a table is for and what the data in a column means, so that you build data products on the right data and get the right kind of outcomes. Just as important: it tells you who to contact for more information about that data.
With data quality checks you "profile" a table to see what the data looks like: are there NULLs? How many distinct values are there? You can then apply data quality rules to make sure the data meets certain expectations. Rules like "there should always be a value in this column", "the value should be one of a fixed set of values", "the mail column should always be formatted as x@y.z". Some data management tools let you create very complex rules.
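To give an idea of what such checks boil down to, here is a hand-written SQL sketch (the table and column names are made up for illustration, this is not something Purview generates):

-- Profiling: how many rows, non-NULL values and distinct values does a column have?
SELECT COUNT(*)              AS total_rows,
       COUNT(email)          AS non_null_values,
       COUNT(DISTINCT email) AS distinct_values
FROM dbo.customers;

-- A simple rule: every mail address should roughly look like x@y.z (NULLs not counted here).
SELECT COUNT(*) AS failing_rows
FROM dbo.customers
WHERE email NOT LIKE '%_@_%._%';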
Purview
Purview got a pretty extensive redesign recently. Last November it got a new portal, and the licensing model also changed. Last year we paid about 450 euros per month; now it's "pay as you go". I think that's cheaper for most customers who are just starting with data management, but I don't have direct access to our cloud bills.
To start using Purview in Azure, you need to create a Purview account. I created one last year, but things have changed quite a bit since then, so it's better to follow Microsoft's instructions. Important to know: the Purview account is also a managed identity, and that managed identity has the same name as the Purview account.
There are a couple of things you need to know when working with Purview. Say you want to use it as a data catalog and for data quality on a simple SQL database. These are, in short, the steps you need to take to get all the way to data quality profiling:
- Create a “collection” (under Data Map, Domains).
- Add a data source (connect to your database).
- Do a scan on the data source (to find tables and views).
- Create a governance domain (under Unified Catalog).
- Create a data product and add relevant tables and views to that.
- Profile your data (data quality).
My PoC project
Purview has extensive capabilities, but let’s start simple. These are the things I want to achieve:
- Allow Purview access to my Azure SQL database.
- Scan for tables and views.
- Add metadata on these tables and views in the data catalog.
- Do data quality checks.
To test the data quality checks I created a table with deliberately bad data quality: all columns are strings, the date column doesn't stick to one format, the status column should have only three possible values ("pending", "completed" or "failed"), and the amount column should of course contain numeric values.
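For reference, the table looks roughly like this (the schema and sample rows below are a reconstruction for illustration, not a copy of my actual script):

-- Everything except transaction_id is stored as a string, on purpose.
CREATE TABLE dbo.transactions (
    transaction_id INT IDENTITY(1,1) PRIMARY KEY,
    customer_id    NVARCHAR(255),
    amount         NVARCHAR(255),
    [date]         NVARCHAR(255),
    [status]       NVARCHAR(255)
);

-- A few rows with the kind of problems I want Purview to catch.
INSERT INTO dbo.transactions (customer_id, amount, [date], [status]) VALUES
    ('1001', '19.99',    '2024-03-01',   'completed'),
    ('1002', 'nineteen', '01/03/2024',   'pending'),
    ('1003', '25,50',    '1 March 2024', NULL),
    ('1004', '42.00',    '2024-03-02',   'unknown');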

Database access
I would love to tell you that I have tested Purview on all Azure data storage solutions known to man, but in fact getting access to a simple SQL database was already a challenge. I chose to use a managed identity. There are two ways to do that:
- Create your own managed identity. Allow Purview to use it and give it access to the SQL database.
- Use the managed identity of your Purview account. Give it access to the SQL database.
Long story short: I had the most success with the managed identity of the Purview account. Using my own managed identity did work for a data source scan, but for data quality scans it would not appear as an option. I asked about this on the Microsoft Learn community site and ended up finding the solution myself.
To give Purview access, create a user based on a managed identity and give it access:
-- The user name must match the name of the Purview account (its managed identity).
CREATE USER [mypurviewaccount] FROM EXTERNAL PROVIDER;
EXEC sp_addrolemember 'db_datareader', [mypurviewaccount];
EXEC sp_addrolemember 'db_datawriter', [mypurviewaccount];
With this access you should be able to do both scans on the data source and data quality profiling.
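If you want to double-check that the user ended up in the right roles, the database principals will tell you (a quick sketch, assuming the user name used above):

-- Verify the external user exists and list its role memberships.
SELECT dp.name AS user_name, dp.type_desc, r.name AS role_name
FROM sys.database_principals dp
LEFT JOIN sys.database_role_members rm ON rm.member_principal_id = dp.principal_id
LEFT JOIN sys.database_principals r ON r.principal_id = rm.role_principal_id
WHERE dp.name = 'mypurviewaccount';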
Data catalog
Once you've created the collection and the data source and done the data source scan, the data catalog is pretty straightforward to use. You can add descriptions to tables and columns.

And you can add data experts and owners, which is useful if anyone has questions.

In the Schema you can add classifications, for example for Personally Identifiable Information (PII).

There is also functionality for lineage, but I didn't get lineage discovery (a preview feature) working with my managed identity.
Getting data quality profiling working
You'd think that with the data scan done, you're only a small step away from a data quality profiling run. Not quite. For data quality profiling runs you need a data product, and to have a data product, you need a governance domain. You can do all of that under the Unified Catalog. I'm not going into detail here on how to do that; it's not the hard part.
Unfortunately you need to define a new connection for data quality profiling (though it can reuse the data source settings). And every time I haven't used Purview for a while, I forget where in the Purview menus it lives, so let me document it here. Under Unified Catalog, go to Health management > Data quality, open the governance domain, then Manage and Connections:

Here I also used the Purview managed identity mentioned above. The other managed identity I created didn’t appear as an option.
Data quality: profiling
With all that settled we should be able to do a data quality profiling run. Remember that table with poor data quality? Let’s see what Purview makes of it. This is the profile page for the table after the profile data run:

Except for transaction_id all columns are nvarchar(255). But Purview detects that amount and customer_id are numbers and date is Date/Time. So far so good.
Let's look at the date column:

So what do we see here? Dates formatted as yyyy-mm-dd. No null values. 7940 rows (out of 10,000 rows!). Now let's look at this table in DBeaver:

I see many, many different date formats. This was not worth mentioning?
Same with the amount column.

You'd think that the non-numeric values would be worth a mention, a warning of some sort. But the only way I can deduce that there could be some kind of problem is that the total number of rows in this column is not the total number of rows in the table.
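For comparison, a query along these lines makes the problem visible right away (a sketch against the table described earlier; TRY_CONVERT returns NULL when a value can't be converted):

-- Sample rows where amount is not numeric or date is not in yyyy-mm-dd format.
SELECT TOP (20) transaction_id, amount, [date]
FROM dbo.transactions
WHERE TRY_CONVERT(decimal(18,2), amount) IS NULL
   OR [date] NOT LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]';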
And let’s look at the status column.

Here Purview has found empty values. You get a good view of the distribution of the 4 occurring values.
Data quality: rules
We clearly need to set some rules here (a plain-SQL version of these checks follows the list). Like:
- The status column should not be NULL/empty.
- The status column should only have values from this list: "completed", "pending" and "failed".
- The date column should adhere to the yyyy-mm-dd format.
- The amount column should be numeric.
- The customer_id column should be an integer (ideally it would be a customer_id that exists in a customer table somewhere).
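Before building these in Purview, this is how I would express them as plain SQL checks, so there is something to compare the rule results against (a sketch against the table described earlier; every count should ideally be 0):

-- Failure count per rule.
SELECT 'status is NULL or blank' AS rule_name, COUNT(*) AS failing_rows
FROM dbo.transactions
WHERE [status] IS NULL OR LTRIM(RTRIM([status])) = ''
UNION ALL
SELECT 'status outside allowed list', COUNT(*)
FROM dbo.transactions
WHERE [status] NOT IN ('completed', 'pending', 'failed')
UNION ALL
SELECT 'date not in yyyy-mm-dd format', COUNT(*)
FROM dbo.transactions
WHERE [date] NOT LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]'
UNION ALL
SELECT 'amount not numeric', COUNT(*)
FROM dbo.transactions
WHERE TRY_CONVERT(decimal(18,2), amount) IS NULL
UNION ALL
SELECT 'customer_id not an integer', COUNT(*)
FROM dbo.transactions
WHERE TRY_CONVERT(int, customer_id) IS NULL;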
Let’s start with a rule that the status column should have no empty or blank values. For this there is an existing rule type.

I pick the status column, add a description of the rule and click create. Then I do a quality scan. Here is the result:

Wait! What!? A score of 100! This is my data:

There are definitely empty values. Does this rule only check for empty strings and not NULL values? If so, why isn't there a NULL check rule I can choose? It's a problem many people run into. (Later I created a custom rule with the notNull expression. Again I got a score of 100. There's definitely a bug in that expression.)
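If you want to see which kind of "empty" you're dealing with, keep in mind that NULL, an empty string and a whitespace-only string are three different things. A quick breakdown makes that explicit (a sketch against my table):

-- Break the 'empty' values down: NULL, empty string, or whitespace only.
SELECT
    SUM(CASE WHEN [status] IS NULL THEN 1 ELSE 0 END)                                         AS null_values,
    SUM(CASE WHEN DATALENGTH([status]) = 0 THEN 1 ELSE 0 END)                                 AS empty_strings,
    SUM(CASE WHEN DATALENGTH([status]) > 0 AND LTRIM(RTRIM([status])) = '' THEN 1 ELSE 0 END) AS whitespace_only
FROM dbo.transactions;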

For the date column I went with a string format match, with a regular expression that checks for the yyyy-mm-dd format:

That is a good result. Not the data quality of this column itself, but the fact that the bad quality was detected.
Let’s make a rule that the amount column is numeric. Data type match seems to be a logical choice.

But I can’t choose the desired data type.

And the result is a quality score of 100. What?!
It's starting to look a lot like I will be choosing Custom rules most of the time:

There’s a lot to choose from here. But at least I can pick a very specific rule that does what I want:

I ran a new quality scan and it found that 1004 rows failed. So the rule works.

In the end you will get a total data quality score for your table. Here are the results per rule:

Here you can see that the total quality is medium.

When I applied all rules I wanted (as described above) the total score was 81.8%, which was marked as quality score “High”.
Combined with the data quality scans for the whole HR governance domain, the total score is 90.9, or "High". I guess we're done here.

Let's be fair: when you average out the data quality scores of all your tables, you're likely to get high marks in any data management tool. It might fool your management, but not the end users.
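A made-up example: nine clean tables scoring 95% and one broken table scoring 50% still average out to 90.5%, comfortably "High".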
Conclusion
Purview has enough to get started with data management on your Azure cloud. Connecting Purview to my SQL database was harder than I expected, and not everything is documented.
The data catalog is as you would expect. Other data management products can also detect phone numbers, mail addresses and other PII data automatically; Purview is definitely not there yet.
The results of the data quality profiling are confusing. You'd expect all alarm bells to go off for a table of such poor quality. A data quality profiling run on its own is not very useful; you'll have to create rules. The result of the empty/blank fields rule was just wrong. And I can see myself creating custom rule after custom rule to get the real data quality checks that I need.