Things I learned about Azure Data Fabric

Currently I’m helping colleagues to read open data in Azure Data Fabric. Here are some of my experiences with it.

I don’t want to give an extensive description of what Data Fabric is. In short, if you have an organisational Azure account, you can enable Data Fabric. You can then create Fabric workspaces, and within workspaces you can create lakehouses for storage, and pipelines and notebooks for automation. Lakehouses are like data lakes that act a bit like data warehouses, with schemas, tables, etc. So you can store unstructured data in them as files, but also have tables you can read with SQL.
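For example, from a notebook attached to a lakehouse you can reach both sides. A minimal sketch (the table and file names are made up; the `spark` session is pre-created in Fabric notebooks):

```python
# Minimal sketch from a Fabric notebook attached to a lakehouse.
# "open_data" and "Files/raw/population.csv" are hypothetical names.

# The table side: managed Delta tables you can query with SQL.
df = spark.sql("SELECT * FROM open_data LIMIT 10")
df.show()

# The file side: plain files stored in the same lakehouse.
raw = spark.read.csv("Files/raw/population.csv", header=True)
print(raw.count())
```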

Notebooks are quite nice

You can automate stuff by calling notebooks with Python. And the nice thing about that is that you can add text formatted with Markdown. So you can add nicely formatted documentation about what the steps in your notebook are doing.

So it’s code, but it’s also documentation.
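For the automation part, one notebook can call another and pass parameters to it. A hedged sketch (the notebook name and parameter are hypothetical; `notebookutils` is the utility module built into Fabric notebooks, with `mssparkutils` as the older alias):

```python
# Run another notebook from this one: name, timeout in seconds, parameters.
# "Load_OpenData" and the "dataset" parameter are made-up examples.
result = notebookutils.notebook.run("Load_OpenData", 600, {"dataset": "population"})

# The exit value the called notebook returned via notebookutils.notebook.exit(...)
print(result)
```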

Lakehouses need time to start

One of the features data lakehouses bring is the separation of compute and storage. Storage is cheap and can stay online. But compute (CPU, GPU) is expensive, and you don’t want to be billed for it longer than necessary. So compute clusters are automatically deactivated after a while.

That’s great, but when you connect again, they take time to start up. If you create many different lakehouses for different purposes in your Fabric workspace, and you have pipelines that access many of them, each one of them needs to start up. And this can take minutes.

So better to go with one big lakehouse with many schemas for different purposes, I think. Though I’ve also been working with several notebooks on the same lakehouse, and then I got a message that I could only connect one notebook to a lakehouse at a time.
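A rough sketch of that one-lakehouse-many-schemas setup (assuming schema support is enabled on the lakehouse; the schema, folder and table names are invented):

```python
# One lakehouse, separate schemas per purpose instead of separate lakehouses.
# "sales", "hr" and "Files/landing/orders.csv" are hypothetical names.
spark.sql("CREATE SCHEMA IF NOT EXISTS sales")
spark.sql("CREATE SCHEMA IF NOT EXISTS hr")

orders = spark.read.csv("Files/landing/orders.csv", header=True)
orders.write.mode("overwrite").saveAsTable("sales.orders")  # schema-qualified table
```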

Where are my files in a lakehouse?

When you’re not familiar with lakehouses yet and you have created files to be stored in the lakehouse, you sometimes wonder: where are they?

Know that the lakehouse explorer has two modes, and the default one (I think) is the SQL analytics endpoint. There you won’t see any files. Switch to the Lakehouse view and you’ll see the file side of things.
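From a notebook you can also just list the file side directly; a small sketch (`notebookutils` is built into Fabric notebooks):

```python
# List what is actually on the file side of the attached lakehouse.
# "Files/" is the file area; the managed Delta tables live under "Tables/".
for item in notebookutils.fs.ls("Files/"):
    print(item.name, item.size)
```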

Fabric has Data Factory, but with less debugging

I’m not exactly a fan of low-code solutions. I’ve been working with some. Debugging always seems to be lacking, for one thing. Azure Data Factory does actually have debug features. But the variant of Data Factory in Data Fabric doesn’t have them. So when you don’t understand what data is going through the pipeline, your only hope is writing the intermediate data to files or tables.

For example, I have this Lookup action that reads a table with dataset names. And in the Copy data1_copy1 action it is supposed to do an OData call and store the results that come back. The Lookup1 action does have a preview data option. But there’s no clear way to see what the pipeline is doing in subsequent actions, or what data went through it. So I’m still quite in the dark about why this pipeline is writing the results of each OData call to all the lakehouse files.

You can also use a Get Metadata action in between. But chances are that if you wanted to use Get Metadata to investigate a failing action that comes after it, Get Metadata now becomes the failing action.

Meanwhile, notebooks do have all the logging and debugging data you could want. So if it were up to me, I wouldn’t go with Data Factory pipelines.
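For comparison, here is roughly what that Lookup-plus-Copy flow could look like in a notebook, where every intermediate step can simply be printed and inspected. This is a sketch only: the config table, OData URL and output paths are all hypothetical.

```python
import requests

# Lookup step: read the table with dataset names (hypothetical table name).
datasets = spark.sql("SELECT dataset_name FROM config_datasets").collect()

for row in datasets:
    name = row["dataset_name"]
    url = f"https://example.org/odata/{name}"  # hypothetical OData endpoint
    print(f"Fetching {name} from {url}")       # visible in the notebook output

    response = requests.get(url, timeout=60)
    response.raise_for_status()
    records = response.json().get("value", [])
    print(f"  received {len(records)} records")

    # Copy step: one output folder per dataset, so results don't overwrite each other.
    if records:
        spark.createDataFrame(records).write.mode("overwrite").json(f"Files/odata/{name}")
```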

Where are my connections?

In Data Factory I already find pipelines within pipelines within pipelines hard to maintain. But then there are connections. Once you have created a connection (to an OData source, another workspace or a pipeline), it is really difficult to find it again.

Someone asked how to find a connection when you have a connection ID, and Microsoft’s answer basically was: “Oh, that’s a good idea. We should build an interface for that some day”.
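The closest workaround I’m aware of is the Fabric REST API, which, if I read the documentation correctly, has an endpoint for listing connections. A rough sketch under that assumption (you need a bearer token for the Fabric API, and the field names may differ):

```python
import requests

# Assumption: GET /v1/connections lists the connections you have access to.
token = "<your-access-token>"  # e.g. obtained with the azure-identity package
response = requests.get(
    "https://api.fabric.microsoft.com/v1/connections",
    headers={"Authorization": f"Bearer {token}"},
    timeout=60,
)
response.raise_for_status()

for conn in response.json().get("value", []):
    print(conn.get("id"), conn.get("displayName"))
```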
