My first experiences with Apache NiFi

There are a lot of data-related Apache products out there and it’s hard to keep up with all of them. Several of them stream or flow data (what’s the difference?): Kafka, Storm, Flink and NiFi, for example. Yes, all of these products have documentation, but to an outsider their descriptions sound like “enterprise scalable streaming solutions”. What does that tell you?

I followed a Crash Course on Apache NiFi at the DataWorks Summit in München last month and was quite impressed. At heart I’m a command line kind of guy, but this graphical interface is really slick, and it’s amazing how NiFi lets you find out where your data goes. I decided to organize a workshop for my colleagues at Open Circle Solutions.

With NiFi you can program where your data comes from, what to do with it and where to send it. Let’s say, for example, that some data comes in JSON format from IoT devices, mobile apps send you XML, you have server logs, and for some reason you also import Twitter data. These are your “data producers”, and you want that data to flow to different “data consumers”, like databases, Hadoop clusters or applications. With NiFi you describe how data flows from the producers, how it is converted and split, and how it is sent on its way to the consumers. Your “data routing”, as they call it.
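NiFi itself is configured through its graphical interface rather than code, but the producers/convert/consumers idea above can be sketched in plain Python. This is a hypothetical illustration of the concept, not NiFi code: each function plays the role of a processor, and records are routed to a consumer based on an attribute.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sketch (not NiFi code): each "processor" is a function, and
# records move through them like flow files moving through a NiFi flow.

def parse_json_producer(payload):
    """Producer: an IoT device sending JSON."""
    return json.loads(payload)

def parse_xml_producer(payload):
    """Producer: a mobile app sending XML."""
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}

def route_to_consumer(record):
    """Route a record to a 'consumer' based on one of its attributes."""
    if record.get("type") == "sensor":
        return "hadoop"    # e.g. raw sensor data goes to a Hadoop cluster
    return "database"      # everything else goes to a database

iot = parse_json_producer('{"type": "sensor", "value": 21.5}')
app = parse_xml_producer('<event><type>order</type><id>42</id></event>')

print(route_to_consumer(iot))  # sensor data → hadoop
print(route_to_consumer(app))  # app event → database
```

In NiFi you would build the same thing by wiring processors together on the canvas instead of writing functions.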

And here is what that looks like. The way this works is that you drag and drop “processors” to get, convert or send data.

The NiFi canvas

There are many types of processors. You can get files, read from Twitter or Kafka, run SQL on a database, compress or uncompress files, send data via SFTP or delete a file in Hadoop. There are so many processors that there’s even a search bar to quickly find them. Here I searched for “DB”.

The Add Processor window

Here is the finished result of a demo I did today. All the larger white rectangles in the image below are processors. The smaller rectangles between the processors, connected with lines, describe the connections. In the upper left my flow starts with a processor of the type “GetFile”, which I use to read a zip file. This connects to an UnpackContent processor that gets the XML files out of these zip files.
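What that GetFile → UnpackContent part of the flow does can be illustrated with a few lines of Python. This is a hypothetical sketch of the behaviour, not NiFi configuration: pick up a zip archive and unpack only the XML entries inside it.

```python
import io
import zipfile

# Hypothetical sketch (illustrative Python, not NiFi configuration) of what
# the GetFile → UnpackContent part of the flow does: take a zip file and
# unpack the XML entries inside it.

def unpack_xml_entries(zip_bytes):
    """Return {filename: content} for every .xml entry in a zip archive."""
    entries = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                entries[name] = archive.read(name).decode("utf-8")
    return entries

# Build a small zip in memory to stand in for the file GetFile would pick up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readings.xml", "<readings><r>1</r></readings>")
    archive.writestr("notes.txt", "not xml, skipped")

xml_files = unpack_xml_entries(buffer.getvalue())
print(sorted(xml_files))  # only the .xml entry comes through
```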

NiFi’s interface

To make sure I don’t overload the rest of my system, I created a ControlRate processor, which limits the number of files or bytes that are sent through. From there the blue parts read the XML, get the attributes I want and send them on for other processors to use.
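The idea behind ControlRate is ordinary rate limiting: let at most so many items through per time window and hold back the rest. Here is a hypothetical Python sketch of that idea (not NiFi’s actual implementation), with made-up parameter names `max_items` and `period`.

```python
import time
from collections import deque

# Hypothetical sketch of what a ControlRate processor does: allow at most
# `max_items` through per `period` seconds, delaying anything beyond that.
# (Illustrative Python, not NiFi's actual implementation.)

class ControlRate:
    def __init__(self, max_items, period=1.0):
        self.max_items = max_items
        self.period = period
        self.sent = deque()  # timestamps of recently forwarded items

    def send(self, item):
        now = time.monotonic()
        # Forget timestamps that have fallen outside the time window.
        while self.sent and now - self.sent[0] >= self.period:
            self.sent.popleft()
        if len(self.sent) >= self.max_items:
            # Window is full: wait until the oldest timestamp expires.
            time.sleep(self.period - (now - self.sent[0]))
            self.sent.popleft()
        self.sent.append(time.monotonic())
        return item  # in NiFi this would flow on to the next processor

limiter = ControlRate(max_items=2, period=0.1)
start = time.monotonic()
for i in range(4):
    limiter.send(i)
elapsed = time.monotonic() - start
print(f"4 items at 2 per 0.1s took ~{elapsed:.2f}s")
```

Four items at two per 0.1 seconds forces at least one wait, so the loop takes roughly a tenth of a second instead of finishing instantly.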

One of NiFi’s greatest strengths is its debug information. Suppose certain data is coming through, but it’s not handled correctly. You right-click on the processor you want to investigate and choose “Data provenance”.

Let’s all ignore the red squares in the processors that tell that they are stopped, okay?

Now you see all the recent “flow files” that have come through. You have a search option to look for certain data.
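The idea behind data provenance is simple even though NiFi’s implementation is not: every processor records an event per flow file, and the search box queries that event log. A hypothetical Python sketch of the concept (the event types and field names here are illustrative, not NiFi’s actual provenance repository):

```python
from dataclasses import dataclass

# Hypothetical sketch of the idea behind data provenance: every processor
# appends an event per flow file, and you can search those events later.
# (Illustrative Python, not NiFi's provenance repository.)

@dataclass
class ProvenanceEvent:
    flow_file_id: str
    processor: str
    event_type: str   # e.g. RECEIVE, FORK, SEND
    details: str = ""

events = []

def record(flow_file_id, processor, event_type, details=""):
    events.append(ProvenanceEvent(flow_file_id, processor, event_type, details))

def search(text):
    """Like the search box in the Data Provenance window."""
    return [e for e in events if text in e.details or text in e.processor]

record("ff-1", "GetFile", "RECEIVE", "picked up data.zip")
record("ff-1", "UnpackContent", "FORK", "unpacked readings.xml")
record("ff-2", "GetFile", "RECEIVE", "picked up other.zip")

hits = search("readings")
print([(e.flow_file_id, e.processor) for e in hits])
```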

Data Provenance

Now let’s say I want to know how that data came through. See those signs? When you click on one, you get to see where your data went. It’s called the “data lineage”.

The data lineage

And here I can right-click and get even more details, like how much time was spent, or what attributes the data had.

Here I saw the URL an earlier processor had formed, so I could test whether there was anything wrong with it. It turned out Google was telling me I had exceeded my daily request quota for Google’s Places API. That’s why the rest of my flow didn’t work correctly.

So much debug data…

For good security, the most important thing is to know where your data is. I think NiFi has that covered pretty well. Hey, it’s brought to you by the NSA! I don’t know yet about NiFi’s performance or separation of duties, but that’s for another blogpost.

You can try NiFi for yourself with a Hortonworks Data Platform sandbox (available as a VirtualBox, VMWare or Docker image) and the Apache NiFi crash course. In my next blogpost I will show you how to quickly get started with the crash course without falling into the same pitfalls I did.

About Marcel-Jan Krijgsman

In 2017 I made the leap to Big Data after 20 years of experience with Oracle databases. I followed courses on Hadoop, Big Data Analytics, Machine Learning and Python, MongoDB and Elasticsearch.
This entry was posted in Apache Products for Outsiders.
