There are a lot of data-related Apache products out there and it’s hard to keep up with all of them. Several of them stream or flow data (what’s the difference?): Kafka, Storm, Flink and NiFi, for example. Yes, all of these products have documentation, but to an outsider their descriptions all sound like “enterprise scalable streaming solutions”. What does that tell you?
I followed a crash course on Apache NiFi at the DataWorks Summit in München last month and was quite impressed. At heart I’m a command-line kind of guy, but this graphical interface is really slick, and it’s amazing how NiFi lets you find out where your data goes. I decided to organize a workshop for my colleagues at Open Circle Solutions.
With NiFi you can program where your data comes from, what to do with it and where to send it. Say, for example, some data comes in as JSON from IoT devices, mobile apps send you XML, you have server logs, and for some reason you also import Twitter data. These are your “data producers”, and you want that data to flow to different “data consumers”: databases, Hadoop clusters or applications. With NiFi you describe how data flows from the producers, convert it, split it and send it on its way to the consumers. Your data routing, as they call it.
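NiFi itself is configured graphically, not in code, but to make the routing idea concrete, here is a rough, hypothetical Python sketch of what such a flow does conceptually. All the names (`from_json`, `from_xml`, `route`, the consumer labels) are invented for illustration:

```python
import json
import xml.etree.ElementTree as ET

def from_json(payload):
    # Parse a JSON message (say, from an IoT device) into a plain dict
    return json.loads(payload)

def from_xml(payload):
    # Parse an XML message (say, from a mobile app) into the same dict shape
    root = ET.fromstring(payload)
    return {child.tag: child.text for child in root}

def route(record):
    # Decide which "data consumer" should receive this record
    if record.get("type") == "sensor":
        return "hadoop"
    return "database"

messages = [
    ("json", '{"type": "sensor", "value": "42"}'),
    ("xml", "<msg><type>order</type><id>7</id></msg>"),
]

for fmt, payload in messages:
    record = from_json(payload) if fmt == "json" else from_xml(payload)
    print(route(record), record)
```

In NiFi you would build the same pipeline by wiring processors together on the canvas instead of writing functions.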
And here is what that looks like. The way it works is that you drag and drop “processors” to get, convert or send data.
There are many types of processors. You can get files, read from Twitter or Kafka, run SQL on a database, compress or uncompress files, send data via SFTP or delete a file in Hadoop. There are so many processors that there is a search bar to quickly find them. Here I searched for “DB”.
Here is the finished result of a demo I did today. All the larger white rectangles in the image below are processors. The smaller rectangles on the lines between the processors describe the connections. In the upper left, my flow starts with a processor of type “GetFile”, which I use to read a zip file. It connects to an UnpackContent processor that extracts the XML files from the zip.
To make sure I don’t overload the rest of my system, I added a ControlRate processor, which limits the number of files or bytes that are sent through. From there, the blue parts read the XML, pull out the attributes I want and pass them on for other processors to use.
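To see what those first three processors do, here is a toy Python analogy of the GetFile → UnpackContent → ControlRate chain. This is not how NiFi is used (you never write this code); the function names just mirror the processor names, and the throttling logic is a simplified stand-in for what ControlRate configures:

```python
import io
import time
import zipfile

def unpack_content(zip_bytes):
    # UnpackContent: yield (name, data) for each XML file inside the zip
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                yield name, zf.read(name)

def control_rate(items, per_second=2):
    # ControlRate: pass at most `per_second` items per second downstream
    for i, item in enumerate(items):
        if i and i % per_second == 0:
            time.sleep(1)
        yield item

# Build a small zip in memory, standing in for the file GetFile would pick up
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.xml", "<doc>1</doc>")
    zf.writestr("b.xml", "<doc>2</doc>")
    zf.writestr("notes.txt", "ignored")

for name, data in control_rate(unpack_content(buf.getvalue()), per_second=5):
    print(name, data.decode())
```

The real processors are far more capable (back pressure, scheduling, failure relationships), but the shape of the flow is the same.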
One of NiFi’s greatest strengths is its debugging information. Suppose certain data is coming through but is not handled correctly. You right-click on the processor you want to investigate and choose “Data provenance”.
Now you see all the recent “flow files” that have come through, with a search option to look for specific data.
Now let’s say I want to know how that data came through. See those icons? When you click on one, you see where your data went. This is called the “data lineage”.
And here I can right-click to get even more details, such as how much time was spent, or what attributes the data had.
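Conceptually, provenance is an event log: every time a flow file passes a processor, an event is recorded, and the lineage view replays that log for one file. A toy version in Python (the event names and structure here are invented for illustration, not NiFi’s actual storage format) might look like:

```python
from datetime import datetime, timezone

provenance = []  # one event per processor a flow file passes through

def record_event(flow_file_id, processor, event_type):
    # Record a provenance event, similar in spirit to what NiFi stores
    provenance.append({
        "flow_file": flow_file_id,
        "processor": processor,
        "event": event_type,
        "time": datetime.now(timezone.utc).isoformat(),
    })

def lineage(flow_file_id):
    # Reconstruct the path ("data lineage") of a single flow file
    return [e["processor"] for e in provenance if e["flow_file"] == flow_file_id]

# Simulate one flow file moving through three processors
record_event("ff-1", "GetFile", "RECEIVE")
record_event("ff-1", "UnpackContent", "FORK")
record_event("ff-1", "ControlRate", "SEND")

print(lineage("ff-1"))
```

NiFi records much richer events than this (content snapshots, attributes, timings), which is exactly what makes the lineage graph so useful when debugging.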
Here I saw the URL an earlier processor had built, so I could check whether anything was wrong with it. It turned out, as Google told me, that I had exceeded my daily request quota for Google’s Places API. That’s why the rest of my flow wasn’t working correctly.
For good security, the most important thing is to know where your data is, and I think NiFi has that covered pretty well. Hey, it was brought to you by the NSA! I don’t know yet about NiFi’s performance or separation of duties, but that’s for another blog post.
You can try NiFi for yourself with a Hortonworks Data Platform sandbox (available as a VirtualBox, VMware or Docker image) and the Apache NiFi crash course. In my next blog post I will show you how to quickly start the crash course without falling into the same pitfalls I did.