With the ever-increasing buzz around the Internet of Things (IoT), which is now slowly moving towards the Internet of Anything (IoAT), we need an ideal solution to move data between various processing platforms. The answer to this challenge is provided by Apache NiFi.
The bright and shiny new data orchestration tool provided by Hortonworks – Hortonworks DataFlow (HDF) – is powered by Apache NiFi.
What is Apache NiFi
In a nutshell – Apache NiFi is a powerful, reliable, and easy-to-use tool to process and distribute data.
Originally designed by the NSA – yes, the NSA! – it was developed and used internally under the name Niagara Files. After around 8 years of internal use, the technology was open sourced through the Apache Software Foundation via the NSA Technology Transfer Program. Note that this isn’t the first time the NSA has contributed to the open source community; a previous contribution was Accumulo.
What is Hortonworks DataFlow (HDF)
Hortonworks DataFlow (HDF), powered by Apache NiFi, is the first integrated platform that solves the real-time complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.
Hortonworks DataFlow provides a perfect complement to Hortonworks Data Platform (HDP). Running HDF on top of HDP opens up multiple opportunities to design flows that grab and process many different forms of data.
Common Use Cases for Hortonworks DataFlow (HDF) or Apache NiFi
Here are some of the use cases HDF or Apache NiFi is commonly used for: ingesting and aggregating log files, moving data between clusters and data centers, and collecting data from sensors and IoT devices. The use cases aren’t limited to these, however: there is an ever-increasing number of “processors” (we will take a look at what this really means later in the article), which keeps extending its reach and potential use cases.
The purpose of this article is to provide a brief introduction to designing a DataFlow using HDF running on top of HDP.
For the purpose of this example, I have used a standalone Hortonworks Data Platform installation, version 2.3, with Hortonworks DataFlow NiFi version 1.1.1.
Demo – Setup overview
Let’s dig into the demo.
Thankfully, Ambari provides the option of adding custom services, and I have the NiFi service installed through Ambari.
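For reference, NiFi was not part of the stock HDP stack at the time, so it appears in Ambari through a custom service definition. Here is a sketch of the common way to set that up, assuming a community-contributed service definition and a typical Ambari layout (the repository URL and stack path are assumptions; adjust them to your Ambari stack version):

# Clone a community-contributed NiFi service definition into the Ambari stack
# (assumed repository and path; adjust to your environment and stack version)
git clone https://github.com/abajwa-hw/ambari-nifi-service.git \
    /var/lib/ambari-server/resources/stacks/HDP/2.3/services/NIFI

# Restart Ambari so it picks up the new service, then use “Add Service” in the UI
service ambari-server restart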
Here’s how my current cluster looks.
NiFi is installed under the /opt directory, and the service is listening on port 9090. You can access the NiFi UI using a link like the one below:
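NiFi serves its web UI at the /nifi context path by default, so with the port above the link takes this form:

http://hdp-ambari-1.gagan.com:9090/nifi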
In my case, since it’s a standalone box, I have the same hostname for all services (hdp-ambari-1.gagan.com).
Here’s how the interface looks to begin with.
Demo – Designing Simple DataFlow
The UI looks pretty neat: a blank canvas on which you can start designing your data flow.
We will be designing a DataFlow which pulls data from a local directory (/source_directory) and stores it in a directory (/destination_directory) inside HDFS on the HDP cluster.
The important concept to understand here is that each stage or process in a data flow is called a “Processor” in NiFi. In this simple DataFlow we have two processors (see the configuration sketch after this list):
1. First to get the file from the filesystem;
2. Second to put the file in HDFS.
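These two stages map naturally to NiFi’s standard GetFile and PutHDFS processors. Here is a minimal configuration sketch; the two directories come from this demo, while the Hadoop configuration file locations are assumptions for a typical HDP install:

GetFile
    Input Directory                = /source_directory
    Keep Source File               = false   (remove files once they are picked up)

PutHDFS
    Hadoop Configuration Resources = /etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml
    Directory                      = /destination_directory
    Conflict Resolution Strategy   = replace

Connect GetFile’s “success” relationship to PutHDFS, and auto-terminate (or route) PutHDFS’s “success” and “failure” relationships so the flow is valid before you start it.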
At the time of writing this article there are 111 processors, covering a wide variety of processing, ingestion, and streaming tasks. This number is definitely going to increase in the future.
Let’s get started!
Demo – Test DataFlow
Now it’s time to test the DataFlow.
I will be copying some log files to /source_directory, which will be transferred over to /destination_directory inside HDFS.
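If you want to follow along, a couple of shell commands like these are enough (the sample log file path is illustrative; only the two directories come from the demo):

# Drop a few sample log files into the watched local directory
cp /var/log/messages* /source_directory/

# A few seconds later, confirm they have landed in HDFS
hdfs dfs -ls /destination_directory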
You can see in the screenshot above that a few files copied to /source_directory were transferred successfully to /destination_directory inside HDFS.
At the same time, you can see real-time statistics about the transfer and the DataFlow in the NiFi interface.
Once you are done with the DataFlow, simply select all the processors and connections and click the “stop” icon at the top to stop the processing.
Using Hortonworks DataFlow (HDF), powered by Apache NiFi, we can design some really complex DataFlows. I have designed one such complex DataFlow, which transfers files from an AWS S3 bucket.
I will be writing an article around setting that up. Stay tuned for that!