Getting Started with Apache NiFi

Introduction

Apache NiFi is a dataflow system that is currently under incubation at the Apache Software Foundation. NiFi is based on the concepts of flow-based programming and is highly configurable. NiFi uses a component-based extension model to rapidly add capabilities to complex dataflows. Out of the box, NiFi has several extensions for dealing with file-based dataflows such as FTP, SFTP, and HTTP integration, as well as integration with HDFS. One of NiFi’s unique features is a rich, web-based interface for designing, controlling, and monitoring a dataflow.

Build and Install

Since NiFi is newly added to the incubator, it does not yet have released artifacts. The full details are available in the NiFi developer quickstart guide, but we’ll summarize them here. To get started with NiFi you should clone the NiFi source code from the Apache git repository and check out the develop branch:

git clone https://git-wip-us.apache.org/repos/asf/incubator-nifi.git
cd incubator-nifi
git checkout -b develop origin/develop

Now we’re ready to build the source. NiFi requires a recent Java 7 JDK and Apache Maven 3.x. NiFi uses a WAR-like packaging format called NAR to bundle and isolate extension dependencies. Before building the rest of the code base, we need to build the nar-maven-plugin which is used to create the NAR bundles:

cd nar-maven-plugin
mvn install

Once the nar-maven-plugin is built, we can safely build the rest of the project:

cd ..
mvn install

The final step is to build a tarball that you can install on a Hadoop cluster:

cd assembly
mvn assembly:assembly
scp target/nifi-0.0.1-SNAPSHOT-bin.tar.gz edge.hadoop.example.com:
ssh edge.hadoop.example.com
tar -zxf nifi-0.0.1-SNAPSHOT-bin.tar.gz

After installing NiFi, we can start it in the background using the nifi.sh script:

cd nifi-0.0.1-SNAPSHOT
bin/nifi.sh start
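
To confirm that NiFi came up cleanly, the same script can report its status and you can tail the application log. The status subcommand and the logs/nifi-app.log path below reflect the default layout of the unpacked distribution; adjust if you have changed the logging configuration:

bin/nifi.sh status
tail -f logs/nifi-app.log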

Build Your First Dataflow

Now you can open up the NiFi UI in your web browser. By default the web UI runs on port 8080:

Apache NiFi User Interface
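
If you are working on a remote edge node and don’t have a browser pointed at it yet, you can first check from the shell that the UI is answering. The /nifi context path is the default for the web UI; substitute your own host name for the example host used earlier:

curl -s -o /dev/null -w "%{http_code}\n" http://edge.hadoop.example.com:8080/nifi/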

When the UI first opens you’ll see a blank canvas that looks like a drafting board. NiFi builds up dataflows by laying down processors and then drawing connections to specify the relationships between them. Each relationship between processors is backed by a queue through which data items, called FlowFiles, will flow. These abstractions keep processors independent of one another. For our first dataflow we’ll create a local file dropbox and configure NiFi to automatically move files from the local dropbox into HDFS.

Let’s start by dragging a processor onto the canvas. Click on the processor icon Add Processor in the toolbar and drag it to the canvas. This will bring up the Add Processor dialog where we can select the type of processor we want. We’ll start with the GetFile processor which picks up files from the local file system and creates a FlowFile with the contents of each file:

Add the GetFile processor

You can use tags or search criteria to make it easier to find the processor you’re looking for. After the processor is added, we still have to configure it with the location of our dropbox directory. Right click on the processor and click Configure:

Configure a processor

This will bring up the Configure Processor dialog. In the Settings tab we will customize the name. Then click in the Properties tab and set the Input Directory. Feel free to play around with the other settings. Each setting has a help dialog that can be triggered by clicking on the question mark icon next to the property.

GetFile configuration
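
One detail worth checking before the flow is started: the Input Directory must exist and be writable by the account running NiFi. A minimal sketch, assuming you set the Input Directory to /dropbox (the directory used to test the flow later in this post) and started NiFi as your own user:

sudo mkdir -p /dropbox
sudo chown $(whoami) /dropbox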

When you’re done click Apply. Next we’ll add a PutHDFS processor to write files into HDFS. Add the processor using the same method as the GetFile processor. You can use the Hadoop tag or search for PutHDFS in the Add Processor dialog to find the right processor. We’ll configure the PutHDFS processor by setting the output directory and pointing NiFi at our HDFS configuration files. On my system they were in /etc/hadoop/conf/core-site.xml and /etc/hadoop/conf/hdfs-site.xml. The directory may differ on your system but you should be sure to include both the core-site.xml and hdfs-site.xml files.

Set location of Hadoop config files
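
It’s also worth verifying that both configuration files exist at the paths you entered and that the HDFS client works from the node running NiFi, since PutHDFS relies on them to find the cluster:

ls -l /etc/hadoop/conf/core-site.xml /etc/hadoop/conf/hdfs-site.xml
hdfs dfs -ls /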

Once that is done we need to configure the relationship between the GetFile processor and the PutHDFS processor. For this example we just want to copy every file picked up by the GetFile processor into HDFS so we can add the success relationship from the GetFile processor to the PutHDFS processor. Do this by hovering over the GetFile processor and clicking and dragging the arrow icon from the GetFile processor to the PutHDFS processor.

Add a relationship

Make sure the checkbox next to success is checked under the For relationships section of the Create Connection dialog, then click Add. Before we can start the data flow we need to handle the outgoing relationships of the PutHDFS processor. Since we’re only interested in writing the data to HDFS we can set the failure and success relationships to auto terminate by clicking the checkboxes under Auto terminate relationships, which is found in the Settings tab of the Configure Processor dialog. When you’re done, click Apply:

Terminate relationships

Finally, start the data flow by clicking on the green start icon Start in the toolbar. After the processors start you can copy files into the /dropbox directory on the local file system and they’ll be copied into HDFS. You can monitor the dataflow, including the number of files that have been picked up, the number of bytes written, and the number of files and bytes currently queued between the processors:

data flows
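
To exercise the flow end to end, drop a file into the dropbox and then look for it in HDFS. The /data/dropbox path below is only a stand-in for whatever Directory you configured on the PutHDFS processor:

echo "hello nifi" > /dropbox/test.txt
# /data/dropbox is a placeholder for the PutHDFS output directory you chose
hdfs dfs -ls /data/dropbox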

This was a very simple data flow, but you may want to do something more complex like route files based on attributes such as file name or convert data from one format to another before writing the file to HDFS.

Integrating Kite with NiFi

The Kite SDK is a powerful data API that makes it easier to put data into Hadoop and to work with data once it’s loaded. One of the ways that Kite can be used is to read CSV files and put them into HDFS in Avro or Parquet format using command line tools. While the command line tools are great when you have a few files to load, they become cumbersome when you want to load a regularly updated batch of data into HDFS. NiFi is an excellent tool for automating these types of workflows.
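
For a sense of what we’re about to automate, the manual Kite step boils down to a single csv-import invocation per file. A sketch of that command for one of the pipe-delimited MovieLens files, using the movies dataset name that appears later in this example:

kite-dataset csv-import --delimiter "|" u.item movies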

For our example, we’re going to use NiFi to ingest CSV data found in the MovieLens dataset. To keep the focus on NiFi, we’ll assume that you’ve followed along with this example to ingest MovieLens data using the command line. We’ll be replacing the csv-import step with a NiFi-based data flow.

To speed up our integration, we’ll make use of the ExecuteStreamCommand processor, which runs an executable and streams the contents of a FlowFile to it via standard input. One of the current limitations of the Kite csv-import command is that it can’t import from standard input. We’ll get around this limitation by writing a script which will read from stdin, write to a temporary file, and then execute the csv-import command:

#!/bin/bash

# Copy the FlowFile contents that NiFi streams to us on stdin into a
# temporary CSV file, since csv-import can't read from standard input.
UUID=$(uuidgen)
FILE=/tmp/${UUID}.csv
cat > "${FILE}"

# The last argument is the target dataset name; everything before it is
# passed through to csv-import as options (e.g. --delimiter '|').
others=""
last=""
while [[ $# -gt 0 ]]; do
    others="${others} ${last}"
    last=$1
    shift
done

# ${others} is intentionally left unquoted so the options word-split back
# into separate arguments.
kite-dataset csv-import ${others} "${FILE}" "${last}"
status=$?

rm -f "${FILE}"
exit ${status}
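
Assuming the script is saved as nifi-csv-import.sh in your home directory (as described just below), you can mark it executable and smoke-test it from the shell with the same arguments the NiFi processor will pass to it later; u.item and the movies dataset come from the MovieLens example:

chmod +x ~/nifi-csv-import.sh
~/nifi-csv-import.sh --delimiter "|" movies < u.item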

Save this script to a file called nifi-csv-import.sh in your home directory. Now let’s start setting up the dataflow. We’ll start with a GetFile processor as before, but this time we’ll feed the data to a RouteOnAttribute processor. Since we have two different types of input files and we’re writing to two different datasets, we’ll route based on the file name. Configure the RouteOnAttribute processor by adding two properties:

Property    Value
movies      ${filename:startsWith('u.item')}
ratings     ${filename:startsWith('u.data')}

This will route FlowFiles whose file names start with u.item to a relationship called movies, while files whose names start with u.data are routed to ratings. There is also a relationship called unmatched that will receive any FlowFiles that don’t match one of the attribute matchers we configured. For our example, we can auto terminate the unmatched relationship. We’ll then add two ExecuteStreamCommand processors, one for each dataset. Configure the one to write to the movies dataset with the following properties:

Property             Value
Command Path         /home/${USER}/nifi-csv-import.sh
Command Arguments    --delimiter;|;movies

Replace ${USER} with the username you started NiFi as. Then use these properties for writing to the ratings dataset:

Property             Value
Command Path         /home/${USER}/nifi-csv-import.sh
Command Arguments    ratings

Next connect the movies relationship from the RouteOnAttribute processor to the ExecuteStreamCommand processor for writing to the movies dataset and the ratings relationship to the processor for writing to the ratings dataset. Be sure to auto terminate the outgoing relationships of the ExecuteStreamCommand processors so you can start your dataflow. When everything is configured, you should see a data flow that looks like this:

kite dataflow
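
At this point you can drop the MovieLens files into your dropbox directory and watch them move through the flow. Afterwards, the Kite CLI can confirm that records landed in each dataset; the /dropbox path below assumes the same Input Directory used earlier:

cp u.item u.data /dropbox/
kite-dataset show movies
kite-dataset show ratings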

Summary

NiFi is a powerful and flexible data flow tool that can be used to set up simple file dropboxes or to route a variety of data to complex processors. NiFi’s graphical editor makes it easy to configure, update, and monitor dataflows. With general-purpose processors like ExecuteStreamCommand, you can easily integrate existing data processing tools into your custom dataflows.

Comments

  1. December 22, 2014 @ 6:25 pm Joe Witt

    Joey: Great post. I’d add for those interested that a very common scenario before loading data into HDFS is to merge content to meet some size or time threshold. The ‘MergeContent’ processor in NiFi does exactly that. You can merge like items based on a common attribute into bins which build up until the thresholds/criteria have been met. So let’s say you want to load data into Hadoop with bundles of 128MB or 2 minutes – whichever comes first. And you want to make sure there are no more than ‘1000’ items in the bundle or if it reaches 120 MB you’re happy with letting it through. You can do (assuming defaults otherwise):

    Maximum Number of Entries = 1000
    Minimum Group Size = 120 MB
    Max Bin Age = 2 mins

    And let it run. This processor supports a wide range of merging strategies and considerations. This is really useful for cases where you have high-speed data sources sending you data that some consumers will want fed to them in real time and other consumers will want batched up and loaded in bulk.

    Thanks
    Joe

    • December 22, 2014 @ 10:07 pm Joey Echeverria

      Thanks for the tip Joe! I feel a follow-up “Best Practices for using Apache NiFi with Hadoop” would make sense :)

      -Joey

  2. December 31, 2014 @ 2:36 am Mark Payne

    Joey,

    I’ll point out, too, that the typical use case for PutHDFS would not auto-terminate the ‘failure’ relationship. Instead, you can connect failure back to PutHDFS. This allows you to keep the data in your flow and keep retrying if you fail to put it to HDFS.

    -Mark

  3. January 15, 2015 @ 8:38 pm What is Apache NiFi? – Keep-It-Simple-Tech-Docs

    […] what is in the user documentation. But if you want to get started with NiFi, I suggest reading this excellent blog post. Then, check back here for more tips. Also, check out all that NiFi has to offer by visiting its […]

  4. January 28, 2015 @ 8:22 pm Matt

    It looks like some of the directory names have changed; you now need to do:

    cd nifi-nar-maven-plugin
    mvn install

    then

    cd ../nifi
    mvn install

    then

    cd ../nifi-assembly
    mvn assembly:assembly

    Trouble is, mvn assembly:assembly now isn’t working for me – Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.5.2:assembly (default-cli) on project nifi-assembly: Error reading assemblies: No assembly descriptors found. :(

    • March 4, 2015 @ 4:49 pm Joey

      Hey Matt!

      You’re right, the build process has changed. Fortunately, the process has become a lot simpler. The project released version 1.0.0-incubating of the nar-maven-plugin, so you no longer have to build it yourself during the build. They also moved the nifi-assembly module under the main nifi project. That means the new steps should be:

      cd nifi
      mvn clean install

      Then you can get the assembly in the nifi-assembly/target subdirectory:

      ls nifi-assembly/target
      archive-tmp nifi-0.0.2-incubating-SNAPSHOT-bin.tar.gz
      maven-shared-archive-resources nifi-0.0.2-incubating-SNAPSHOT-bin.zip
      nifi-0.0.2-incubating-SNAPSHOT-bin

      The most up-to-date build instructions are in the README in the project itself:

      https://github.com/apache/incubator-nifi/blob/develop/nifi/README.md

      I’ll see if I can update this post so as to not confuse people in the future.

      Also, if you cloned the repository earlier and updated, you might have some phantom directories (like the top-level nifi-assembly) that can be confusing. You can run git status to see what is still part of the project. Worst case, you can wipe out your clone and start over.

  5. March 2, 2015 @ 1:54 am Madhu Jahagirdar

    How is NiFi different from fluentd and logstash?

    • March 26, 2015 @ 3:38 am Joe Witt

      From a super high level they all seem to be in/around the same space of connecting data sources and data sinks in ways that make developers’ lives easier. I am no expert in fluentd or logstash so I’ll simply focus on the one I do know well, which is Apache NiFi. NiFi offers some very interesting features which, for me as someone whose full-time job was connecting systems, were very compelling.

      First, I wanted to be able to understand and effect change on how those dataflows work, both visually and in real-time. NiFi provides an HTML5-friendly UI to do just that.

      Second, I wanted detailed accounting of every operation that the dataflow broker performed on the data. This includes things like where the data came from, what the broker did to it, whether it was aggregated with other items into something larger, whether it was split from a larger item into smaller things, where it was sent, whether any transforms were applied, etc. NiFi provides that through a capability called data provenance.

      You can learn more about that at the nifi.incubator.apache.org site. Also, I’ll be talking about these sorts of things at ApacheCon in Austin in April. In that talk I will go into greater detail about why those two feature areas in particular really matter.

      Thanks
      Joe

  6. May 27, 2015 @ 10:11 pm Surendra

    I am new to NiFi and have a very naive question. Is NiFi something that can be used as an ETL tool? Can it be used as a record processor, for example to aggregate, scan, or transform data?
    Or is it just a good integration tool? Are there any similarities between Pentaho Data Integration and NiFi, or are these two totally different things?

    • June 2, 2015 @ 5:19 am Joe Witt

      NiFi can certainly be used to route, transform, and mediate between systems. In the case of transforms that can be aggregation, splitting, enrichment, filtering, etc. In the case of mediation it can pull data via any number of protocols like JMS, SFTP, HTTP, etc. and deliver to a variety of protocols. There are similarities with Pentaho’s integration tools only in that they both perform integration-like functions, but NiFi provides many unique features. Feel free to send more questions to dev@nifi.incubator.apache.org

      Thanks
      Joe

  7. June 1, 2015 @ 5:27 am Sandeep Gunnam

    When Apache already has a data pipelining tool like Falcon, why would we want another tool like NiFi? One reason I could think of is that Falcon is more integrated with the Hadoop ecosystem while NiFi works completely independently. Are there more reasons? Also, are there any examples of using the NiFi REST APIs?

    • June 2, 2015 @ 5:22 am Joe Witt

      Hello

      There are several dataflow tools under the Apache umbrella and *many* more outside of Apache. It is accurate that NiFi lives independently of Hadoop as an operating environment, though NiFi can send and receive data from HDFS and has been used to integrate with other systems within the Hadoop ecosystem as well. But NiFi provides some unique features like real-time command and control of a 2D representation of the dataflow and powerful chain of custody (data provenance) features. Feel free to ask more questions at dev@nifi.incubator.apache.org

      Thanks
      Joe

  8. September 10, 2015 @ 7:31 am Vijaya Narayana Reddy

    Joe,

    I just started learning about NiFi as I work closely with the Hortonworks Hadoop platform. I’m just trying to understand how it is different from many other ingestion tools like Flume, Kafka, etc.

    Also, how does NiFi run? Will it run as a client-only library, or will it be deployed as a cluster of machines for processing data, with the cluster sitting completely outside a Hadoop cluster? I know that it’s optional to integrate NiFi with Hadoop, so I’m just wondering how these things work.

    Also, for IoT devices, how is NiFi better suited than other frameworks? For example, if I have some IoT devices, say smart meters that can transmit data wirelessly, where does NiFi connect to the smart meters? Will it be the first interface point to those IoT devices, or will the data first be pulled into some repository and from there be pulled into a NiFi cluster for processing?

    Please let me know your thoughts.

    Thanks
    Vijay
