Open Source ETL: Apache NiFi vs Streamsets
Choosing between mainstream open source ETL projects
While working with the Cube.js framework we've seen a lot of different ETL
tools used by data engineers nowadays. Most of them require writing
code. But there are some visual ETL you can try as well. We asked
Dmitry Dorofeev, Head of R&D at Luxms Group, to tell us about his
experience with comparing Apache NiFi and Streamsets.
***
Our team at Luxms Inc. recently faced a boring data integration
problem: some data was stored in Hadoop, some in Oracle, and a
little bit in Excel. The goal was to ETL all that data into Greenplum and
finally provide some BI on top of it.
Dataflow Programming
Programmers, analysts, and even managers often draw a box and
arrow diagram to illustrate some flows. You can even use these boxes
and arrows to create programs. We can track such attempts back to the
1960s, when the Dataflow Programming paradigm was born at MIT.
Yes, you don’t have to know any programming language. You just use
ready-made “processors” represented with boxes, connect them with
arrows, which represent exchange of data between “processors,” and
that’s it.
Almost anything can be a source, for example, files on the disk or AWS,
JDBC query, Hadoop, web service, MQTT, RabbitMQ, Kafka, Twitter, or
UDP socket.
Sinks are basically the same as sources, but they are designed for
writing data.
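
The box-and-arrow idea can be sketched in a few lines of Python. This is only an illustration of the paradigm, not how Apache NiFi or Streamsets work internally; the function names `source`, `to_upper`, and `sink` are hypothetical and chosen for this example.

```python
from typing import Iterable, Iterator, List


def source(lines: Iterable[str]) -> Iterator[dict]:
    """Source box: turn raw input lines into records."""
    for number, line in enumerate(lines, start=1):
        yield {"line": number, "text": line}


def to_upper(records: Iterable[dict]) -> Iterator[dict]:
    """Processor box: transform each record that flows through."""
    for record in records:
        yield {**record, "text": record["text"].upper()}


def sink(records: Iterable[dict], out: List[dict]) -> None:
    """Sink box: write records to a destination (here, a plain list)."""
    for record in records:
        out.append(record)


# The "arrows" are the nested calls: source -> to_upper -> sink.
written: List[dict] = []
sink(to_upper(source(["hadoop", "oracle", "excel"])), written)
print(written)
```

Each box only knows about the records it receives, so boxes can be rearranged or replaced without touching the others. The visual tools build on exactly this property.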
Luckily, there are two open source visual tools with a web interface:
Apache NiFi and StreamSets Data Collector (SDC). NiFi was donated
by the NSA to the Apache Foundation in 2014, and current development
and support are provided mostly by Hortonworks. SDC was started by a
California-based startup in 2014 as an open source ETL project
available on GitHub. The first release was published in June 2015.
Both products are written in Java and distributed under the Apache 2.0
license.
Releases 57 113
Apache NiFi
You can terminate outputs with checkboxes, so Apache NiFi will ignore
terminated outputs and will not send any FlowFiles there.
If you are not yet impressed, how about different queue policies like
FIFO, LIFO, and others you can apply to queues in connections?
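
To make the difference between queue policies concrete, here is a minimal sketch of how the same queued items leave a connection under FIFO, LIFO, and a priority rule. It is not NiFi's implementation; the helper names and the `size` field are assumptions made for this example.

```python
import heapq
from collections import deque
from typing import Any, Callable, List


def drain_fifo(items: List[Any]) -> List[Any]:
    """First in, first out: the oldest item leaves the queue first."""
    queue = deque(items)
    return [queue.popleft() for _ in range(len(queue))]


def drain_lifo(items: List[Any]) -> List[Any]:
    """Last in, first out: the newest item leaves the queue first."""
    stack = list(items)
    return [stack.pop() for _ in range(len(stack))]


def drain_by_priority(items: List[Any], key: Callable[[Any], Any]) -> List[Any]:
    """Lowest key first; the arrival index breaks ties."""
    heap = [(key(item), index, item) for index, item in enumerate(items)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical queued items, listed in arrival order.
flowfiles = [
    {"name": "a", "size": 30},
    {"name": "b", "size": 10},
    {"name": "c", "size": 20},
]

fifo_order = [f["name"] for f in drain_fifo(flowfiles)]
lifo_order = [f["name"] for f in drain_lifo(flowfiles)]
smallest_first = [f["name"] for f in drain_by_priority(flowfiles, lambda f: f["size"])]
print(fifo_order, lifo_order, smallest_first)
```

The policy changes only the order in which a downstream processor sees the items, not which items it sees.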
Even with these awesome features and great architecture, I was not
very comfortable with the Apache NiFi user interface. It is definitely
usable, but not sexy.
One nice thing about Streamsets is that it can process binary data.
Some sources, such as Kafka Consumer, can read messages from the
Kafka topic and pass them to other processors or external systems
without parsing the structure of the binary message into the record
format. This allows us to forward the raw data to some other
destination with minimum overhead.
The more powerful option is the whole file data format, supported by
several origins, including S3, directory, FTP, and more. With the whole
file format, the file is not parsed, but file metadata and a reference to
the content are sent along the pipeline. Processors can optionally act on
the content: script evaluators and custom processors can get an input
stream to the content. But in the default case, once the whole file
record arrives at the destination, the data is streamed directly from its
source.
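
The whole file idea can be illustrated with a small sketch: the record carries metadata and a way to open the content, and the bytes are read only when the destination writes them. This is an illustration of the concept, not the Streamsets API; `WholeFileRecord`, `make_record`, and `write_to_destination` are hypothetical names.

```python
import os
import shutil
import tempfile
from dataclasses import dataclass
from typing import BinaryIO, Callable


@dataclass
class WholeFileRecord:
    """Hypothetical record: metadata plus a reference to the content."""
    name: str
    size: int
    open_stream: Callable[[], BinaryIO]


def make_record(path: str) -> WholeFileRecord:
    # Only metadata is read here; the content stays where it is.
    return WholeFileRecord(
        name=os.path.basename(path),
        size=os.path.getsize(path),
        open_stream=lambda: open(path, "rb"),
    )


def write_to_destination(record: WholeFileRecord, dest_dir: str) -> str:
    # The content is streamed from its source only at this point.
    target = os.path.join(dest_dir, record.name)
    with record.open_stream() as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


with tempfile.TemporaryDirectory() as workdir:
    src_path = os.path.join(workdir, "input.bin")
    with open(src_path, "wb") as f:
        f.write(b"\x00\x01binary payload")
    dest_dir = os.path.join(workdir, "dest")
    os.mkdir(dest_dir)

    record = make_record(src_path)
    copied_path = write_to_destination(record, dest_dir)
    with open(copied_path, "rb") as f:
        copied_bytes = f.read()
```

Intermediate steps that only look at `name` or `size` never touch the file content, which is why this approach stays cheap even for large binary files.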
Even though there are some complaints about lack of binary data
support in Streamsets, the whole file support has been there since
version 1.6.0.0, released in September 2016.
That means that you always start your dataflow from the beginning after
you make any changes in it with Streamsets. With Apache NiFi you
have a chance to stop a misbehaving processor, fix it, and start again.
Hopefully, queued FlowFiles will be sent to the fixed processor and you
will not miss the data.
But that doesn’t mean that Streamsets dataflows are harder to debug.
Actually, it is easier: you have a nice-looking live dashboard displaying a
lot of statistics for every processor while your dataflow is running.
Errors are cleanly presented as red numbers on the processor icon, and
you can see individual errors for every faulty record with a mouse click.
You may even put record filters on the connections between processors
to inspect the records in question. Filters can be applied while your
dataflow is running, so I used them as a live debugging tool.
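
A record filter on a connection behaves roughly like a tap: every record still flows downstream, while the ones matching a condition are also captured for inspection. The sketch below shows that idea only; it is not how Streamsets implements it, and `tap` and the `amount` field are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List


def tap(
    records: Iterable[dict],
    predicate: Callable[[dict], bool],
    captured: List[dict],
) -> Iterator[dict]:
    """Pass every record through; keep a copy of the ones that match."""
    for record in records:
        if predicate(record):
            captured.append(record)
        yield record  # the record continues downstream unchanged


records = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},
    {"id": 3, "amount": 7},
]

suspicious: List[dict] = []
passed = list(tap(records, lambda r: r["amount"] < 0, suspicious))
print(suspicious)
```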
Origins: they get data from the external sources. You may have only
one Origin Processor in your dataflow.
I am definitely happier with the clean Apache NiFi architecture with
just Processors and Controller Services, but the Streamsets design is
also fine and can be quickly picked up.
UI
Apache NiFi
There is not much to say about the Apache NiFi UI. It feels spartan, and
it is very easy to follow, thanks to the great architecture with minimum
concepts. Probably the only drawback I discovered is that Apache NiFi
will not autosize text fields for your long SQL queries, so you will have
to manually resize the popup text field every time you want to edit one.
Streamsets
The first thing I quickly got annoyed with was the absence of Controller
Services, especially for JDBC settings. You need to fill in all JDBC
settings for every processor that reads data from the same JDBC
source. There is just no user-friendly way to reuse such information.
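
What I missed is the pattern where connection settings are defined once and referenced by every processor that needs them. A minimal sketch of that pattern follows; it is not NiFi's Controller Service API, and the class names, the JDBC URL, and the queries are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JdbcSettings:
    """Hypothetical shared settings, defined once per data source."""
    url: str
    user: str


@dataclass
class QueryProcessor:
    """Hypothetical processor that references shared settings."""
    name: str
    query: str
    connection: JdbcSettings


# Assumed example values; replace with your own connection details.
shared = JdbcSettings(url="jdbc:postgresql://db.example.com:5432/dwh", user="etl")

processors = [
    QueryProcessor("read_orders", "SELECT * FROM orders", shared),
    QueryProcessor("read_clients", "SELECT * FROM clients", shared),
]

# Changing the data source means editing one object, not every processor.
print({p.name: p.connection.url for p in processors})
```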
Before you can run your dataflow, Streamsets will check each
processor inside your dataflow to make sure all processors are
correctly configured. It sounds like a good thing, and sometimes it helped
me, but at other times it got in my way. In Apache NiFi you can have
disconnected processors, and I usually leave them that way for debugging
purposes. In Streamsets you cannot do the same, since all the
processors must be connected for the dataflow to pass validation.
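
The connectivity rule described above can be sketched as a simple check over the dataflow graph. The real Streamsets validation covers much more (processor configuration, for instance); this example only shows the "no disconnected processors" part, and `find_disconnected` is a hypothetical helper.

```python
from typing import List, Set, Tuple


def find_disconnected(
    processors: List[str],
    connections: List[Tuple[str, str]],
) -> Set[str]:
    """Return processors that do not appear in any connection."""
    connected: Set[str] = set()
    for upstream, downstream in connections:
        connected.add(upstream)
        connected.add(downstream)
    return set(processors) - connected


# Hypothetical dataflow with one processor left aside for debugging.
processors = ["origin", "filter", "debug_probe", "destination"]
connections = [("origin", "filter"), ("filter", "destination")]

disconnected = find_disconnected(processors, connections)
print(disconnected)
```

A dataflow like this one would be fine to keep around in Apache NiFi, while a validator enforcing the rule would reject it because of `debug_probe`.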
Streamsets has syntax highlighting for SQL which is a nice feature, but
not always useful. Our data engineer creates heavy SQL queries which
can easily be a hundred lines long. The syntax highlighting process
becomes slow and that results in another annoyance. If you edit the last
lines of the long SQL query the caret unexpectedly moves to the
beginning and what you type appears on the first line.
Conclusion
I made a very brief introduction to Apache NiFi and Streamsets. They
have plenty of useful features not covered in this blog post. Even if you
do not find the required built-in or third-party processors, you can
always use Python, Javascript, R, or even Apache Spark to program
your complex data transformation logic in the Apache NiFi or
Streamsets dataflows.
Both Apache NiFi and Streamsets are mature, open source ETL tools.
They have very similar functionality, and the only way to make an informed
choice is to try both! That’s what I did. Even after 3 months of running
both products I cannot see a clear winner.
Apache NiFi
Pros:
Data Provenance
Cons:
Streamsets
Pros:
Sexy UI