Abstract—An essential part of building a data-driven organization is the ability to handle and process continuous streams of data to discover actionable insights. The explosive growth of interconnected devices and the social Web has led to a large volume of data being generated on a continuous basis. Streaming data sources such as stock quotes, credit card transactions, trending news, traffic conditions, and time-sensitive patient data are not only very common but can rapidly depreciate in value if not processed quickly. The ever-increasing volume and highly irregular nature of data rates pose new challenges to data stream processing systems. One such challenging but important task is how to accurately ingest and integrate data streams from various sources and locations into an analytics platform. These challenges demand new strategies and systems that can offer the desired degree of scalability and robustness in handling failures. This paper investigates the fundamental requirements and the state of the art of existing data stream ingestion systems, proposes a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform, and demonstrates the utility of the framework in a real-world data stream processing case study that integrates Apache NiFi and Kafka for processing high-velocity news articles from across the globe. The study also identifies best practices and gaps for future research in developing large-scale data stream processing infrastructure.

Keywords: big data; data stream ingestion; data integration; dataflow management; Kafka; NiFi

I. INTRODUCTION

With the evolution and increasing popularity of social media platforms and the Internet of Things (IoT), inconceivable volumes of data are being generated [1]. Internet Protocol (IP) traffic, as predicted by Ballard et al. [2], has now reached half a zettabyte. A greater portion of this data may not have an apparent use today but may be useful in the future. Some of this data may lose its value or be lost forever if not processed immediately. It is, therefore, important to ingest and process the data, save its important aspects, and delete the portions that are not useful [3]. With the way new data sources are evolving daily, more businesses will depend on being able to process and make decisions on data streams.

The challenge of data explosion has generated a lot of research interest in low-latency in-memory frameworks and data stream processing systems in recent years [4]. To keep up with the high rate and volume of data, modern data processing systems must deliver insights with minimal latency and high throughput [5]. Streaming data analytics is a programming paradigm designed to incorporate continuous data into a decision-making process [2]. It is useful in identifying perishable insights that require immediate or time-constrained action. However, as a variety of data stream processing tools have become available, understanding the required capabilities of streaming architectures is vital to making the right design or usage choices [6].

A typical streaming analytics system is built on top of a three-layer stack that includes ingestion, processing, and storage components [5]. The ingestion layer is the entry point to the streaming architecture. It decouples, automates, and manages the flow of information from data sources to the processing and storage layers. The processing layer consumes the data streams buffered by the ingestion layer and sends the output or intermediate results to the storage layer. The storage layer is responsible for holding data in an in-memory data store for iterative computations or in databases for long-term persistence [6, 7]. The stored data may be processed further, and the analytics results are delivered to a variety of display and decision support tools [8].

Data ingestion is an area that is often overlooked, yet its importance cannot be overstated [1]. Many organizations have data stored in files that need to be moved around and processed at many different locations. Data in motion needs dataflow management [9]. Traditionally, streaming analytics systems have mostly been limited to handling dataflows within a local data center. However, the world has become connected to the extent that many organizations now operate over several data centers in different geo-locations. Streaming analytics systems are faced with the challenge of collecting and connecting huge data streams across the globe. This is an important requirement in big data projects where companies aim to ingest a variety of data sources, ranging from live multimedia, to IoT data, to real-time headlines from social media and blogs. Another challenge is to provide security, auditing, and provenance in a data ingestion mechanism. The analytical value of data depends entirely on its completeness, accuracy, and consistency. Achieving accurate and continuous data ingestion and management is a complex and challenging task that requires proper planning, specialized tools, and expertise [10]. These challenges demand new strategies and systems that can offer the desired degree of scalability and robustness in handling failures. Data ingestion has previously been targeted under different research initiatives such as Extract, Transform, and Load (ETL), data integration, deduplication, integrity constraint maintenance, and bulk data loading [11]. Dataflow management is used in this study to refer to the tasks of ingesting, integrating, extracting, enriching, and distributing data streams within or outside an analytics platform.
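The three-layer stack described above (ingestion, processing, storage) can be sketched as a minimal in-process pipeline. The MiniPipeline class and its trivial lower-casing "analytics" step below are purely illustrative assumptions, not part of any system cited in this paper:

```python
from collections import deque

class MiniPipeline:
    """Toy three-layer stack: ingestion buffers raw events,
    processing transforms them, storage persists the results."""

    def __init__(self):
        self.buffer = deque()   # ingestion layer: decouples sources from processing
        self.store = []         # storage layer: stands in for a database or HDFS

    def ingest(self, event):
        # Ingestion layer: accept and buffer a raw event from any source.
        self.buffer.append(event)

    def process_all(self):
        # Processing layer: drain the ingestion buffer, send results to storage.
        while self.buffer:
            event = self.buffer.popleft()
            self.store.append(event.strip().lower())  # placeholder "analytics"

pipe = MiniPipeline()
for e in ["  Breaking NEWS ", "Stock Quote "]:
    pipe.ingest(e)
pipe.process_all()
print(pipe.store)  # → ['breaking news', 'stock quote']
```

The point of the buffer is the decoupling: sources can keep calling `ingest` regardless of whether the processing layer is currently draining the queue.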
The contributions of this work are as follows. First, we propose a scalable and fault-tolerant dataflow management framework that can serve as a reusable component across many feeds of structured and unstructured input data. Second, we demonstrate the utility of the framework in a real-world data stream processing case study that integrates Kafka and HDFS in a dataflow system powered by NiFi. The paper is organized as follows. Section II introduces the requirements of a data stream ingestion system. Following the requirements, we present the proposed framework and its features in Section III. Section IV details our experimental and evaluation results. A study of related work is presented in Section V. Finally, we give concluding remarks in Section VI.

II. REQUIREMENTS OF A STREAM INGESTION SYSTEM

The data ingestion layer serves to acquire, buffer, and optionally pre-process data streams (e.g., filter) before they are consumed by the analytics application. Important features to consider in data stream ingestion tools include the ease of installing, publishing, transporting, consuming, and archiving streams to disk. A data ingestion system should support high throughput and low latency and must scale to a large number of both data stream producers and consumers [12]. We categorized these requirements into source integration and pre-processing, fault tolerance and message delivery guarantees, provenance and security, backpressure and routing, scalability, and extensibility.

A. Source integration and preprocessing

Data streams may be sourced directly from HTTP/WebSockets Application Programming Interfaces (APIs), REST APIs, streaming APIs, IoT hubs, or through message queuing sources. One of the fundamental issues of stream computing is the challenge of collecting and integrating data from a multitude of sources. The most complex and difficult part of integrating data from various sources is the task of transforming data into a common format [13]. When the sources of streaming data are diverse, for instance hundreds of sources emitting dozens of data formats, improving the rate of data ingestion and the efficiency of data processing becomes a challenging task [10]. Multiple applications may also be lined up to consume the ingested data. It is desirable to integrate the incoming streams into a single flow and then transform it in multiple ways to drive different applications concurrently [7].

Effective data stream ingestion involves the prioritization of data sources, the validation of individual files, and the routing of data items to a desired destination. Data stream ingestion systems should be able to verify and filter sources, language, and content format to ensure that the source integration is accurate, smooth, and free from noise (such as duplicates). Some pre-processing steps will be required to integrate global news data streams from various sources such as RSS feeds, blogs, and social media, which are mainly unstructured in nature. The technology for handling real-time data integration is more complex than that for static data [13].

B. Fault tolerance and message delivery guarantees

Data ingestion is expected to run on a large cluster that may be prone to software, hardware, and other third-party system failures. Failures can cause the loss of large amounts of data streams, which may lead to erroneous analytic results. Ingestion systems should incorporate high-availability mechanisms that allow them to operate continuously for a long time in spite of failures [5]. It is, therefore, desirable for ingestion systems to offer the desired degree of robustness in handling failures while minimizing data loss [7]. Data ingestion needs to support high throughput and low latency and must scale to a large number of both data producers and consumers [12].

C. Provenance and security

One of the reasons modern systems can now more easily handle streaming data is improvements in the way message-passing systems work. Highly effective messaging technologies collect streaming data from many sources (sometimes hundreds, thousands, or even millions) and deliver it to multiple consumers of the data, including but not limited to real-time applications. Effective message-passing capabilities are needed as a fundamental aspect of the streaming infrastructure.

D. Scalability

Multiple feeds with variable data arrival rates imply a varying demand for resources. An ingestion system should be scalable so that it can ingest increasingly large volumes of data from multiple sources. The system should also demonstrate elasticity by automatically scaling in or out to meet the varying demand for resources [7].

E. Backpressure and Routing

Backpressure in data stream processing is a situation where the ingestion layer or other components of a data stream processing system (DSPS) are unable to handle the rate at which data streams are received. An ingestion system should be able to buffer data during temporary spikes in workload and provide a mechanism to replay it later. Rate throttling is a typical example of a backpressure mechanism for handling fast arrivals of data streams. It is an artificial restriction on the rate at which tuples are delivered [5].

F. Extensibility

Ingestion systems must be generic enough to work with a variety of data sources and high-level applications. A plug-and-play model is desired to allow extension of the offered functionality [7]. This provides the ability to add or remove consumers or new functionalities at any time without changing the data ingestion pipeline. Ingestion systems should be able to filter any erroneous or malicious data items before transporting the data to processing or storage layers.

III. PROPOSED FRAMEWORK

The focus of this paper is dataflow management, which refers to an automated and managed flow of information between systems. Our goal is to design a highly flexible dataflow management framework for large-scale data stream ingestion and integration with the capability of meeting all or most of the challenges detailed in the previous section. The proposed dataflow management framework is shown in Fig. 1. The framework components (in blue) include (1) data stream acquisition from disparate sources, (2) data stream extraction, enrichment, and integration, and (3) data stream distribution to various downstream systems such as data stores or analytics platforms.
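Rate throttling, the backpressure mechanism discussed in Section II.E, can be sketched as a token bucket. The TokenBucket class, its parameters, and the injected clock below are illustrative assumptions rather than code from any of the surveyed systems:

```python
class TokenBucket:
    """Throttle tuple delivery to at most `rate` tuples per second,
    while allowing short bursts of up to `capacity` tuples."""

    def __init__(self, rate, capacity, clock):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock          # injectable time source (deterministic here)
        self.last = clock()

    def try_deliver(self):
        # Refill tokens for the elapsed time, then spend one if available.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller should buffer the tuple and replay it later

# Simulated clock: 20 tuples arrive within the same instant.
t = [0.0]
bucket = TokenBucket(rate=100, capacity=5, clock=lambda: t[0])
burst = sum(bucket.try_deliver() for _ in range(20))
print(burst)                 # → 5  (only the burst capacity is delivered immediately)
t[0] += 0.05                 # 50 ms later, 5 more tokens have accrued
print(bucket.try_deliver())  # → True
```

Tuples rejected by `try_deliver` are not lost: under the buffering requirement above, they are queued and replayed once tokens accrue.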
Figure 1. Dataflow management framework. (The figure shows streaming data sources entering (1) data stream acquisition; (2) data stream integration and extraction, with merge, filter, route, validate, and enrichment operations; and (3) data stream distribution to streaming analytics, storage, and data stores.)

The following subsections describe each of these functionalities in the framework.

A. Data Stream Acquisition

Data acquisition is the entry point for bringing data into a processing platform. It involves the ingestion of data of different formats and from several sources. Streams of data, such as feeds from IoT devices or the Twitter firehose, arrive into the system over sockets from internal or external servers. These streams are then merged, filtered, and distributed to connected layers for processing, temporary storage, and persistence. The framework supports commonly used data access schemes such as sockets, Representational State Transfer (REST) APIs, streaming APIs, or custom schemes, as well as modern device and application interaction patterns such as publish-subscribe and stream protocols.

Rather than designing a new tool for data acquisition from scratch, we use an open source dataflow management system called NiFi¹. Young et al. [14] describe NiFi as a data-in-motion technology that uses flow-based processing. It enables data acquisition, basic event processing, and data distribution. NiFi gives organizations a distributed and resilient platform for building enterprise dataflows [15]. It provides the capability to accommodate the diverse dataflows being generated by the connected world. NiFi enables seamless connections among databases, big data clusters, message queues, and devices. It incorporates visual command and control, provenance (data lineage), prioritization, buffering (backpressure), latency, throughput, security, scalability, and extensibility mechanisms [9]. We chose NiFi because it is highly configurable and provides a scalable and robust solution for handling the flow and integration of data streams of different formats from different sources through a cluster of machines. NiFi was designed to meet dataflow challenges that include network failures and crashes, excess loads, corrupt data, and rapid changes in organizational requirements, compliance, and security. MiNiFi² is an interesting project aimed at extending NiFi's capabilities by collecting data at the edge or source of its creation and bringing it directly to a central NiFi instance.

For the purposes of our study, NiFi enables us to quickly build simple pipelines for prototyping before scaling to full production. These and many other features of NiFi, therefore, meet many data acquisition use case requirements.

B. Data Stream Extraction, Enrichment, and Integration

Depending on the nature of the incoming data and the intended applications, several tasks such as language, noise, and duplicate detection, content parsing, and data type transformations are performed on the ingested stream.

1) Extraction

Data streams can be ingested in their raw format onto different schemas to enable a variety of different kinds of downstream analytics. The NiFi Expression Language provides the ability to reference attributes, compare them to other values, and manipulate their values. It supports several data types including number, decimal, date, Boolean, and string. It is used heavily throughout the NiFi application for configuring processor properties and provides many different functions (taking zero or more arguments) to meet the needs of an automated dataflow. The functions can be chained together to create expressions for effective and efficient data stream extraction and manipulation. Near-duplicate detection of incoming data streams is a fundamental extraction task for effective stream ingestion. NiFi provides customizable processors such as DetectDuplicate, for detecting multiple copies of the same record in a dataflow, and ExecuteScript and ExecuteStreamCommand, for deduplication and filtering of erroneous or malicious data items before transporting the data to processing or storage layers.

2) Enrichment

Enrichment is a common use case when working on data ingestion or flow management. It involves getting data from an external source (such as a database, file, or API) to add more details, context, or information to the data being ingested. Often the enrichment is done in batch using a join operation. However, doing the enrichment on data streams in real time is more interesting. NiFi provides processors such as ISPEnrichIP, LookupAttribute, and LookupRecord for data stream enrichment tasks.

3) Integration

The practice associated with managing data that travels between applications, data stores, systems, and organizations is traditionally called data integration [13]. Several techniques for managing and integrating data in motion have been developed to decrease the complexity of interactions and increase scalability [6]. Depending on the nature of the incoming data stream, integration may be achieved at the data acquisition stage. Data integration within NiFi is achieved using processors such as MergeContent, MergeRecord, and PartitionRecord.

¹ https://nifi.apache.org/
² https://nifi.apache.org/minifi/
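The extraction and enrichment steps above can be approximated outside NiFi in a few lines of Python. The hash-based duplicate cache and the `sources` lookup table below are simplified stand-ins for NiFi's DetectDuplicate and LookupRecord processors, not their actual implementations, and the field names are hypothetical:

```python
import hashlib

seen = set()   # stands in for DetectDuplicate's distributed cache of seen keys
sources = {"bbc": {"country": "UK"}, "cnn": {"country": "US"}}  # enrichment lookup table

def ingest(article):
    """Drop near-duplicates, then enrich surviving articles from the lookup table."""
    # Normalize before hashing so trivially different copies collide on the same key.
    key = hashlib.sha256(article["title"].strip().lower().encode()).hexdigest()
    if key in seen:
        return None                                    # duplicate: filtered from the flow
    seen.add(key)
    return dict(article, **sources.get(article["source"], {}))  # add context attributes

out = [ingest(a) for a in [
    {"title": "Markets rally", "source": "bbc"},
    {"title": "  markets RALLY ", "source": "cnn"},    # near-duplicate of the first
    {"title": "Storm warning", "source": "cnn"},
]]
print([a for a in out if a])
# → [{'title': 'Markets rally', 'source': 'bbc', 'country': 'UK'},
#    {'title': 'Storm warning', 'source': 'cnn', 'country': 'US'}]
```

Hashing a normalized form of the content catches only trivial near-duplicates; fuzzier matching (e.g., similarity hashing) would be needed for reworded copies.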
C. Data Stream Distribution

Big data ingestion is about moving data (especially unstructured data) from its sources or producers into big data stores (such as HDFS or Cassandra) or a processing system (such as Storm, Flink, or Spark Streaming) via message brokers. In this case, MiNiFi can be utilized to bring data from sources directly to a central NiFi instance, which can then deliver the data to the appropriate Kafka topic. A more complex but interesting scenario is a bi-directional flow which combines the power of NiFi, Kafka, and a stream processing engine to create a dynamic, self-adjusting dataflow. NiFi-Kafka integration provides the ability to add and remove consumers at any time without changing the data ingestion pipeline [1]. Combining NiFi, Kafka, and Spark Streaming provides a compelling open source architecture option for building next-generation, near-real-time ETL data pipelines. Next, we evaluate the framework in a case study that involves ingesting and distributing global news articles from several sources for media monitoring.

IV. EXPERIMENTAL EVALUATION

This section demonstrates how to consume massive data streams from streaming APIs using a global news and social media monitoring use case scenario.

A. Use Case Scenario

To illustrate the utility of the dataflow management framework, we use a media monitoring use case with the following functional requirements.
• Ingest streaming news article data from several sources such as RSS feeds and the Twitter Streaming API.
• Integrate the ingested streams with other news article sources being scraped continuously from a variety of specialized sources.
• Extract and filter noise such as fake news and duplicates in the data streams.
• Route news articles to relevant consumers such as persistent data stores or analytics engines.
• Support data enrichment, provenance, extensibility, and guaranteed message delivery.

In this case study, we demonstrate how to utilize the dataflow management framework in Fig. 2 to fetch, extract, enrich, integrate, and distribute live news stories from the Twitter Streaming API and Satori channels (Big RSS and Worldwide Live Data) [16]. Next, we describe how to achieve these tasks using a NiFi-Kafka dataflow in a cluster.

Figure 2. News articles processing infrastructure with a robust and scalable dataflow management framework. (The figure shows live news sources, including Twitter, feeding the dataflow management layer.)

B. Global News Articles Dataflow

The main source of our data, Satori [16], is a cloud-based live platform which provides a publish-subscribe messaging service called RTM and makes available a set of free real-time data feeds as part of its Open Data Channels initiative. We ingest news stories from Big RSS, a live data channel in Satori for gathering RSS feeds. It is one of the largest RSS aggregators in the world, with over 6.5 million feeds³. Another very important source of streaming news stories utilized is the Twitter API platform [17]. Twitter Inc. offers a unified platform with scalable access to its data. There are currently two options (each with a varying number of filters and filtering capabilities) for streaming real-time Tweets. We used the standard (free) option, which allows 400 keywords, 5,000 user ids, and 25 location boxes. There is also an enterprise option with premium operators that allows up to 250,000 filters (up to 2,048 characters each) per stream. The volume and velocity of data streams from the Twitter Streaming API depend on the popularity of the keyword queries. The entire flow was implemented using three local process groups in NiFi. Next, we describe the output of the dataflow framework and possible performance improvements.

C. Dataflow Output and Performance Improvement

We present the output of the integrated news article sources (from Twitter, Big RSS, and a custom WebSocket) that is continuously consumed and saved in HDFS. Fig. 3 shows a screenshot of recently processed news articles. NiFi automatically records, indexes, and makes available provenance data as objects through the system. Data can be downloaded, replayed, tracked, and evaluated at numerous points along the dataflow path using the provenance user interface. This information is useful for troubleshooting, optimization, and other scenarios.

³ https://www.satori.com/livedata/channels/big-rss
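The provenance capability described above can be illustrated with a minimal in-memory lineage log. The `record` and `process` functions and the event names below are a simplified sketch inspired by NiFi-style provenance events, not NiFi's actual API:

```python
import hashlib
import time

provenance = []   # in-memory stand-in for an indexed provenance repository

def record(event_type, flowfile, step):
    """Append a lineage event so the item can later be tracked and replayed."""
    provenance.append({
        "time": time.time(),
        "type": event_type,      # simplified event names: RECEIVE, MODIFY, SEND
        "step": step,
        "id": flowfile["id"],
    })

def process(article):
    # Assign a stable id from the raw content, then log each stage it passes through.
    flowfile = {"id": hashlib.sha256(article.encode()).hexdigest(), "content": article}
    record("RECEIVE", flowfile, "ingest")
    flowfile["content"] = flowfile["content"].strip()
    record("MODIFY", flowfile, "clean")
    record("SEND", flowfile, "hdfs")
    return flowfile

ff = process("  Breaking: markets rally  ")
print([e["type"] for e in provenance if e["id"] == ff["id"]])
# → ['RECEIVE', 'MODIFY', 'SEND']
```

Because every event carries the item's id and timestamp, the log supports exactly the troubleshooting queries mentioned above: reconstructing an article's path through the flow and identifying the stage where it was dropped or altered.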