
A Scalable and Robust Framework for Data Stream Ingestion

Haruna Isah and Farhana Zulkernine
School of Computing
Queen's University
Kingston, Canada
h.isah@cs.queensu.ca, farhana@cs.queensu.ca

Abstract—An essential part of building a data-driven organization is the ability to handle and process continuous streams of data to discover actionable insights. The explosive growth of interconnected devices and the social Web has led to a large volume of data being generated on a continuous basis. Streaming data sources such as stock quotes, credit card transactions, trending news, traffic conditions, and time-sensitive patient data are not only very common but can rapidly depreciate in value if not processed quickly. The ever-increasing volume and highly irregular nature of data rates pose new challenges to data stream processing systems. One such challenging but important task is how to accurately ingest and integrate data streams from various sources and locations into an analytics platform. These challenges demand new strategies and systems that can offer the desired degree of scalability and robustness in handling failures. This paper investigates the fundamental requirements and the state of the art of existing data stream ingestion systems, proposes a scalable and fault-tolerant data stream ingestion and integration framework that can serve as a reusable component across many feeds of structured and unstructured input data in a given platform, and demonstrates the utility of the framework in a real-world data stream processing case study that integrates Apache NiFi and Kafka for processing high-velocity news articles from across the globe. The study also identifies best practices and gaps for future research in developing large-scale data stream processing infrastructure.

Keywords - big data; data stream ingestion; data integration; dataflow management; Kafka; NiFi

I. INTRODUCTION

With the evolution and increasing popularity of social media platforms and the Internet of Things (IoT), inconceivable volumes of data are being generated [1]. Internet Protocol (IP) traffic, as predicted by Ballard et al. [2], has now reached half a zettabyte. A greater portion of this data may not have an apparent use today but may be useful in the future. Some of this data may lose its value or be lost forever if not processed immediately. It is, therefore, important to ingest and process the data, save its important aspects, and delete the portions that are not useful [3]. With the way new data sources are evolving daily, more businesses will depend on being able to process and make decisions on data streams.

The challenge of data explosion has generated a lot of interest in low-latency in-memory frameworks and data stream processing systems in recent years [4]. In order to keep up with the high rate and volume of data, modern data processing systems must deliver insights with minimal latency and high throughput [5]. Streaming data analytics is a new programming paradigm designed to incorporate continuous data into a decision-making process [2]. It is useful in identifying perishable insights that require immediate or time-constrained action. However, as a variety of data stream processing tools have become available, understanding the required capabilities of streaming architectures is vital to making the right design or usage choices [6].

A typical streaming analytics system is built on top of a three-layer stack that includes ingestion, processing, and storage components [5]. The ingestion layer is the entry point to the streaming architecture. It decouples, automates, and manages the flow of information from data sources to the processing and storage layers. The processing layer consumes the data streams buffered by the ingestion layer and sends the output or intermediate results to the storage layer. The storage layer is responsible for holding data in an in-memory data store for iterative computations or in databases for long-term persistence [6, 7]. The stored data may be processed further, and the analytics results are delivered to a variety of display and decision support tools [8].

Data ingestion is an area that is often overlooked, yet its importance cannot be overstated [1]. Many organizations have data stored in files that need to be moved around and processed at many different locations. Data in motion needs dataflow management [9]. Traditionally, streaming analytics systems have mostly been limited to handling dataflows within a local data center. However, the world has become more connected to the extent that many organizations now operate over several data centers in different geo-locations. Streaming analytics systems are faced with the challenge of collecting and connecting huge data streams across the globe. This is an important requirement in big data projects where companies aim to ingest a variety of data sources ranging from live multimedia, to IoT data, to real-time headlines from social media and blogs. Another challenge is to provide security, auditing, and provenance in a data ingestion mechanism. The analytical value of data entirely depends on its completeness, accuracy, and consistency. Achieving accurate and continuous data ingestion and management is a complex and challenging task that requires proper planning, specialized tools, and expertise [10]. These challenges demand new strategies and systems that can offer the desired degree of scalability and robustness in handling failures. Data ingestion has previously been targeted under different research initiatives such as Extract, Transform, and Load (ETL), data integration, deduplication, integrity constraint maintenance, and bulk data loading [11]. Dataflow management is used in this study to refer to the tasks of ingesting, integrating, extracting, enriching, and distributing data streams within or outside an analytics platform.
The contributions of this work are as follows. First, we propose a scalable and fault-tolerant dataflow management framework that can serve as a reusable component across many feeds of structured and unstructured input data. Second, we demonstrate the utility of the framework in a real-world data stream processing case study that integrates Kafka and HDFS in a dataflow system powered by NiFi. The paper is organized as follows. Section II introduces the requirements of a data stream ingestion system. Following the requirements, we present the proposed framework and its features in Section III. Section IV details our experimental and evaluation results. A study of related work is presented in Section V. Finally, we give concluding remarks in Section VI.

II. REQUIREMENTS OF A STREAM INGESTION SYSTEM

The data ingestion layer serves to acquire, buffer, and optionally pre-process data streams (e.g., filter) before they are consumed by the analytics application. Important features to consider in data stream ingestion tools include the ease of installing, publishing, transporting, consuming, and archiving streams to disk. A data ingestion system should be able to support high throughput and low latency, and must scale to a large number of both data stream producers and consumers [12]. We categorize these requirements as source integration and pre-processing, fault tolerance and message delivery guarantees, provenance and security, backpressure and routing, scalability, and extensibility.

A. Source integration and preprocessing

Data streams may be sourced directly from HTTP/WebSockets Application Programming Interfaces (APIs), REST APIs, streaming APIs, IoT hubs, or through message queuing sources. One of the fundamental issues of stream computing is the challenge of collecting and integrating data from a multitude of sources. The most complex and difficult part of integrating data from various sources is the task of transforming the data into a common format [13]. When the sources of streaming data are diverse, for instance hundreds of sources emitting dozens of data formats, improving the rate of data ingestion and the efficiency of data processing becomes a challenging task [10]. Multiple applications may also be lined up to consume the ingested data. It is desirable to integrate the incoming streams into a single flow and then transform it in multiple ways to drive different applications concurrently [7].

Effective data stream ingestion involves the prioritization of data sources, the validation of individual files, and the routing of data items to a desired destination. Data stream ingestion systems should be able to verify and filter sources, language, and content format to ensure that the source integration is accurate, smooth, and free from noise (such as duplicates). Some pre-processing steps will be required to integrate global news data streams from various sources such as RSS feeds, blogs, and social media, which are mainly unstructured in nature. The technology for handling real-time data integration is more complex than that for static data [13].
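To make the common-format requirement concrete, the following minimal sketch (our illustration, not a component of any existing tool; the source layouts and field names are hypothetical) maps records arriving in source-specific shapes onto one common JSON layout before they enter the flow:

    # Minimal sketch: normalize heterogeneous records to a common schema.
    # The input schemas and field names below are illustrative assumptions.
    import json
    from datetime import datetime, timezone

    def normalize(record: dict, source: str) -> dict:
        """Map a source-specific record onto the common layout."""
        if source == "twitter":
            return {"source": source, "id": str(record["id"]),
                    "text": record.get("text", ""),
                    "published": record.get("created_at")}
        if source == "rss":
            return {"source": source,
                    "id": record.get("guid") or record.get("link"),
                    "text": record.get("title", ""),
                    "published": record.get("pubDate")}
        # Unknown source: keep the raw payload and stamp the arrival time.
        return {"source": source, "id": None, "text": json.dumps(record),
                "published": datetime.now(timezone.utc).isoformat()}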
B. Fault tolerance and message delivery guarantees

Data ingestion is expected to run on a large cluster that may be prone to software, hardware, and other third-party system failures. Failures can cause the loss of large amounts of data streams, which may lead to erroneous analytic results. Ingestion systems should incorporate high-availability mechanisms that allow them to operate continuously for a long time in spite of failures [5]. It is, therefore, desirable for ingestion systems to offer the desired degree of robustness in handling failures while minimizing data loss [7]. Data ingestion needs to support high throughput and low latency and must scale to a large number of both data producers and consumers [12].

C. Provenance and security

One of the reasons modern systems can now more easily handle streaming data is improvements in the way message-passing systems work. Highly effective messaging technologies collect streaming data from many sources (sometimes hundreds, thousands, or even millions) and deliver it to multiple consumers of the data, including but not limited to real-time applications. Effective message-passing capabilities are needed as a fundamental aspect of the streaming infrastructure.

D. Scalability

Multiple feeds with variable data arrival rates imply a varying demand for resources. An ingestion system should be scalable to be able to ingest increasingly large volumes of data from multiple sources. The system should also demonstrate elasticity by automatically scaling in or out in order to meet the varying demand for resources [7].

E. Backpressure and Routing

Backpressure in data stream processing is a situation where the ingestion layer or other components of a data stream processing system (DSPS) are unable to handle the rate at which data streams are received. An ingestion system should be able to buffer data in the case of temporary spikes in workload and provide a mechanism to replay it later. Rate throttling is a typical example of a backpressure mechanism for handling fast-arriving data streams: it is an artificial restriction on the rate at which tuples are delivered [5].
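As an illustration of the idea (our own sketch, under the simplifying assumption of a single-threaded consumer), the following buffers tuples during spikes and releases at most a fixed number per second:

    # Toy rate throttle: buffer fast arrivals, deliver at a capped rate.
    import time
    from collections import deque

    class Throttle:
        def __init__(self, max_per_sec: int):
            self.max_per_sec = max_per_sec
            self.buffer = deque()              # holds tuples during spikes

        def submit(self, item):
            self.buffer.append(item)           # accept at any arrival rate

        def drain(self, consume):
            """Replay buffered tuples at no more than max_per_sec."""
            while self.buffer:
                for _ in range(min(self.max_per_sec, len(self.buffer))):
                    consume(self.buffer.popleft())
                time.sleep(1.0)                # artificial delivery restriction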
F. Extensibility

Ingestion systems must be generic enough to work with a variety of data sources and high-level applications. A plug-and-play model is desired to allow extension of the offered functionality [7]. This provides the ability to add or remove consumers or new functionalities at any time without changing the data ingestion pipeline. Ingestion systems should also be able to filter any erroneous or malicious data items before transporting the data to the processing or storage layers.

III. PROPOSED FRAMEWORK

The focus of this paper is dataflow management, which refers to an automated and managed flow of information between systems. Our goal is to design a highly flexible dataflow management framework for large-scale data stream ingestion and integration with the capability to meet all or most of the challenges detailed in the previous section. The proposed dataflow management framework is shown in Fig. 1. The framework components (in blue) include (1) data stream acquisition from disparate sources, (2) data stream extraction, enrichment, and integration, and (3) data stream distribution to various downstream systems such as data stores or analytics platforms.
[Figure 1 omitted: a block diagram in which streaming data from multiple sources flows through (1) data stream acquisition, (2) data stream integration and extraction (merge, filter, route, validate, and enrich), and (3) data stream distribution to streaming analytics, storage, and data stores.]

Figure 1. Dataflow management framework
The following subsections describe each of these functionalities in the framework.

A. Data Stream Acquisition

Data acquisition is the entry point for bringing data into a processing platform. It involves the ingestion of data of different formats and from several sources. Streams of data, such as feeds from IoT devices or the Twitter firehose, arrive into the system over sockets from internal or external servers. These streams are then merged, filtered, and distributed to connected layers for processing, temporary storage, and persistence. The framework supports commonly used data access schemes such as sockets, Representational State Transfer (REST) APIs, streaming APIs, or custom schemes, as well as modern device and application interaction patterns such as publish-subscribe and stream protocols.

Rather than designing a new tool for data acquisition from scratch, we use an open-source dataflow management system called NiFi1. Young et al. [14] describe NiFi as a data-in-motion technology that uses flow-based processing. It provides data acquisition, basic event processing, and data distribution mechanisms. NiFi gives organizations a distributed and resilient platform for building enterprise dataflows [15]. It provides the capability to accommodate the diverse dataflows being generated by the connected world. NiFi enables seamless connections among databases, big data clusters, message queues, and devices. It incorporates visual command and control, provenance (data lineage), prioritization, buffering (back pressure), latency, throughput, security, scalability, and extensibility mechanisms [9]. We chose NiFi because it is highly configurable and provides a scalable and robust solution to handle the flow and integration of data streams of different formats from different sources through a cluster of machines. NiFi was designed to meet dataflow challenges that include network failures and crashes, excess loads, corrupt data, and rapid changes in organizational requirements, compliance, and security. MiNiFi2 is an interesting project aimed at extending NiFi's capabilities by collecting data at the edge or source of its creation and bringing it directly to a central NiFi instance.

For the purposes of our study, NiFi enables us to quickly build simple pipelines for prototyping before scaling to full production. These and many other features of NiFi, therefore, meet many data acquisition use case requirements.
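As a concrete illustration of one such access scheme (a sketch of our own; the host, port, and one-record-per-line format are assumptions), an external streaming HTTP source can be bridged into a NiFi flow whose entry point is a ListenHTTP processor:

    # Sketch: forward a line-delimited JSON stream into NiFi via ListenHTTP.
    # "contentListener" is ListenHTTP's default base path; host and port
    # are illustrative.
    import requests

    NIFI_URL = "http://nifi-host:9999/contentListener"   # assumed endpoint

    def forward(stream_url: str):
        # stream=True keeps the connection open and yields records
        # as the source emits them.
        with requests.get(stream_url, stream=True) as resp:
            resp.raise_for_status()
            for line in resp.iter_lines():
                if line:                                  # skip keep-alives
                    requests.post(NIFI_URL, data=line,
                                  headers={"Content-Type": "application/json"})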
B. Data Stream Extraction, Enrichment, and Integration

Depending on the nature of the incoming data and the intended applications, several tasks such as language, noise, and duplicate detection, content parsing, and data type transformations are performed on the ingested stream.

1) Extraction

Data streams can be ingested in their raw format onto different schemas to enable a variety of different kinds of downstream analytics. The NiFi Expression Language provides the ability to reference attributes, compare them to other values, and manipulate their values. It supports several data types including number, decimal, date, Boolean, and string. It is used heavily throughout the NiFi application for configuring processor properties and provides many different functions (taking zero or more arguments) to meet the needs of an automated dataflow. The functions can be chained together to create expressions for effective and efficient data stream extraction and manipulation. Near-duplicate detection of incoming data streams is a fundamental extraction task for effective stream ingestion. NiFi provides customizable processors such as DetectDuplicate, for detecting multiple copies of the same record in a dataflow, and ExecuteScript and ExecuteStreamCommand, for deduplication and filtering of erroneous or malicious data items before the data is transported to the processing or storage layers.
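To illustrate the Expression Language mentioned above, a few representative expressions follow (the title attribute is a hypothetical example; filename and mime.type are standard FlowFile attributes):

    ${filename:toUpper()}                     the filename attribute in upper case
    ${mime.type:equals('application/json')}   true when the content is typed as JSON
    ${title:trim():isEmpty()}                 chained functions: detect blank titles
    ${now():format('yyyy-MM-dd')}             stamp a FlowFile with the current date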
2) Enrichment

Enrichment is a common use case when working on data ingestion or flow management. It involves getting data from an external source (such as a database, file, or API) to add more details, context, or information to the data being ingested. Often the enrichment is done in batch using a join operation; doing the enrichment on data streams in real time, however, is more interesting. NiFi provides processors such as ISPEnrichIP, LookupAttribute, and LookupRecord for data stream enrichment tasks.
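The following sketch shows the shape of such a real-time enrichment (our illustration rather than the internals of these processors; the reference table and field names are assumed), in which each article is joined against publisher metadata as it passes through:

    # Sketch: enrich each streaming article from an in-memory lookup table.
    SOURCE_METADATA = {                          # assumed reference data
        "reuters.com": {"country": "UK", "category": "newswire"},
        "cbc.ca":      {"country": "CA", "category": "broadcaster"},
    }

    def enrich(article: dict) -> dict:
        """Attach publisher context to the article itself."""
        meta = SOURCE_METADATA.get(article.get("domain", ""), {})
        return {**article, **meta}               # lookup values added on the fly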
3) Integration

The practice associated with managing data that travels between applications, data stores, systems, and organizations is traditionally called data integration [13]. Several techniques for managing and integrating data in motion have been developed to decrease the complexity of interactions and increase scalability [6]. Depending on the nature of the incoming data stream, integration may be achieved at the data acquisition stage. Data integration within NiFi is achieved using processors such as MergeContent, MergeRecord, and PartitionRecord.

1 https://nifi.apache.org/
2 https://nifi.apache.org/minifi/
C. Data Stream Distribution

Big data ingestion is about moving data (especially unstructured data) from its sources or producers into big data stores (such as HDFS or Cassandra) or a processing system (such as Storm, Flink, or Spark Streaming) via message queuing systems (such as MQTT and Kafka). Although there are custom processors for connecting NiFi to big data stream processing engines such as Flink and Spark Streaming, using NiFi to deliver massive and high-velocity data streams to these systems in a complex multilevel analytics pipeline is not good practice, because if we were to introduce a new consumer of the data, for example a Spark Streaming job, the flow would have to be changed [1].

We chose Kafka, a high-throughput distributed messaging system which has recently become one of the most common landing places for data within an organization. A common scenario is for NiFi to act as a Kafka producer. In this case, MiNiFi can be utilized to bring data from sources directly to a central NiFi instance, which can then deliver the data to the appropriate Kafka topic. A more complex but interesting scenario is a bi-directional flow that combines the power of NiFi, Kafka, and a stream processing engine to create a dynamic, self-adjusting dataflow. NiFi-Kafka integration provides the ability to add and remove consumers at any time without changing the data ingestion pipeline [1]. Combining NiFi, Kafka, and Spark Streaming provides a compelling open-source architecture option for building next-generation, near-real-time ETL data pipelines. Next, we evaluate the framework in a case study that involves ingesting and distributing global news articles from several sources for media monitoring.
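The decoupling can be sketched with the kafka-python client as follows (the broker address, topic name, and consumer group are illustrative assumptions; in our setting the producing side is played by NiFi's Kafka processors rather than a script):

    # Sketch: producers and consumers meet only at a Kafka topic.
    import json
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(
        bootstrap_servers="kafka-host:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"))
    producer.send("news-articles", {"title": "Headline...", "source": "rss"})
    producer.flush()

    # A new consumer attaches later with its own group id; nothing in the
    # ingestion pipeline changes.
    consumer = KafkaConsumer(
        "news-articles",
        bootstrap_servers="kafka-host:9092",
        group_id="analytics",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    for message in consumer:
        print(message.value["title"])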
IV. EXPERIMENTAL EVALUATION

This section demonstrates how to consume massive data streams from streaming APIs using a global news and social media monitoring use case scenario.

A. Use Case Scenario

To illustrate the utility of the dataflow management framework, we use a media monitoring use case with the following functional requirements:

• Ingest streaming news article data from several sources such as RSS feeds and the Twitter Streaming API.
• Integrate the ingested streams with other news article sources being scraped continuously from a variety of specialized sources.
• Extract and filter noise such as fake news and duplicates in the data streams.
• Route news articles to relevant consumers such as persistent data stores or analytics engines.
• Support data enrichment, provenance, extensibility, and guaranteed message delivery.

In this case study, we demonstrate how to utilize the dataflow management framework shown in Fig. 2 to fetch, extract, enrich, integrate, and distribute live news stories from the Twitter Streaming API and Satori channels (Big RSS and Worldwide Live Data) [16]. Next, we describe how to achieve these tasks using a NiFi-Kafka dataflow in a cluster.

[Figure 2 omitted: live news sources (Big RSS, Worldwide Live Data, and Twitter) feed into the dataflow management layer, where news streams are acquired, integrated, and extracted with NiFi and distributed with Kafka.]

Figure 2. News articles processing infrastructure with a robust and scalable dataflow management framework

B. Global News Articles Dataflow

The main source of our data, Satori [16], is a cloud-based live platform which provides a publish-subscribe messaging service called RTM and makes available a set of free real-time data feeds as part of its Open Data Channels initiative. We ingest news stories from Big RSS, a live data channel in Satori for gathering RSS feeds and one of the largest RSS aggregators in the world, with over 6.5 million feeds3. Another very important source of streaming news stories utilized is the Twitter API platform [17]. Twitter Inc. offers a unified platform with scalable access to its data. There are currently two options (each with a varying number of filters and filtering capabilities) for streaming real-time Tweets. We used the standard (free) option, which allows 400 keywords, 5,000 user IDs, and 25 location boxes. There is also an enterprise option with premium operators that allows up to 250,000 filters (up to 2,048 characters each) per stream. The volume and velocity of data streams from the Twitter Streaming API depend on the popularity of the keyword queries. The entire flow was implemented using three local process groups in NiFi. Next, we describe the output of the dataflow framework and possible performance improvements.

3 https://www.satori.com/livedata/channels/big-rss
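For reference, a minimal connection to the standard filter stream looks roughly as follows (a sketch with placeholder credentials against the v1.1 statuses/filter endpoint in use at the time of this study; our flow itself consumes the stream through NiFi):

    # Sketch: track keywords on the v1.1 filter stream.
    import json
    import requests
    from requests_oauthlib import OAuth1

    auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",      # placeholders
                  "ACCESS_TOKEN", "ACCESS_SECRET")

    url = "https://stream.twitter.com/1.1/statuses/filter.json"
    with requests.post(url, auth=auth, stream=True,
                       data={"track": "breaking news,elections"}) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:                                      # skip keep-alives
                tweet = json.loads(line)
                print(tweet.get("text", ""))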
C. Dataflow Output and Performance Improvement

We present the output of the integrated news article sources (from Twitter, Big RSS, and a custom WebSocket) that is continuously consumed and saved in HDFS. Fig. 3 shows a screenshot of recently processed news articles. NiFi automatically records, indexes, and makes available provenance data as objects flow through the system. Data can be downloaded, replayed, tracked, and evaluated at numerous points along the dataflow path using the provenance user interface. This information is useful for troubleshooting, optimization, and other scenarios.

[Figure 3 omitted: screenshot of processed news articles stored in HDFS.]

Figure 3. Screenshot of processed news articles in HDFS
NiFi also provides a feature called data lineage (see Fig. 4), a visual representation of FlowFile content (generated here from the news processing case study) which helps to track data from its origin to its destination.

[Figure 4 omitted: data lineage graph generated from the news processing case study.]

Figure 4. Data Lineage in NiFi

NiFi also provides a graphical representation of FlowFile status history. This is useful for viewing various statistics, such as the number of bytes read, written, in, and out within the last five minutes. It also provides a feature called Back Pressure for specifying how much data should be allowed to exist in a queue before the source component is no longer scheduled to run. NiFi provides two configuration elements (object threshold and data size threshold) for Back Pressure. An object threshold specifies the maximum number of FlowFiles that should be queued up before applying back pressure (the default value is 10,000 objects). A data size threshold specifies the maximum amount of data that should be queued up before applying back pressure (the default value is 1 GB). Fig. 5 illustrates this concept from our evaluation, when Kafka was down due to system maintenance: the queue shows its maximum of 10,000 objects and is highlighted in red.
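The two thresholds behave like a bounded queue; the toy sketch below (ours, with NiFi's default limits hard-coded) shows the condition under which the upstream source stops being scheduled:

    # Toy model of NiFi's per-connection back pressure thresholds.
    from collections import deque

    OBJECT_THRESHOLD = 10_000        # NiFi default object threshold
    SIZE_THRESHOLD = 1 * 1024**3     # NiFi default data size threshold (1 GB)

    queue, queued_bytes = deque(), 0

    def offer(flowfile: bytes) -> bool:
        """Queue a FlowFile unless either threshold has been reached."""
        global queued_bytes
        if len(queue) >= OBJECT_THRESHOLD or queued_bytes >= SIZE_THRESHOLD:
            return False             # back pressure: source is not scheduled
        queue.append(flowfile)
        queued_bytes += len(flowfile)
        return True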
[Figure 5 omitted: screenshot of a NiFi connection queue under back pressure.]

Figure 5. Back Pressure in NiFi

NiFi has several repositories. The database repository keeps track of all changes made to the dataflow through the user interface (UI). The FlowFile repository holds all the attributes of the FlowFiles being processed; it allows NiFi to pick up where it left off in the event of a restart of the application due to an unexpected power outage, an inadvertent user or software restart, or an upgrade. If the FlowFile repository becomes corrupt or runs out of disk space, the state of the FlowFiles can be lost. The content repository is where all the actual content of the files being processed by NiFi resides; it is also where multiple historical versions of the content are held if data archiving is enabled. The content repository can be very I/O intensive depending on the nature of the dataflow. The provenance repository keeps track of the lineage of both current and past FlowFiles. As in any application, the overall performance is governed by the performance of the individual components. The basic NiFi configuration is far from ideal for high-volume and high-performance dataflows. Some NiFi processors can be CPU, I/O, and/or memory intensive. Because NiFi is a Java application and therefore runs inside a Java Virtual Machine (JVM), massive and high-velocity flows of continuous data streams can cause performance bottlenecks in memory, disk or network input/output (I/O), or the central processing unit (CPU). However, the default NiFi core settings can be configured for improved performance.
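As an example of such configuration (the values and paths are illustrative choices, not tuning recommendations from our evaluation), the JVM heap can be raised in conf/bootstrap.conf and the I/O-heavy repositories separated onto different disks in conf/nifi.properties:

    # conf/bootstrap.conf: enlarge the JVM heap beyond the modest defaults
    java.arg.2=-Xms4g
    java.arg.3=-Xmx4g

    # conf/nifi.properties: place I/O-intensive repositories on separate disks
    nifi.flowfile.repository.directory=/disk1/flowfile_repository
    nifi.content.repository.directory.default=/disk2/content_repository
    nifi.provenance.repository.directory.default=/disk3/provenance_repository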
V. RELATED WORK

Grover and Carey [7] developed ingestion support (data feeds) for a wide variety of data sources and applications in AsterixDB, an open-source big data management system. The system was designed to be used in AsterixDB and cannot be utilized for other use cases and applications outside AsterixDB. Park and Chi [18] proposed an open-source-based system for ingesting machine logs into a centralized HDFS cluster. The study lays a good foundation in this area, but the framework is lacking in fundamental theoretical aspects and in justification of the use of the various components of the system. Meehan et al. [11] discussed the major functional requirements of streaming ETL. Their study, however, does not consider decoupling the ingestion system from the other components, which is highly recommended in modern, large-scale architectures.

Qiao et al. [19] developed Gobblin, a generic data ingestion framework at LinkedIn. Gobblin was mainly driven by the fact that LinkedIn's data sources have become increasingly heterogeneous. It provides adaptors for commonly accessed data sources such as MySQL, S3, Kafka, and Salesforce. Other similar systems include Scribe [20], a messaging system at Facebook, and Siphon [21], a messaging system in Microsoft Azure HDInsight that utilizes Kafka. Marcu et al. [12] developed KerA, a data ingestion framework that alleviates the limitations of Kafka and other ingestion systems. The study focused on improving throughput, latency, and scalability, but it does not consider issues such as provenance and extensibility.

Our study extends all these related works by utilizing open-source NiFi-Kafka integration to provide robustness and scalability as well as the ability to add and remove consumers at any time without changing the data ingestion pipeline.
VI. CONCLUSIONS

Data ingestion is an essential activity in companies and organizations that collect and analyze large volumes of data. Continuous data streams usually arrive into big data processing and management systems from external sources and are either incrementally processed or used to populate a persisted dataset and associated indexes. To keep pace with massive and fast-moving data, stream processing systems must be able to ingest, process, and persist data on a continuous basis. Developing an infrastructure for ingesting large-scale, multi-source, high-velocity, and heterogeneous data streams involves a careful study of how these streams of data are produced and consumed.

A data stream ingestion system should be scalable, robust, and extensible to be able to support data flow between many independent data producers and consumers. This study proposed and developed a scalable and fault-tolerant dataflow management framework that can serve as a reusable component across many feeds of structured and unstructured input data. We demonstrated the utility of the framework in a real-world data stream processing case study that integrates Kafka and HDFS in a dataflow system powered by NiFi. We showed sample outputs from our initial experimental work and discussed how the system can be configured for improved performance.

Future work will explore the deployment, evaluation, and practicality of improving the performance of the system by trying various NiFi and Kafka settings for different applications. We will also conduct comparative experiments with other open-source and commercial data ingestion tools in order to provide a scientific study that can help the community in choosing the right framework for different applications.

ACKNOWLEDGMENT

Special thanks to the Southern Ontario Smart Computing Innovation Platform (SOSCIP) and IBM Canada for supporting this research project.
REFERENCES

[1] A. Morgan, A. Amend, D. George, and M. Hallett, Mastering Spark for Data Science, 2017.
[2] C. Ballard et al., IBM InfoSphere Streams: Assembling Continuous Insight in the Information Revolution. IBM Redbooks, 2012.
[3] B. Bryan. (2016). Integrating Apache NiFi and Apache Kafka. Available: https://community.hortonworks.com/articles/57262/integrating-apache-nifi-and-apache-kafka.html
[4] T. Dunning and E. Friedman, Streaming Architecture: New Designs Using Apache Kafka and MapR Streams. O'Reilly Media, Inc., 2016.
[5] O.-C. Marcu et al., "Towards a Unified Storage and Ingestion Architecture for Stream Processing," in Second Workshop on Real-time & Stream Analytics in Big Data, colocated with the 2017 IEEE International Conference on Big Data, 2017.
[6] A. G. Psaltis, Streaming Data: Understanding the Real-Time Pipeline. Manning Publications Company, 2017.
[7] R. Grover and M. J. Carey, "Data Ingestion in AsterixDB," in EDBT, 2015, pp. 605-616.
[8] M. D. de Assuncao, A. da Silva Veith, and R. Buyya, "Distributed data stream processing and edge computing: A survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1-17, 2018.
[9] D. Baev, "Managing Data in Motion with the Connected Data Architecture," in 4th Big Data & Business Analytics Symposium, 2017.
[10] P. Sangat, M. Indrawan-Santiago, and D. Taniar, "Sensor data management in the cloud: Data storage, data ingestion, and data retrieval," Concurrency and Computation: Practice and Experience, vol. 30, no. 1, p. e4354, 2018.
[11] J. Meehan, C. Aslantas, S. Zdonik, N. Tatbul, and J. Du, "Data Ingestion for the Connected World," in CIDR, 2017.
[12] O.-C. Marcu et al., "KerA: Scalable Data Ingestion for Stream Processing," in ICDCS 2018 - 38th IEEE International Conference on Distributed Computing Systems, 2018, pp. 1-6.
[13] A. Reeve, Managing Data in Motion: Data Integration Best Practice Techniques and Technologies. Newnes, 2013.
[14] R. Young, S. Fallon, and P. Jacob, "An Architecture for Intelligent Data Processing on IoT Edge Devices," in 2017 UKSim-AMSS 19th International Conference on Computer Modelling & Simulation (UKSim), 2017, pp. 227-232.
[15] B. M. West. (2018). Integrating IBM Streams with Apache NiFi - Streamsdev. Available: https://developer.ibm.com/streamsdev/docs/integrating-ibm-streams-apache-nifi/
[16] (2018). Satori: Live Data Channels. Available: https://www.satori.com/livedata/channels
[17] (2018). Streaming Realtime Tweets. Available: https://developer.twitter.com/en/docs/tweets/filter-realtime/overview
[18] J. Park and S.-y. Chi, "An implementation of a high throughput data ingestion system for machine logs in manufacturing industry," in 2016 Eighth International Conference on Ubiquitous and Future Networks (ICUFN), 2016, pp. 117-120.
[19] L. Qiao et al., "Gobblin: Unifying data ingestion for Hadoop," Proceedings of the VLDB Endowment, vol. 8, no. 12, pp. 1764-1769, 2015.
[20] G. J. Chen et al., "Realtime data processing at Facebook," in Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 1087-1098.
[21] A. Thomas. (2018). Siphon: Streaming data ingestion with Apache Kafka. Microsoft Azure Blog. Available: https://azure.microsoft.com/en-ca/blog/siphon-streaming-data-ingestion-with-apache-kafka/
