
The Pathologies of Big Data

The author of this paper defines big data as working with any sufficiently large set of data that cannot be managed using the standard procedures and technologies of the day. From this definition, the author proceeds to highlight some problems with the current set of technologies and techniques that need to be addressed in order to succeed in a big data setting.
The first problem lies in extracting and working with data stored in traditional relational databases: although moving data into a relational database can be done efficiently, getting it back out for analysis is much harder. Additionally, relational databases tend to be very inefficient at storing data with some sort of sequencing or ordering to it. For example, time series data is not stored in a manner that allows it to be queried easily. Since in practice users often need time series or other sequenced datasets, we need a new database model that can query this type of data more efficiently.
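To make the ordering point concrete, the sketch below is a hypothetical Python illustration (not code from the article): a time-range query over unordered rows forces a full scan, while the same data kept sorted by timestamp can be binary-searched and then read as one contiguous, sequential slice.

    import bisect
    import random

    # Hypothetical data: one million (timestamp, value) observations.
    random.seed(0)
    n = 1_000_000
    unordered_rows = [(random.uniform(0, 1e6), random.random()) for _ in range(n)]

    # With no ordering, a time-range query must scan every row.
    def range_query_scan(rows, t_start, t_end):
        return [v for t, v in rows if t_start <= t <= t_end]

    # Keeping the same data sorted by timestamp lets us binary-search the
    # window boundaries and read one contiguous slice -- sequential access only.
    ordered_rows = sorted(unordered_rows)
    timestamps = [t for t, _ in ordered_rows]

    def range_query_sorted(t_start, t_end):
        lo = bisect.bisect_left(timestamps, t_start)
        hi = bisect.bisect_right(timestamps, t_end)
        return [v for _, v in ordered_rows[lo:hi]]

    # Both queries return the same values, but the sorted version touches only
    # the rows inside the window instead of all n rows.
    assert sorted(range_query_scan(unordered_rows, 1000.0, 2000.0)) == \
           sorted(range_query_sorted(1000.0, 2000.0))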
The second problem arises when we run out of memory while processing large datasets. When many applications exhaust primary memory, they move data to secondary memory, such as a hard drive, and the program then runs a great deal slower. This problem is compounded further if the data is accessed randomly instead of sequentially. Other applications, however, simply stop working once they run out of primary memory. Since Jacobs wrote this article in 2009, hardware developers have made strides toward addressing specific problems surrounding big data. For example, NVIDIA has developed a line of graphics cards called Tesla, designed for working with and processing big data on servers. The Tesla series includes far more primary memory and allows much larger bandwidth for data transfers, which facilitates better performance [1].
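As a concrete sketch of working within limited primary memory, the Python example below (the file name, size, and dtype are assumptions for illustration, not details from the article) streams an on-disk dataset sequentially in fixed-size chunks, so the working set stays small and access stays sequential; touching the same file in random order would pay the cost of seeks and defeated prefetching that the article warns about.

    import os
    import tempfile
    import numpy as np

    # Assumed demo file: ~160 MB of float64 values written once to disk.
    path = os.path.join(tempfile.gettempdir(), "bigdata_demo.bin")
    n = 20_000_000
    if not os.path.exists(path):
        np.arange(n, dtype=np.float64).tofile(path)

    chunk = 1_000_000          # roughly 8 MB held in memory at any moment
    total = 0.0
    with open(path, "rb") as f:
        while True:
            block = np.fromfile(f, dtype=np.float64, count=chunk)
            if block.size == 0:
                break
            total += block.sum()   # each chunk is read sequentially, then discarded

    print(f"sum of {n} values, holding at most one chunk in memory: {total:.3e}")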
The third and final bottleneck discussed is working in a distributed environment. There are several problems with distributed computing in big data, including that not all operations can be distributed and that it may be hard to balance the load across all nodes. Also, if computations require a lot of communication between nodes, then you might not see any improvement from distributing the work, although the author admits that this problem has many simple solutions. Another problem is that if some part of the distributed system fails, then the whole database or program running on it is in jeopardy. Since the time of writing, new database models have been developed that work with large datasets more easily, such as Hadoop. Hadoop is explicitly designed to work with large datasets over a distributed computing system and addresses many of the problems that Jacobs discusses in the article; for example, if there is a hardware failure somewhere in the distributed system, Hadoop can handle it automatically [2]. Hadoop also takes greater advantage of data locality than parallel relational databases do, by automatically separating data into blocks and spreading them across the distributed system [3].
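To illustrate the programming model that lets Hadoop spread work across blocks and re-run failed tasks, here is a minimal word-count sketch for Hadoop Streaming in Python; the script name and the map/reduce dispatch are assumptions for illustration, not something the article prescribes.

    #!/usr/bin/env python3
    # Minimal word count for Hadoop Streaming. Hadoop splits the input into
    # blocks, runs the mapper on nodes that already hold those blocks (data
    # locality), sorts the mapper output by key, and feeds it to the reducer;
    # if a node fails, the framework reschedules its tasks elsewhere.
    import sys
    from itertools import groupby

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")            # emit (word, 1) pairs

    def reducer():
        # Mapper output arrives sorted by key, so all counts for one word
        # appear together and can be summed in a single pass.
        pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        mapper() if sys.argv[1:] == ["map"] else reducer()

A job like this would typically be submitted with the hadoop-streaming jar, passing -input, -output, -mapper, and -reducer options; Hadoop then takes care of block placement, scheduling, and retrying tasks on healthy nodes.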

References:
[1] NVIDIA Tesla K80, http://www.nvidia.com/object/tesla-k80.html
[2] Apache Hadoop, http://hadoop.apache.org/
[3] Data locality: HPC vs. Hadoop vs. Spark, http://www.datascienceassn.org/content/data-locality-hpc-vs-hadoop-vs-spark
