
International Journal of Scientific Research Engineering & Technology (IJSRET), ISSN 2278 0882

Volume 4, Issue 7, July 2015

Big Data Processing Model from Mining Perspective


Swathi Sama 1, D. Venkateshwarlu 2, Prof. Ravi Mathey 3
1, 2 Department of Computer Science and Engineering, JNT University, Hyderabad
3 Head of Department of Computer Science and Engineering, JNT University, Hyderabad

ABSTRACT
In the Internet era, the volume of data we deal with has grown to terabytes and petabytes. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to capture, store, search, share, analyze, and visualize data. Traditional data modeling focuses on resolving the complexity of relationships among schema-enabled data. However, these considerations do not apply to non-relational, schema-less databases, so the old ways of data modeling no longer hold, and a new methodology is needed to manage big data for maximum business value. This paper discusses the HACE theorem, which characterizes the features of the Big Data revolution, and proposes a Big Data processing model from the data mining perspective.


Keywords - autonomous sources, big data, data mining, evaluation of complex data, processing model.

I. INTRODUCTION

The mantra of the moment, in every field from retail to healthcare, is Big Data, defined as data sets that are too large and complex to manipulate with standard methods or tools. Analyzing these data sets is quickly becoming the basis for competition, productivity, and innovation; in fact, some predict that Big Data will be as important to business and society as the Internet has become. It is already being used to predict where and when crimes will occur, where flu will strike, and where traffic will snarl, all of which is very useful for deploying limited resources such as police forces, health care professionals, or traffic lights.

II. CHARACTERISTICS OF BIG DATA

When do we say we are dealing with Big Data? For some people 1 TB might seem big, for others 10 TB might be big, for others 100 GB might be big, and something else for others. The term is qualitative and cannot really be quantified, so we instead identify Big Data by a few characteristics that are specific to it. These characteristics are popularly known as Volume, Velocity, and Variety, as shown in Fig. 1 below.

Fig. 1: The 3 V's of Big Data


Volume refers to the size of the data we are working with. With the advancement of technology and the rise of social media, the amount of data is growing very rapidly. This data is spread across different places, in different formats, in large volumes ranging from gigabytes to terabytes, petabytes, and beyond. Today, data is not only generated by humans; large amounts of data are generated by machines, and machine-generated data now surpasses human-generated data. This size aspect of data is referred to as Volume in the Big Data world.
Velocity refers to the speed at which data is being generated. Different applications have different latency requirements, and in today's competitive world, decision makers want the necessary data and information in the least amount of time possible.


Generally this means near real time, or real time in certain scenarios. In different fields and areas of technology, we see data being generated at different speeds. A few examples include trading and stock exchange data, tweets on Twitter, and status updates, likes, and shares on Facebook. This speed aspect of data generation is referred to as Velocity in the Big Data world.
Variety refers to the different formats in which data is generated and stored. Different applications generate and store data in different formats. In today's world, large volumes of unstructured data are being generated in addition to the structured data produced in enterprises. Until recent advances in Big Data technologies, the industry did not have powerful and reliable tools that could work with the voluminous unstructured data we see today. Organizations can no longer rely only on the structured data in enterprise databases and warehouses; to stay competitive they are also forced to consume large amounts of data generated both inside and outside the enterprise, such as clickstream data and social media. Apart from traditional flat files, spreadsheets, and relational databases, a great deal of unstructured data is stored in the form of images, audio files, video files, web logs, sensor data, and many other formats. This aspect of varied data formats is referred to as Variety in the Big Data world.

III. REALITY OF BIG DATA

Our capacity to generate data has never been so powerful and enormous as in the big data era, ever since the invention of information technology in the mid-nineteenth century. As one example, on 4 October 2012 the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours [3]. Among all these tweets, the specific moments that generated the most discussion actually revealed public interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means of sensing public interest and generating feedback in real time, and are more appealing compared to generic media such as radio or TV broadcasting.

IV. PROBLEM STATEMENT

Given the emerging growth of the Big Data trend stated above, it is very important to manage very large volumes of data with mining techniques. However, existing mining algorithms have been tested and used only on medium-sized data sets. In this paper, a Big Data mining model is therefore proposed to handle big data processing operations.

V. EXISTING APPROACHES

Currently, Big Data processing mainly depends on parallel programming models like MapReduce, as well as on providing cloud computing platforms of Big Data services for the public. MapReduce is a batch-oriented parallel computing model, and there is still a certain gap in performance compared with relational databases.
Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan the training data repeatedly to obtain the statistics used to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently. To improve the efficiency of such algorithms, Chu et al. proposed a general-purpose parallel programming framework, applicable to a large number of machine learning algorithms, based on the simple MapReduce programming model on multicore processors.
Ten classical data mining algorithms are realized in the framework, including locally weighted linear regression, k-means, logistic regression, naive Bayes, linear support vector machines, independent variable analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks [1]. From the analysis of these classical machine learning algorithms, the authors argue that the computational operations in the learning process can be transformed into summation operations over a number of training data subsets. These summation operations can be performed on different subsets independently and are easily parallelized on the MapReduce programming platform [1]. In this way, a large-scale data set can be divided into several subsets and assigned to multiple Mapper nodes, and various summation operations can then be performed on the Mapper nodes to collect intermediate results.


Finally, the learning algorithms are executed in parallel by combining the summations on the Reduce nodes.
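As a rough illustration of this summation form (a minimal sketch, not the framework of Chu et al. [1]), the following Python snippet computes the sufficient statistics of least-squares linear regression as independent partial sums over data subsets, standing in for Mapper nodes, and combines them in a single reduce step. The function names and the toy data are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def mapper(X_part, y_part):
    """Compute partial sufficient statistics (X^T X, X^T y) on one data subset."""
    return X_part.T @ X_part, X_part.T @ y_part

def reducer(partials):
    """Sum the partial statistics and solve the normal equations for the weights."""
    XtX = sum(p[0] for p in partials)
    Xty = sum(p[1] for p in partials)
    return np.linalg.solve(XtX, Xty)

# Toy data, split into subsets as if assigned to separate Mapper nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

subsets = np.array_split(np.arange(1000), 4)            # 4 "Mapper nodes"
partials = [mapper(X[idx], y[idx]) for idx in subsets]  # map phase
weights = reducer(partials)                             # reduce phase
print(weights)  # close to [2.0, -1.0, 0.5]
```

Because each partial sum depends only on its own subset, the map phase parallelizes trivially, which is exactly the property the summation form exploits.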

VI. RESEARCH INITIATIVES

To tackle the Big Data challenges and seize the opportunities afforded by the new, data-driven revolution, the US National Science Foundation (NSF), under the Obama Administration's Big Data initiative, announced the BIGDATA solicitation in 2012. This federal initiative has resulted in a number of winning projects investigating the foundations of Big Data management (led by the University of Washington), analytical approaches for genomics-based massive data computation (led by Brown University), large-scale machine learning techniques for high-dimensional data sets that may be as large as 500,000 dimensions (led by Carnegie Mellon University), social analytics for large-scale scientific literature (led by Rutgers University), and several others. These projects seek to develop methods, algorithms, frameworks, and research infrastructures that bring the massive amounts of data down to a human-manageable and interpretable scale. Funding agencies in other countries, such as the National Natural Science Foundation of China (NSFC), are also catching up with national grants on Big Data research.

VII. PROPOSED SOLUTION

For an intelligent learning database system [2] to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and to provide treatments for the characteristics highlighted by the HACE theorem mentioned above. A conceptual view of the Big Data processing framework includes three tiers, from the inside out, covering data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).
The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform has to take distributed, large-scale data storage into consideration. For instance, typical data mining algorithms require all data to be loaded into main memory; this, however, is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive (e.g., subject to intensive network communication and other I/O costs), even if we do have a super-large main memory to hold all data for computing.
The challenges at Tier II center on semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, but it can also add technical barriers to Big Data access (Tier I) and mining algorithms (Tier III). For example, depending on the application domain, the data privacy and information sharing mechanisms [7] between data producers and data consumers can be significantly different. Sharing sensor network data for applications like water quality monitoring may not be discouraged, whereas releasing and sharing mobile users' location information is clearly not acceptable for the majority, if not all, of applications. In addition to these privacy issues, the application domains can also provide information that benefits or guides the design of Big Data mining algorithms. For instance, in market basket transaction data, each transaction is considered independent, and the discovered knowledge is typically represented by finding highly correlated items, possibly with respect to different temporal and/or spatial constraints. In a social network, on the other hand, users are linked and share dependency structures; the knowledge is then represented by user communities, leaders in each group [6], social influence modeling, and so on. Therefore, understanding semantics and application knowledge is important both for low-level data access and for high-level mining algorithm design.
At Tier III, the data mining challenges concentrate on algorithm designs that tackle the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The cycle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed [4] by data fusion techniques. Second, complex and dynamic data are mined after preprocessing.


Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. The model and parameters are then adjusted according to this feedback. In the whole process, information sharing is not only a guarantee of the smooth progress of each stage, but also a purpose of Big Data processing.
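The control flow of this three-stage cycle can be sketched in code. The snippet below is only a schematic under stated assumptions: the preprocess, mine, and fuse_and_test callables are hypothetical placeholders, not functions prescribed by the framework, standing in for whatever data fusion, local mining, and model fusion techniques a concrete system would supply.

```python
from typing import Any, Callable, Iterable, Optional, Tuple

def big_data_mining_cycle(
    sources: Iterable[Any],
    preprocess: Callable[[Iterable[Any], dict], list],
    mine: Callable[[Any, dict], Any],
    fuse_and_test: Callable[[list], Tuple[Any, Optional[dict]]],
    max_rounds: int = 5,
) -> Any:
    """Schematic of the Tier III cycle: preprocess -> mine locally -> fuse, test, feed back."""
    params: dict = {}        # model parameters adjusted by feedback
    global_model = None
    for _ in range(max_rounds):
        data_subsets = preprocess(sources, params)              # Stage 1: data fusion / preprocessing
        local_models = [mine(s, params) for s in data_subsets]  # Stage 2: local mining
        global_model, feedback = fuse_and_test(local_models)    # Stage 3: model fusion and testing
        if not feedback:                                         # nothing left to correct: stop
            break
        params.update(feedback)                                  # relevant information fed back
    return global_model
```

The loop terminates either when the fused global model passes the test stage with no corrective feedback or after a fixed number of rounds, mirroring the feedback-driven adjustment described above.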

VIII. DATA MINING ALGORITHMS USED

In this paper, two popular data mining algorithms, Apriori and FP-growth, are used to manage big data analysis operations.
Apriori (Fig. 2) is an algorithm for frequent itemset mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database.

Fig 2: Apriori Algorithm
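To make the level-wise procedure concrete, here is a minimal, self-contained Python sketch of Apriori using absolute support counts and no rule generation. It illustrates the general algorithm, not the specific implementation evaluated in this paper, and the toy transaction database is invented for the example.

```python
from itertools import combinations

def apriori(transactions, min_support=2):
    """Return a dict mapping each frequent itemset to its absolute support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent individual items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    all_frequent, k = {}, 1
    while current:
        all_frequent.update({s: support(s) for s in current})
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into size-k unions.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return all_frequent

# Example run on a toy transaction database.
db = [["bread", "milk"],
      ["bread", "diapers", "beer", "eggs"],
      ["milk", "diapers", "beer", "cola"],
      ["bread", "milk", "diapers", "beer"],
      ["bread", "milk", "diapers", "cola"]]
for itemset, sup in sorted(apriori(db, min_support=3).items(), key=lambda kv: -kv[1]):
    print(set(itemset), sup)
```

Each pass over the database grows the itemsets by one item, which is why Apriori can become expensive on very large transaction sets.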


FP-growth is a program that finds frequent itemsets (as well as closed and maximal itemsets and generators) with the FP-growth algorithm [5]. It represents the transaction database as a prefix tree enhanced with links that organize the nodes into lists referring to the same item. The search is carried out by projecting the prefix tree, working recursively on the result, and pruning the original tree. The implementation also supports filtering for closed and maximal itemsets with conditional itemset repositories, as has been suggested, although the approach used in the program differs insofar as it uses top-down prefix trees rather than FP-trees. It does not include the clever implementation of FP-trees with two integer arrays that has also been suggested.
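For comparison, the sketch below shows the standard FP-growth idea in plain Python: the transaction database is compressed into a prefix (FP) tree whose header table links all nodes holding the same item, and mining proceeds by recursively projecting conditional pattern bases. It follows the textbook FP-tree formulation rather than the top-down prefix-tree variant referred to above, and the toy data are illustrative.

```python
from collections import defaultdict

class Node:
    """One node of the FP (prefix) tree."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_support):
    # Count item frequencies and keep only items meeting the support threshold.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: -frequent[i])  # descending frequency

    root = Node(None, None)
    header = defaultdict(list)  # item -> node links (all tree nodes holding that item)
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):  # insert frequent items in global order
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return header, order

def fp_growth(transactions, min_support, suffix=()):
    """Yield (itemset, support) pairs for all frequent itemsets."""
    header, order = build_tree(transactions, min_support)
    for item in reversed(order):  # process least frequent items first
        nodes = header[item]
        support = sum(n.count for n in nodes)
        itemset = (item,) + suffix
        yield itemset, support
        # Conditional pattern base: the prefix path of every node, repeated count times.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            conditional.extend([path] * node.count)
        # Recurse on the projected (conditional) database.
        yield from fp_growth(conditional, min_support, itemset)

transactions = [["bread", "milk"],
                ["bread", "diapers", "beer", "eggs"],
                ["milk", "diapers", "beer", "cola"],
                ["bread", "milk", "diapers", "beer"],
                ["bread", "milk", "diapers", "cola"]]
for itemset, sup in fp_growth(transactions, min_support=3):
    print(itemset, sup)
```

On the same toy database this produces the same frequent itemsets as the Apriori sketch, but it scans the data only while building the tree and its projections rather than generating candidates level by level.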

Fig 3: FP Growth

Fig 4: Algorithm Comparison

IX. CONCLUSION

To investigate Big Data, we have examined several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data. At the data level, the autonomous data sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other circumstances, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and to fuse decisions from multiple sources in order to obtain the best model out of the Big Data.


At the system level, the essential challenge is that a Big Data mining framework needs to consider the complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors.

REFERENCES
[1] C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.
[2] X. Wu, "Building Intelligent Learning Database Systems," AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.
[3] Twitter Blog, "Dispatch from the Denver Debate," http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
[4] D. Luo, C. Ding, and H. Huang, "Parallelization with Multiplicative Algorithms for Big Data Mining," Proc. IEEE 12th Int'l Conf. Data Mining, pp. 489-498, 2012.
[5] X. Wu and X. Zhu, "Mining with Noise Knowledge: Error-Aware Data Mining," IEEE Trans. Systems, Man and Cybernetics, Part A, vol. 38, no. 4, pp. 917-932, July 2008.
[6] R. Chen, K. Sivakumar, and H. Kargupta, "Collective Mining of Bayesian Networks from Distributed Heterogeneous Data," Knowledge and Information Systems, vol. 6, no. 2, pp. 164-187, 2004.
[7] P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.

