
Data Mining with Big Data

Abstract
Big Data concern large-volume, complex, growing data sets
with multiple, autonomous sources. With the fast development
of networking, data storage, and the data collection capacity,
Big Data are now rapidly expanding in all science and
engineering domains, including physical, biological and
biomedical sciences. This paper presents a HACE theorem
that characterizes the features of the Big Data revolution, and
proposes a Big Data processing model, from the data mining
perspective. This data-driven model involves demand-driven
aggregation of information sources, mining and analysis, user
interest modelling, and security and privacy considerations. We
analyse the challenging issues in the data-driven model and
also in the Big Data revolution.

What is Big Data?

It is the term for a collection of data sets so large and complex that it becomes difficult to process.
Data is growing exponentially, both structured and unstructured.

Big Data Everywhere!


Lots of data is being collected and warehoused:
Web data, e-commerce purchases at department/grocery stores
Bank/credit card transactions
Social networks

Data velocity per minute:
600+ videos uploaded to YouTube
1,500+ blog posts
7,000+ photos on Flickr
700,000+ Facebook updates
200 million+ emails sent
400,000+ tweets on Twitter
2 million+ Google search queries
400,000+ minutes of Skype calling
US$300,000+ spent on online shopping
Type of Data

Relational Data (Tables/Transactions/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data: Social Networks, Semantic Web (RDF)
Streaming Data: you can only scan the data once (a one-pass sketch follows)
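
Streaming data can only be scanned once, so any summary has to be maintained incrementally as elements arrive. Below is a minimal Java sketch of reservoir sampling, one standard single-pass technique (chosen here purely as an illustration, not prescribed by the slides): it keeps a fixed-size uniform random sample of the stream without ever storing the stream itself.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Single-pass summary of a stream: keep a uniform random sample of size k
// without storing the whole stream (classic reservoir sampling).
public class Reservoir {
    private final int k;
    private final List<Double> sample = new ArrayList<Double>();
    private final Random rng = new Random();
    private long seen = 0;

    public Reservoir(int k) {
        this.k = k;
    }

    // Called once per element as it streams by; after n elements each one
    // is present in the sample with probability k / n.
    public void offer(double value) {
        seen++;
        if (sample.size() < k) {
            sample.add(value);
        } else {
            long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
            if (j < k) {
                sample.set((int) j, value);
            }
        }
    }

    public List<Double> sample() {
        return sample;
    }
}

Feeding, say, per-minute event counts through offer() leaves a fixed-memory sample on which later statistics can be computed, which is exactly the constraint the streaming item above describes.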

What to do with these data?


Aggregation and Statistics: data warehousing and OLAP (a toy sketch follows after this list)
Indexing, Searching, and Querying: keyword-based search, pattern matching (XML/RDF)
Knowledge Discovery: data mining, statistical modeling
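
As a toy illustration of the "Aggregation and Statistics" row (not code from the paper), the group-by-and-count below is the same shape of computation a data warehouse or OLAP roll-up performs; it assumes Java 8+ for the Streams API, slightly newer than the JDK 1.7 baseline listed later.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Count purchases per category: a miniature "aggregation and statistics" step.
public class Aggregation {
    public static void main(String[] args) {
        List<String> purchases = Arrays.asList(
                "grocery", "electronics", "grocery", "clothing", "grocery");

        Map<String, Long> countPerCategory = purchases.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(countPerCategory); // e.g. {clothing=1, electronics=1, grocery=3}
    }
}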

How much data?

Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year

Data has grown tremendously.
This large amount of data is beyond the ability of commonly used software tools to manage.
Exploring the large volume of data and extracting useful information and knowledge is a challenge, and sometimes it is almost infeasible.

Existing System
Big Data applications are rising where data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
The most fundamental challenge for Big Data applications is to
explore the large volumes of data and extract useful information or
knowledge for future actions.
In many situations, the knowledge extraction process has to be very
efficient and close to real time because storing all observed data is
nearly infeasible.
The unprecedented data volumes require an effective data analysis
and prediction platform to achieve fast response and real-time
classification for such Big Data.

Proposed System
HACE theorem to model Big Data characteristics.
The characteristics of HACE make it an extreme challenge to discover useful knowledge from Big Data.
Provide the most relevant and most accurate social sensing feedback.

HACE Theorem

H - Heterogeneous
A - Autonomous
C - Complex
E - Evolving

The HACE theorem suggests that the key characteristics of Big Data are:
huge, with heterogeneous and diverse data sources;
autonomous, with distributed and decentralized control;
complex and evolving in data and knowledge associations.

Proposed System: Big Data Characteristics (HACE Theorem)

Huge Data with Heterogeneous and Diverse Dimensionality
Autonomous Sources with Distributed and Decentralized Control
Complex and Evolving Relationships

Mining Architecture

System Architecture

System Modules

Integrating and mining bio-data


Big Data Fast Response
Pattern matching and mining
Key technologies for integration and mining
Group influence and interactions

IMPLEMENTATION DETAILS
Applications of the k-Means Clustering Algorithm
Determining the number of clusters in a set of data
Setting up the experiments
(A minimal k-means sketch follows.)
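
The slides list k-means among the implementation details but show no code. Below is a minimal Lloyd-style k-means sketch in Java on points stored as double[n][d]; the array layout, seeding strategy, and parameter names are illustrative choices, not taken from the paper. Choosing k itself (the "number of clusters" item above) is a separate step, e.g. comparing within-cluster error over candidate values of k.

import java.util.Arrays;
import java.util.Random;

// Minimal Lloyd-style k-means on d-dimensional points stored as double[n][d].
public class KMeans {

    public static double[][] cluster(double[][] points, int k, int maxIters, long seed) {
        int n = points.length, d = points[0].length;
        Random rng = new Random(seed);

        // Initialize centroids with k random data points (naive seeding; may pick duplicates).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rng.nextInt(n)].clone();
        }

        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIters; iter++) {
            boolean changed = false;

            // Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.2, 0.8}, {0.9, 1.1}, {8, 8}, {8.2, 7.9}, {7.8, 8.1} };
        System.out.println(Arrays.deepToString(cluster(data, 2, 100, 42)));
    }
}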

Data Mining Challenges with Big Data


A Big Data processing framework with three tiers, from inside out:
Big Data Mining Platform (Tier I)
Big Data Semantics and Application Knowledge (Tier II)
Big Data Mining Algorithms (Tier III)

Data accessing and computing (Tier I)


Computing Platform
Data and computing processors located at different locations
Data Mining Tasks such as MapReduce (see the sketch below)
Data Volume
Support from Industrial Stakeholders
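
Tier I names MapReduce as the representative model for distributed data mining tasks. Below is a minimal sketch of a Hadoop MapReduce job, the standard word-count example rather than a mining algorithm from the paper, assuming a Hadoop 2.x client library on the classpath; a real mining task would swap the tokenizing mapper and summing reducer for feature extraction and partial-model aggregation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be launched with something like "hadoop jar wordcount.jar WordCount /input /output" (paths illustrative); YARN, described later, schedules the map and reduce containers.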

Data privacy and domain knowledge (Tier II)


Semantics and Domain knowledge
Technical barriers to Big Data access
The Important Concepts are:
Information Sharing and Data Privacy
Domain and Application Knowledge

Big Data mining algorithms (Tier III)


Concentrates on algorithm design
Deals with problems raised by Big Data volumes, distributed data distributions, and complex and dynamic data characteristics
Important Concepts:
Local Learning and Model Fusion for Multiple Information Sources (see the fusion sketch below)
Mining from Sparse, Uncertain, and Incomplete Data
Mining Complex and Dynamic Data
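
The "local learning and model fusion" item boils down to: each autonomous source trains its own model, and only the models or their predictions travel to the global level. A minimal sketch follows, assuming a generic LocalModel interface and plain unweighted majority voting; both are illustrative stand-ins, not the fusion scheme defined in the paper.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local learning and model fusion in its simplest form: every autonomous
// source contributes a locally trained classifier, and the global decision
// is an unweighted majority vote over their predictions.
public class ModelFusion {

    // Illustrative stand-in for a model trained at one autonomous source.
    public interface LocalModel {
        String predict(double[] features);
    }

    public static String fuse(List<LocalModel> localModels, double[] features) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (LocalModel model : localModels) {
            String label = model.predict(features);
            Integer count = votes.get(label);
            votes.put(label, count == null ? 1 : count + 1);
        }

        // Return the label with the most votes (ties broken arbitrarily).
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}

Weighted voting, or a meta-learner trained on the local outputs, drops in at the same point without moving the raw data between sources.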

Hadoop 2.0

HDFS Architecture

YARN-Map Reduce

YARN - a general-purpose resource management system for Hadoop that allows MapReduce and other data processing frameworks and services to run
High Availability for HDFS
HDFS Federation
HDFS Snapshots
NFSv3 access to data in HDFS
Support for running Hadoop on Microsoft Windows
Binary compatibility for MapReduce applications built on hadoop-1.x
Substantial amount of integration testing with the rest of the projects in the ecosystem
(A minimal HDFS client read sketch follows.)
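
The HDFS architecture slide above is a diagram; as a concrete touchpoint, the sketch below reads a file through the HDFS Java client, assuming a Hadoop 2.x client on the classpath and an illustrative path (not taken from the slides). The NameNode resolves the path to block locations and the DataNodes serve the bytes, but the client code only sees an ordinary stream.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read: open a path on the cluster's default file system
// and stream it line by line like any other input.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/input.txt"); // illustrative path

        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}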

Technologies Used

Operating System: Windows XP/7
Programming Language: Java
Software Version: JDK 1.7 or above
Database: MySQL
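
The stack above pairs Java with MySQL, which in practice means JDBC. A minimal sketch follows, assuming the MySQL Connector/J driver is on the classpath and using illustrative connection settings, table, and column names (none of these appear in the slides).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal JDBC access to MySQL: open a connection and run an aggregate query.
public class MiningResultsDao {
    public static void main(String[] args) throws Exception {
        // Older Connector/J versions may need Class.forName("com.mysql.jdbc.Driver") first.
        String url = "jdbc:mysql://localhost:3306/bigdata_demo"; // illustrative database
        try (Connection conn = DriverManager.getConnection(url, "demo_user", "demo_pass");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT cluster_id, COUNT(*) FROM mining_results GROUP BY cluster_id")) {
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}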

Expected Outputs/Results
Provide the most relevant and most accurate social sensing feedback to better understand our society in real time.
The reason for this architectural evolution is more efficient resource management, which fosters scalability of clusters beyond several thousand nodes.
At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors.

We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains.

Conclusion

The HACE theorem suggests that the key characteristics of Big Data are:
Huge with heterogeneous and diverse data sources,
Autonomous with distributed and decentralized control, and
Complex and evolving in data and knowledge associations.
Such combined characteristics suggest that Big Data requires a "big mind" to consolidate data for maximum values.

References
X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, Jan. 2014.
F. Michel, "How Many Photos Are Uploaded to Flickr Every Day and Month?" http://www.flickr.com/photos/franckmichel/6855169886/, 2012.
J. Mervis, "U.S. Science Policy: Agencies Rally to Tackle Big Data," Science, vol. 336, no. 6077, p. 22, 2012.
Nature Editorial, "Community Cleverness Required," Nature, vol. 455, no. 7209, p. 1, Sept. 2008.
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-Core and Multiprocessor Systems," Proc. IEEE 13th Int'l Symp. High Performance Computer Architecture (HPCA '07), pp. 13-24, 2007.
IBM, "What Is Big Data: Bring Big Data to the Enterprise," http://www01.ibm.com/software/data/bigdata/, 2012.

References cont..
A. Labrinidis and H. Jagadish, "Challenges and Opportunities with Big Data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
Twitter Blog, "Dispatch from the Denver Debate," http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.
C. Reed, D. Thompson, W. Majid, and K. Wagstaff, "Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision," Proc. Int'l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.
P. Dewdney, P. Hall, R. Schilizzi, and J. Lazio, "The Square Kilometre Array," Proc. IEEE, vol. 97, no. 8, pp. 1482-1496, Aug. 2009.
P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.

References Cont..
G. Duncan, "Privacy by Design," Science, vol. 317, pp. 1178-1179, 2007.
B. Efron, "Missing Data, Imputation, and the Bootstrap," J. Am. Statistical Assoc., vol. 89, no. 426, pp. 463-475, 1994.
A. Ghoting and E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics," Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.
D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," Berkeley, Dec. 2006.

List of Publications
Data Mining and Knowledge Discovery for Big Data in Health
Informatics
Big Data in Agriculture (Still in progress)

Thank You

Queries?
