
Data Mining with Big Data

Abstract
Big Data concern large-volume, complex, growing data sets
with multiple, autonomous sources. With the fast development
of networking, data storage, and the data collection capacity,
Big Data are now rapidly expanding in all science and
engineering domains, including physical, biological and
biomedical sciences. This paper presents a HACE theorem
that characterizes the features of the Big Data revolution, and
proposes a Big Data processing model, from the data mining
perspective. This data-driven model involves demand-driven
aggregation of information sources, mining and analysis, user
interest modelling, and security and privacy considerations. We
analyse the challenging issues in the data-driven model and
also in the Big Data revolution.

What is Big Data?

It is the term for a collection of data sets so large and complex that it becomes difficult to process.
Data is growing exponentially, both structured and unstructured.

Big Data Everywhere!


Lots of data is being collected and warehoused:
Web data, e-commerce purchases at department/grocery stores
Bank/credit card transactions
Social networks

Data velocity per minute:
600+ videos uploaded to YouTube
1,500+ blog posts
7,000+ photos on Flickr
700,000+ Facebook updates
200 million+ emails sent
400,000+ tweets on Twitter
2 million+ Google search queries
400,000+ minutes of Skype calling
US$300,000+ spent on online shopping
Type of Data

Relational Data (Tables/Transactions/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data: Social Networks, Semantic Web (RDF)
Streaming Data: you can only scan the data once (a one-pass sketch follows)
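
Streaming data can only be scanned once, so any summary has to be maintained incrementally as elements arrive. Below is a minimal Java sketch of reservoir sampling, one standard single-pass technique (chosen here purely as an illustration, not prescribed by the slides): it keeps a fixed-size uniform random sample of the stream without ever storing the stream itself.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Single-pass summary of a stream: keep a uniform random sample of size k
// without storing the whole stream (classic reservoir sampling).
public class Reservoir {
    private final int k;
    private final List<Double> sample = new ArrayList<Double>();
    private final Random rng = new Random();
    private long seen = 0;

    public Reservoir(int k) {
        this.k = k;
    }

    // Called once per element as it streams by; after n elements each one
    // is present in the sample with probability k / n.
    public void offer(double value) {
        seen++;
        if (sample.size() < k) {
            sample.add(value);
        } else {
            long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
            if (j < k) {
                sample.set((int) j, value);
            }
        }
    }

    public List<Double> sample() {
        return sample;
    }
}

Feeding, say, per-minute event counts through offer() leaves a fixed-memory sample on which later statistics can be computed, which is exactly the constraint the streaming item above describes.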

What to do with these data?


Aggregation and Statistics: data warehousing and OLAP (a toy sketch follows after this list)
Indexing, Searching, and Querying: keyword-based search, pattern matching (XML/RDF)
Knowledge Discovery: data mining, statistical modeling
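
As a toy illustration of the "Aggregation and Statistics" row (not code from the paper), the group-by-and-count below is the same shape of computation a data warehouse or OLAP roll-up performs; it assumes Java 8+ for the Streams API, slightly newer than the JDK 1.7 baseline listed later.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Count purchases per category: a miniature "aggregation and statistics" step.
public class Aggregation {
    public static void main(String[] args) {
        List<String> purchases = Arrays.asList(
                "grocery", "electronics", "grocery", "clothing", "grocery");

        Map<String, Long> countPerCategory = purchases.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        System.out.println(countPerCategory); // e.g. {clothing=1, electronics=1, grocery=3}
    }
}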

How much data?

Google processes 20 PB a day (2008)
Wayback Machine has 3 PB + 100 TB/month (3/2009)
Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
CERN's Large Hadron Collider (LHC) generates 15 PB a year

Data has grown tremendously.
This large amount of data is beyond the ability of commonly used software tools to manage.
Exploring the large volume of data and extracting useful information and knowledge is a challenge, and sometimes it is almost infeasible.

Existing System
Big Data applications are rising where data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
The most fundamental challenge for Big Data applications is to
explore the large volumes of data and extract useful information or
knowledge for future actions.
In many situations, the knowledge extraction process has to be very
efficient and close to real time because storing all observed data is
nearly infeasible.
The unprecedented data volumes require an effective data analysis
and prediction platform to achieve fast response and real-time
classification for such Big Data.

Proposed System
HACE theorem to model Big Data characteristics.
The characteristics of HACE make it an extreme challenge to discover useful knowledge from Big Data.
Provide the most relevant and most accurate social sensing feedback.

HACE Theorem

H - Heterogeneous
A - Autonomous
C - Complex
E - Evolving

The HACE theorem suggests that the key characteristics of Big Data are:
huge, with heterogeneous and diverse data sources;
autonomous, with distributed and decentralized control;
complex and evolving in data and knowledge associations.

Proposed System: Big Data Characteristics (HACE Theorem)

Huge Data with Heterogeneous and Diverse Dimensionality
Autonomous Sources with Distributed and Decentralized Control
Complex and Evolving Relationships

Mining Architecture

System Architecture

System Modules

Integrating and mining bio-data


Big Data Fast Response
Pattern matching and mining
Key technologies for integration and mining
Group influence and interactions

IMPLEMENTATION DETAILS
Applications of the k-Means Clustering Algorithm
Determining the number of clusters in a set of data
Setting up the experiments
(A minimal k-means sketch follows.)
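
The slides list k-means among the implementation details but show no code. Below is a minimal Lloyd-style k-means sketch in Java on points stored as double[n][d]; the array layout, seeding strategy, and parameter names are illustrative choices, not taken from the paper. Choosing k itself (the "number of clusters" item above) is a separate step, e.g. comparing within-cluster error over candidate values of k.

import java.util.Arrays;
import java.util.Random;

// Minimal Lloyd-style k-means on d-dimensional points stored as double[n][d].
public class KMeans {

    public static double[][] cluster(double[][] points, int k, int maxIters, long seed) {
        int n = points.length, d = points[0].length;
        Random rng = new Random(seed);

        // Initialize centroids with k random data points (naive seeding; may pick duplicates).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rng.nextInt(n)].clone();
        }

        int[] assignment = new int[n];
        for (int iter = 0; iter < maxIters; iter++) {
            boolean changed = false;

            // Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
            for (int i = 0; i < n; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            if (!changed && iter > 0) break; // converged

            // Update step: move each centroid to the mean of its assigned points.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) centroids[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] data = { {1, 1}, {1.2, 0.8}, {0.9, 1.1}, {8, 8}, {8.2, 7.9}, {7.8, 8.1} };
        System.out.println(Arrays.deepToString(cluster(data, 2, 100, 42)));
    }
}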

Data Mining Challenges with Big Data


A Big Data processing framework with three tiers, from inside out:
Big Data Mining Platform (Tier I)
Big Data Semantics and Application Knowledge (Tier II)
Big Data Mining Algorithms (Tier III)

Data accessing and computing (Tier I)


Computing Platform
Data and computing processors located at different locations
Data Mining Tasks such as MapReduce (see the sketch below)
Data Volume
Support from Industrial Stakeholders
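
Tier I names MapReduce as the representative model for distributed data mining tasks. Below is a minimal sketch of a Hadoop MapReduce job, the standard word-count example rather than a mining algorithm from the paper, assuming a Hadoop 2.x client library on the classpath; a real mining task would swap the tokenizing mapper and summing reducer for feature extraction and partial-model aggregation.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be launched with something like "hadoop jar wordcount.jar WordCount /input /output" (paths illustrative); YARN, described later, schedules the map and reduce containers.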

Data privacy and domain knowledge (Tier II)


Semantics and Domain knowledge
Technical barriers to Big Data access
The Important Concepts are:
Information Sharing and Data Privacy
Domain and Application Knowledge

Big Data mining algorithms (Tier III)


Concentrates on algorithm design
Deals with problems raised by Big Data volumes, distributed data distributions, and complex and dynamic data characteristics
Important Concepts:
Local Learning and Model Fusion for Multiple Information Sources (see the fusion sketch below)
Mining from Sparse, Uncertain, and Incomplete Data
Mining Complex and Dynamic Data
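
The "local learning and model fusion" item boils down to: each autonomous source trains its own model, and only the models or their predictions travel to the global level. A minimal sketch follows, assuming a generic LocalModel interface and plain unweighted majority voting; both are illustrative stand-ins, not the fusion scheme defined in the paper.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local learning and model fusion in its simplest form: every autonomous
// source contributes a locally trained classifier, and the global decision
// is an unweighted majority vote over their predictions.
public class ModelFusion {

    // Illustrative stand-in for a model trained at one autonomous source.
    public interface LocalModel {
        String predict(double[] features);
    }

    public static String fuse(List<LocalModel> localModels, double[] features) {
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (LocalModel model : localModels) {
            String label = model.predict(features);
            Integer count = votes.get(label);
            votes.put(label, count == null ? 1 : count + 1);
        }

        // Return the label with the most votes (ties broken arbitrarily).
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> entry : votes.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}

Weighted voting, or a meta-learner trained on the local outputs, drops in at the same point without moving the raw data between sources.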

Hadoop 2.0

HDFS Architecture

YARN-Map Reduce

YARN - a general-purpose resource management system for Hadoop that allows MapReduce and other data processing frameworks and services to run
High Availability for HDFS
HDFS Federation
HDFS Snapshots
NFSv3 access to data in HDFS
Support for running Hadoop on Microsoft Windows
Binary compatibility for MapReduce applications built on hadoop-1.x
Substantial amount of integration testing with the rest of the projects in the ecosystem
(A minimal HDFS client read sketch follows.)
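
The HDFS architecture slide above is a diagram; as a concrete touchpoint, the sketch below reads a file through the HDFS Java client, assuming a Hadoop 2.x client on the classpath and an illustrative path (not taken from the slides). The NameNode resolves the path to block locations and the DataNodes serve the bytes, but the client code only sees an ordinary stream.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal HDFS read: open a path on the cluster's default file system
// and stream it line by line like any other input.
public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/input.txt"); // illustrative path

        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}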

Technologies Used

Operating System: Windows XP/7
Programming Language: Java
Software Version: JDK 1.7 or above
Database: MySQL
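
The stack above pairs Java with MySQL, which in practice means JDBC. A minimal sketch follows, assuming the MySQL Connector/J driver is on the classpath and using illustrative connection settings, table, and column names (none of these appear in the slides).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal JDBC access to MySQL: open a connection and run an aggregate query.
public class MiningResultsDao {
    public static void main(String[] args) throws Exception {
        // Older Connector/J versions may need Class.forName("com.mysql.jdbc.Driver") first.
        String url = "jdbc:mysql://localhost:3306/bigdata_demo"; // illustrative database
        try (Connection conn = DriverManager.getConnection(url, "demo_user", "demo_pass");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT cluster_id, COUNT(*) FROM mining_results GROUP BY cluster_id")) {
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}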

Expected Outputs/Results
Provide the most relevant and most accurate social sensing feedback to better understand our society in real time.
The reason for this architectural evolution is more efficient resource management, which fosters scalability of clusters beyond several thousand nodes.
At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes over time and other possible factors.

We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains.

Conclusion

The HACE theorem suggests that the key characteristics of Big Data are:
Huge with heterogeneous and diverse data sources,
Autonomous with distributed and decentralized control, and
Complex and evolving in data and knowledge associations.
Such combined characteristics suggest that Big Data requires a "big mind" to consolidate data for maximum values.

References
X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, Jan. 2014.
F. Michel, "How Many Photos Are Uploaded to Flickr Every Day and Month?" http://www.flickr.com/photos/franckmichel/6855169886/, 2012.
J. Mervis, "U.S. Science Policy: Agencies Rally to Tackle Big Data," Science, vol. 336, no. 6077, p. 22, 2012.
Nature Editorial, "Community Cleverness Required," Nature, vol. 455, no. 7209, p. 1, Sept. 2008.
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-Core and Multiprocessor Systems," Proc. IEEE 13th Int'l Symp. High Performance Computer Architecture (HPCA '07), pp. 13-24, 2007.
IBM, "What Is Big Data: Bring Big Data to the Enterprise," http://www01.ibm.com/software/data/bigdata/, 2012.

References cont..
A. Labrinidis and H. Jagadish, "Challenges and Opportunities with Big Data," Proc. VLDB Endowment, vol. 5, no. 12, pp. 2032-2033, 2012.
Twitter Blog, "Dispatch from the Denver Debate," http://blog.twitter.com/2012/10/dispatch-from-denver-debate.html, Oct. 2012.
A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.
C. Reed, D. Thompson, W. Majid, and K. Wagstaff, "Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision," Proc. Int'l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.
P. Dewdney, P. Hall, R. Schilizzi, and J. Lazio, "The Square Kilometre Array," Proc. IEEE, vol. 97, no. 8, pp. 1482-1496, Aug. 2009.
P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.

References Cont..
G. Duncan, "Privacy by Design," Science, vol. 317, pp. 1178-1179, 2007.
B. Efron, "Missing Data, Imputation, and the Bootstrap," J. Am. Statistical Assoc., vol. 89, no. 426, pp. 463-475, 1994.
A. Ghoting and E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics," Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.
D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," Berkeley, Dec. 2006.

List of Publications
Data Mining and Knowledge Discovery for Big Data in Health
Informatics
Big Data in Agriculture (Still in progress)

Thank You

Queries?
