Abstract
Big Data concern large-volume, complex, growing data sets
with multiple, autonomous sources. With the fast development
of networking, data storage, and the data collection capacity,
Big Data are now rapidly expanding in all science and
engineering domains, including physical, biological and
biomedical sciences. This paper presents a HACE theorem
that characterizes the features of the Big Data revolution, and
proposes a Big Data processing model, from the data mining
perspective. This data-driven model involves demand-driven
aggregation of information sources, mining and analysis, user
interest modelling, and security and privacy considerations. We
analyse the challenging issues in the data-driven model and
also in the Big Data revolution.
Data velocity per minute
1,500+ blog posts
7,000+ photos on Flickr
700,000+ Facebook updates
400,000+ tweets on Twitter
2 million+ Google search queries
400,000+ minutes of Skype calling
US$300,000+ spent on online shopping
200 million+ (label lost in extraction; the unnumbered labels are "videos on YouTube" and "emails sent")
Type of Data
Existing System
Big Data applications have emerged where data collection has grown
tremendously and is beyond the ability of commonly used software
tools to capture, manage, and process within a tolerable elapsed
time.
The most fundamental challenge for Big Data applications is to
explore the large volumes of data and extract useful information or
knowledge for future actions.
In many situations, the knowledge extraction process has to be very
efficient and close to real time because storing all observed data is
nearly infeasible.
The unprecedented data volumes require an effective data analysis
and prediction platform to achieve fast response and real-time
classification for such Big Data.
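The point about near-real-time extraction without storing every observation can be illustrated with a single-pass summary. The sketch below is only an illustration, not the platform these slides describe: it uses Welford's online algorithm to maintain a running mean and variance while keeping no past observations. The class name `RunningStats` and the sample values are hypothetical.

```java
/** Single-pass (streaming) mean and variance using Welford's algorithm. Illustrative only. */
final class RunningStats {
    private long count;   // number of observations seen so far
    private double mean;  // running mean
    private double m2;    // running sum of squared deviations from the mean

    /** Folds one observation into the summary; the observation itself is not stored. */
    void accept(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
    }

    long count() {
        return count;
    }

    double mean() {
        return mean;
    }

    /** Sample variance; defined as 0 when fewer than two observations have been seen. */
    double variance() {
        return count > 1 ? m2 / (count - 1) : 0.0;
    }

    public static void main(String[] args) {
        RunningStats stats = new RunningStats();
        // Hypothetical stream of readings, processed one at a time.
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.accept(x);
        }
        System.out.println("count=" + stats.count()
                + " mean=" + stats.mean()
                + " variance=" + stats.variance());
    }
}
```

Memory use stays constant regardless of how many observations arrive, which is the property the slide asks for when storing all observed data is infeasible.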
Proposed System
Use the HACE theorem to model Big Data characteristics.
The HACE characteristics make it extremely challenging to
discover useful knowledge from Big Data.
Provide the most relevant and accurate social sensing
feedback.
HACE Theorem
H - Heterogeneous
A - Autonomous
C - Complex
E - Evolving
Mining Architecture
System Architecture
System Modules
IMPLEMENTATION DETAILS
Applications of the k-Means Clustering Algorithm
Determining the number of clusters in a set of data
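As a minimal sketch of how k-means can be used to judge the number of clusters, the Java below runs Lloyd's algorithm for several values of k and reports the within-cluster sum of squares (WCSS) for each. Looking for the k where WCSS stops dropping sharply is the common "elbow" heuristic; the slides do not say which selection method was actually used, so treat that choice as an assumption. The class name `KMeansSketch`, the sample points, and the first-k-points initialization are all hypothetical.

```java
/** Minimal k-means (Lloyd's algorithm) sketch with WCSS for comparing values of k. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double dist2(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    /** Index of the centroid closest to p. */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (dist2(p, centroids[c]) < dist2(p, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs Lloyd's iterations; centroids start at the first k points (chosen here for determinism). */
    static double[][] fit(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach each point to its nearest centroid.
            for (double[] p : points) {
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    double updated = sums[c][d] / counts[c];
                    if (updated != centroids[c][d]) {
                        changed = true;
                    }
                    centroids[c][d] = updated;
                }
            }
            if (!changed) {
                break; // converged
            }
        }
        return centroids;
    }

    /** Within-cluster sum of squared distances for the given centroids. */
    static double wcss(double[][] points, double[][] centroids) {
        double total = 0.0;
        for (double[] p : points) {
            total += dist2(p, centroids[nearest(p, centroids)]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {9, 9}, {1, 2}, {9, 8}, {2, 1}, {8, 9} };
        for (int k = 1; k <= 3; k++) {
            System.out.println("k=" + k + " WCSS=" + wcss(pts, fit(pts, k, 100)));
        }
    }
}
```

In practice a randomized or k-means++ initialization with several restarts is usually preferred over taking the first k points; the fixed start is used here only so the sketch is reproducible.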
Setting up the experiments
Hadoop 2.0
HDFS Architecture
YARN-Map Reduce
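To show how a clustering step might be split into the phases that MapReduce provides, the sketch below expresses one k-means iteration as a map phase (key each point by its nearest centroid), a shuffle (group points by key), and a reduce phase (average each group). It is an in-memory analogy in plain Java and deliberately uses no Hadoop classes; the class name `MapReduceKMeansStep`, its method names, and the sample data are hypothetical, and the slides do not confirm that the experiments were structured this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** One k-means iteration written as map, shuffle, and reduce phases (in-memory analogy). */
final class MapReduceKMeansStep {

    /** Map phase: the key emitted for a point is the index of its nearest centroid. */
    static int mapKey(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** Reduce phase: the new centroid is the mean of all points sharing a key. */
    static double[] reduceMean(List<double[]> group) {
        double[] mean = new double[group.get(0).length];
        for (double[] p : group) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= group.size();
        }
        return mean;
    }

    /** Runs map, shuffle (group by key), and reduce once; returns the updated centroids by key. */
    static Map<Integer, double[]> step(double[][] points, double[][] centroids) {
        Map<Integer, List<double[]>> shuffled = new TreeMap<>();
        for (double[] p : points) {
            shuffled.computeIfAbsent(mapKey(p, centroids), k -> new ArrayList<>()).add(p);
        }
        Map<Integer, double[]> updated = new TreeMap<>();
        for (Map.Entry<Integer, List<double[]>> e : shuffled.entrySet()) {
            updated.put(e.getKey(), reduceMean(e.getValue()));
        }
        return updated;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {9, 9}, {1, 2}, {9, 8}, {2, 1}, {8, 9} };
        double[][] start = { {1, 1}, {9, 9} };
        step(pts, start).forEach((k, c) ->
                System.out.println("cluster " + k + " -> (" + c[0] + ", " + c[1] + ")"));
    }
}
```

On a real cluster the map and reduce functions would run on separate nodes over HDFS blocks, and a driver would repeat the step until the centroids stop moving.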
Technologies Used
Operating System: Windows XP/7
Java
MySQL
Expected Outputs/Results
Provide the most relevant and accurate social sensing feedback to better understand our society in real time.
The architectural evolution to Hadoop 2.0 with YARN is motivated by more efficient resource management, which supports
cluster scalability beyond several thousand nodes.
At the system level, the essential challenge is that a Big Data mining framework needs to consider
complex relationships between samples, models, and data sources, along with their evolving changes
with time and other possible factors.
We regard Big Data as an emerging trend and the need for Big Data mining is arising in all science and
engineering domains.
Conclusion
List of Publications
Data Mining and Knowledge Discovery for Big Data in Health Informatics
Big Data in Agriculture (in progress)
Thank You
Queries?