
Knowledge Mining and Big Data

Ari Visa 2015


Outline
• Definitions
• Backgrounds
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Major issues in data mining
Definitions
• Data: Facts and things certainly known. Data are any
facts, numbers, or text that can be processed by a
computer.
• Information: News and knowledge given. The patterns,
associations, or relationships among all this data can
provide information.
• Knowledge: Understanding, range of information,
familiarity gained by experience. Information can be
converted into knowledge about historical patterns and
future trends.
• Experience: Process of gaining knowledge or skill by
doing and seeing things.
Definitions
• Big data include data sets with sizes beyond the ability of
commonly used software tools to capture, curate, manage,
and process the data within a tolerable elapsed time [1].
• Limits on the size of data sets are a constantly moving
target, as of 2012 ranging from a few dozen terabytes to
many petabytes of data in a single data set.
• Doug Laney defined data growth challenges and
opportunities as being three-dimensional, i.e. increasing
volume (amount of data), velocity (speed of data in and
out), and variety (range of data types and sources).
[Laney, Douglas. "3D Data Management: Controlling Data
Volume, Velocity and Variety". Gartner, 6 February 2001]
Definitions
• Databases <-> On-line processing

• Data mining:
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) information or patterns from data in large databases
– Organizations integrate their various databases into
data warehouses. Data warehousing is defined as a
process of centralized data management and retrieval
(data capture, processing power, data transmission,
and storage capabilities).
• Knowledge mining (knowledge discovery in databases):
– Extraction of interesting (previously unknown and potentially useful)
models from data in large databases
Definitions
• Data stream: A sequence of digitally encoded
signals used to represent information in
transmission [Federal Standard 1037C data
stream].

• Data Stream Mining is the process of extracting
knowledge structures from continuous, rapid data records.
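
A minimal sketch of this idea in Java: a sliding-window frequency counter keeps an up-to-date summary (a simple "knowledge structure") over a continuous stream without storing the whole stream. The window size, event names, and class name are illustrative assumptions, not part of any particular stream-mining library.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class SlidingWindowCounts {
    private final int windowSize;
    private final Deque<String> window = new ArrayDeque<>();
    private final Map<String, Integer> counts = new HashMap<>();

    public SlidingWindowCounts(int windowSize) {
        this.windowSize = windowSize;
    }

    public void observe(String item) {
        window.addLast(item);
        counts.merge(item, 1, Integer::sum);
        if (window.size() > windowSize) {                 // evict the oldest record
            String oldest = window.removeFirst();
            counts.computeIfPresent(oldest, (k, c) -> c == 1 ? null : c - 1);
        }
    }

    public Map<String, Integer> currentCounts() {
        return counts;
    }

    public static void main(String[] args) {
        SlidingWindowCounts w = new SlidingWindowCounts(3);   // hypothetical window of 3 events
        for (String event : new String[]{"login", "click", "click", "buy", "click"}) {
            w.observe(event);
        }
        System.out.println(w.currentCounts());   // counts over the last 3 events, e.g. {buy=1, click=2}
    }
}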
Definitions
Multiples of bytes

Decimal
Value     Metric
1000^1    kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte

Binary
Value     JEDEC           IEC
1024^1    KB  kilobyte    KiB  kibibyte
1024^2    MB  megabyte    MiB  mebibyte
1024^3    GB  gigabyte    GiB  gibibyte
1024^4    TB  terabyte    TiB  tebibyte
1024^5    -               PiB  pebibyte
1024^6    -               EiB  exbibyte
1024^7    -               ZiB  zebibyte
1024^8    -               YiB  yobibyte
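
The decimal and binary prefixes drift apart as sizes grow; a small, purely illustrative Java check makes the gap concrete at the tera scale:

public class BytePrefixes {
    public static void main(String[] args) {
        long terabyte = 1_000_000_000_000L;   // decimal prefix: 1000^4
        long tebibyte = 1L << 40;             // binary prefix: 1024^4

        System.out.println("1 TB  = " + terabyte + " bytes");
        System.out.println("1 TiB = " + tebibyte + " bytes");
        // The binary unit is roughly 10% larger at the tera scale.
        System.out.printf("difference: %.1f%%%n",
                100.0 * (tebibyte - terabyte) / terabyte);
    }
}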
Background
• Big data are difficult to work with using most relational
database management systems and desktop statistics
and visualization packages, requiring instead
"massively parallel software running on tens,
hundreds, or even thousands of servers".
• The trend to larger data sets is due to the additional
information derivable from analysis of a single large set
of related data, as compared to separate smaller sets
with the same total amount of data, allowing
correlations to be found to "spot business trends,
determine quality of research, prevent diseases, link
legal citations, combat crime, and determine real-time
roadway traffic conditions."
Background- Evolution of Database
Technology
• 1960s: Data collection, database creation, IMS and
network DBMS
• 1970s: Relational data model, relational DBMS
implementation
• 1980s: RDBMS, advanced data models (extended-
relational, OO, deductive, etc.) and application-oriented
DBMS (spatial, scientific, engineering, etc.)
• 1990s—2000s: Data mining and data warehousing,
multimedia databases, and Web databases
• 2000s–2020s: Cloud-based distributed databases, Hadoop
Background
• Business Intelligence uses descriptive statistics on data
with high information density to measure things, detect
trends, etc.
• Big Data uses inductive statistics [4] on data with low
information density; the huge volume allows laws
(e.g. regressions) to be inferred, giving Big Data some
predictive capabilities (within the limits of inductive
reasoning).
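
As a hedged illustration of "inferring a law" from data, the sketch below fits an ordinary least-squares line in plain Java; the data points and variable names are invented for the example, not taken from any real data set.

public class LeastSquaresSketch {
    public static void main(String[] args) {
        // Hypothetical observations: x might be ad spend, y observed sales.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1};

        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        // Ordinary least squares: slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;

        // The fitted "law" can then be used (cautiously) to predict unseen values.
        System.out.printf("y ~ %.2f * x + %.2f%n", slope, intercept);
        System.out.printf("prediction for x = 6: %.2f%n", slope * 6 + intercept);
    }
}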
Background
• Big data requires exceptional technologies to
efficiently process large quantities of data
within tolerable elapsed times.
• Real or near-real time information delivery is
one of the defining characteristics of big data
analytics.
Background
• In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying.
Structured, semi-structured and/or unstructured data is stored and distributed across multiple servers. Data is
queried with a C++-derived language called ECL, which applies a schema-on-read approach to impose structure
on the stored data at query time.
• In 2004 LexisNexis acquired Seisint Inc., and in 2008 it acquired ChoicePoint, Inc. together with its high-speed
parallel processing platform. The two platforms were merged into HPCC Systems, which was open-sourced in
2011 under the Apache v2.0 License. Currently HPCC and the Quantcast File System are the only publicly
available platforms able to handle multiple exabytes of data.
• In 2004, Google published a paper on a process called MapReduce that used such an architecture. The
MapReduce framework provides a parallel processing model and an associated implementation to process huge
amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in
parallel (the Map step). The results are then gathered and delivered (the Reduce step); a sketch of this
map/reduce idea follows this list. The framework was very successful, so others wanted to replicate the
algorithm. An implementation of the MapReduce framework was therefore adopted by an Apache open-source
project named Hadoop.
• MIKE2.0 is an open approach to information management that, in an article titled "Big Data Solution Offering",
acknowledges the need for revisions because of big data implications. The methodology addresses handling big
data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting
(or modifying) individual records.
• Recent studies show that a multi-layer architecture is one option for dealing with big data. A distributed parallel
architecture distributes data across multiple processing units, and the parallel processing units deliver data
much faster by improving processing speeds. This type of architecture feeds data into a parallel DBMS, which
implements the use of the MapReduce and Hadoop frameworks. Such a framework aims to make the processing
power transparent to the end user by using a front-end application server.
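
The sketch referenced above: a minimal illustration of the map/shuffle/reduce idea using only plain Java parallel streams (no Hadoop). Each input line is mapped to words in parallel, and the grouped results are reduced into per-word counts. The input lines are invented for the example.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceIdea {
    public static void main(String[] args) {
        // Hypothetical input, standing in for records spread over many nodes.
        List<String> lines = Arrays.asList("big data", "data mining", "big big data");

        // Map step: each line is split into words, in parallel.
        // Shuffle + Reduce step: the pairs are grouped by word and the counts are summed.
        Map<String, Long> counts = lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        System.out.println(counts);   // e.g. {big=3, data=3, mining=1} (order may vary)
    }
}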
Definitions
• The Apache Hadoop software library is a framework
that allows for the distributed processing of large data
sets across clusters of computers using simple
programming models. It is designed to scale up from
single servers to thousands of machines, each offering
local computation and storage.
• Hadoop is a rapidly evolving ecosystem of components
for implementing the Google MapReduce algorithms
[3] in a scalable fashion on commodity hardware.
Hadoop enables users to store and process large
volumes of data and analyze it in ways not previously
possible with less scalable solutions or standard SQL-
based approaches.
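
A minimal sketch of what that "simple programming model" looks like in practice: a Java driver that configures and submits a word-count MapReduce job. It uses the stock Hadoop helper classes TokenCounterMapper and IntSumReducer; the jar name and the input/output paths supplied on the command line are assumptions of the example, not prescribed by Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up the cluster settings
        Job job = Job.getInstance(conf, "word count");     // one named MapReduce job
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenCounterMapper.class);      // library mapper: emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);         // optional local pre-aggregation
        job.setReducerClass(IntSumReducer.class);          // library reducer: sums the counts
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would be submitted with something like "hadoop jar wordcount.jar WordCountDriver <input dir> <output dir>", the jar name and paths being placeholders.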
Definitions
• Hadoop is a highly scalable compute and storage
platform. While most users will not initially
deploy servers numbered in the hundreds or
thousands, Dell recommends following the design
principles that drive large, hyper-scale
deployments. This ensures that as you start with
a small Hadoop environment, you can easily scale
that environment without rework to existing
servers, software, deployment strategies, and
network connectivity [2].
Definitions
• NameNode -The NameNode is the central location for information about the file system
deployed in a Hadoop environment. An environment can have one or two NameNodes,
configured to provide minimal redundancy between the NameNodes. The NameNode is
contacted by clients of the Hadoop Distributed File System (HDFS) to locate information
within the file system and provide updates for data they have added, moved, manipulated, or
deleted.
• DataNode – DataNodes make up the majority of the servers contained in a Hadoop
environment. Common Hadoop environments will have more than one DataNode, and
oftentimes they will number in the hundreds based on capacity and performance needs. The
DataNode serves two functions: It contains a portion of the data in the HDFS and it acts as a
compute platform for running jobs, some of which will utilize the local data within the HDFS.
• EdgeNode – The EdgeNode is the access point for the external applications, tools, and
users that need to utilize the Hadoop environment. The EdgeNode sits between the Hadoop
cluster and the corporate network to provide access control, policy enforcement, logging,
and gateway services to the Hadoop environment. A typical Hadoop environment will have a
minimum of one EdgeNode and more based on performance needs.
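
To make the NameNode/DataNode split concrete, here is a hedged sketch of an HDFS client in Java: listing files is a metadata operation answered by the NameNode, while opening a file streams its blocks from the DataNodes that hold them. The host name and the /data/logs paths are hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points at the NameNode; block reads/writes then go to DataNodes.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");   // hypothetical host

        try (FileSystem fs = FileSystem.get(conf)) {
            // Metadata operations (listing, locating files) are answered by the NameNode.
            for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }

            // Reading streams the file's blocks directly from the DataNodes holding them.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/data/logs/part-00000")),
                                          StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}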
Definitions
• A MapReduce program is composed of a Map()
procedure that performs filtering and sorting and
a Reduce() procedure that performs a summary
operation. The "MapReduce System" orchestrates
the processing by marshalling the distributed
servers, running the various tasks in parallel,
managing all communications and data transfers
between the various parts of the system, and
providing for redundancy and fault tolerance.
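
A sketch of the two procedures for the classic word-count example, written against the Hadoop MapReduce API (the class names WordCountMapper and WordCountReducer are our own): Map() emits (word, 1) pairs and Reduce() performs the summary sum, while the framework handles the sorting, shuffling, and fault tolerance described above. Classes like these would be wired into a job by a driver like the one sketched earlier in these notes.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): emit (word, 1) for every token in an input line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // the framework sorts and groups these pairs by key
        }
    }
}

// Reduce(): sum the counts gathered for each word (the summary operation).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}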
