
Unit-III

Cluster Computing
Introduction
A computer cluster consists of a set of loosely connected computers that work together so that
in many respects they can be viewed as a single system.
The components of a cluster are usually connected to each other through fast local area networks,
each node (computer used as a server) running its own instance of an operating system.
Computer clusters emerged as a result of convergence of a number of computing trends
including the availability of low cost microprocessors, high speed networks, and software for
high performance distributed computing.
Clusters are usually deployed to improve performance and availability over that of a single
computer, while typically being much more cost-effective than single computers of comparable
speed or availability.
Computer clusters have a wide range of applicability and deployment, ranging from small
business clusters with a handful of nodes to some of the fastest supercomputers in the world.

Basic concepts
The desire to get more computing horsepower and better reliability by orchestrating a
number of low cost commercial off-the-shelf computers has given rise to a variety of
architectures and configurations.
The computer clustering approach usually (but not always) connects a number of readily
available computing nodes (e.g. personal computers used as servers) via a fast local area
network. The activities of the computing nodes are orchestrated by "clustering middleware", a
software layer that sits atop the nodes and allows the users to treat the cluster as by and large one
cohesive computing unit, e.g. via a single system image concept.
Computer clustering relies on a centralized management approach which makes the nodes
available as orchestrated shared servers. It is distinct from other approaches such as peer to peer
or grid computing which also use many nodes, but with a far more distributed nature.
A computer cluster may be a simple two-node system which just connects two personal
computers, or may be a very fast supercomputer. A basic approach to building a cluster is that of
a Beowulf cluster which may be built with a few personal computers to produce a cost-effective
alternative to traditional high performance computing. An early project that showed the viability
of the concept was the 133-node Stone Soupercomputer.[4] The developers used Linux, the
Parallel Virtual Machine toolkit and the Message Passing Interface library to achieve high
performance at a relatively low cost.
Although a cluster may consist of just a few personal computers connected by a simple
network, the cluster architecture may also be used to achieve very high levels of performance.
The TOP500 organization's semiannual list of the 500 fastest supercomputers often includes
many clusters, e.g. the world's fastest machine in 2011 was the K computer, which has a
distributed-memory cluster architecture.[6][7]

Attributes of clusters
Computer clusters may be configured for different purposes ranging from general purpose
business needs such as web-service support, to computation-intensive scientific calculations. In
either case, the cluster may use a high-availability approach. Note that the attributes described
below are not exclusive and a "compute cluster" may also use a high-availability approach, etc.

[Figure: A load-balancing cluster with two servers and four user stations]


"Load-balancing" clusters are configurations in which cluster-nodes share computational
workload to provide better overall performance. For example, a web server cluster may assign
different queries to different nodes, so the overall response time will be optimized. However,
approaches to load-balancing may differ significantly among applications, e.g. a high-performance cluster used for scientific computations would balance load with different
algorithms from a web-server cluster, which may just use a simple round-robin method,
assigning each new request to a different node.
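
To illustrate the round-robin method just mentioned, the following Python sketch cycles incoming requests across a fixed list of nodes; the node names and request count are purely illustrative.

```python
# A minimal sketch of round-robin load balancing, assuming a fixed list of
# hypothetical node names; real load balancers also track node health and load.
from itertools import cycle

nodes = ["node-1", "node-2", "node-3"]      # hypothetical cluster nodes
next_node = cycle(nodes)                    # endless round-robin iterator

def dispatch(request_id):
    """Assign each new request to the next node in the rotation."""
    target = next(next_node)
    print(f"request {request_id} -> {target}")
    return target

for rid in range(7):                        # seven example requests
    dispatch(rid)
```
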
"Computer clusters" are used for computation-intensive purposes, rather than handling IOoriented operations such as web service or databases.For instance, a computer cluster might
support computational simulations of weather or vehicle crashes. Very tightly coupled computer
clusters are designed for work that may approach "supercomputing".
"High-availability clusters" (also known as failover clusters, or HA clusters) improve the
availability of the cluster approach. They operate by having redundant nodes, which are then
used to provide service when system components fail. HA cluster implementations attempt to use
redundancy of cluster components to eliminate single points of failure. There are commercial
implementations of High-Availability clusters for many operating systems. The Linux-HA
project is one commonly used free software HA package for the Linux operating system.

Design and configuration


One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For
instance, a single computer job may require frequent communication among nodes: this implies
that the cluster shares a dedicated network, is densely located, and probably has homogeneous
nodes. The other extreme is where a computer job uses one or few nodes, and needs little or no
inter-node communication, approaching grid computing.
For instance, in a Beowulf system, the application programs never see the computational
nodes (also called slave computers) but only interact with the "Master" which is a specific
computer handling the scheduling and management of the slaves. In a typical implementation the
Master has two network interfaces, one that communicates with the private Beowulf network for
the slaves, the other for the general purpose network of the organization. The slave computers
typically have their own version of the same operating system, and local memory and disk space.
However, the private slave network may also have a large and shared file server that stores
global persistent data, accessed by the slaves as needed.
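
As a rough single-machine analogy of this master/slave division of labour, the Python sketch below has a "master" process farm independent jobs out to a pool of worker processes and collect the results; the job function and pool size are illustrative assumptions, not part of any Beowulf distribution.

```python
# Single-machine analogy of the Beowulf master/worker pattern: the master
# schedules independent jobs onto workers and collects the results.
from multiprocessing import Pool

def run_job(job_id):
    """Stand-in for a compute job that a slave node would run."""
    return job_id, sum(i * i for i in range(100_000))

if __name__ == "__main__":
    jobs = range(8)                       # eight independent jobs
    with Pool(processes=4) as workers:    # four "compute nodes"
        for job_id, result in workers.map(run_job, jobs):
            print(f"job {job_id} finished with result {result}")
```
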

By contrast, the special-purpose 144-node DEGIMA cluster is tuned to running astrophysical
N-body simulations using the Multiple-Walk parallel treecode, rather than general purpose
scientific computations.
Due to the increasing computing power of each generation of game consoles, a novel use has
emerged where they are repurposed into High-performance computing (HPC) clusters. Some
examples of game console clusters are Sony PlayStation clusters and Microsoft Xbox clusters.
Another example built from consumer hardware is the Nvidia Tesla Personal Supercomputer
workstation, which uses multiple graphics accelerator processor chips.
Computer clusters have historically run on separate physical computers with the same
operating system. With the advent of virtualization, the cluster nodes may run on separate
physical computers with different operating systems that are overlaid with a virtualization layer
to appear similar. The cluster may also be virtualized in various configurations as maintenance
takes place. An example implementation is Xen as the virtualization manager with Linux-HA.[11]

Cluster Network Computing Architecture


The topic of clustering draws a fair amount of interest from the computer networking
community, but the concept is a generic one. To use a non-technical analogy: saying you are
interested in clustering is like saying you are interested in food. Does your interest lie in
cooking? In eating? Perhaps you are a farmer interested primarily in growing food, or a
restaurant critic, or a nutritionist. As with food, a good explanation of clustering depends on your
situation.
The computing world uses the term "clustering" in at least two distinct ways. For one,
"clustering" or "cluster analysis" refers to algorithmic approaches of determining similarity
among objects. This kind of clustering might be very appealing if you like math... but this is not
the sort of clustering we mean in the networking sense.

What Is Cluster Computing?


In a nutshell, network clustering connects otherwise independent computers to work together
in some coordinated fashion. Because clustering is a term used broadly, the hardware
configuration of clusters varies substantially depending on the networking technologies chosen
and the purpose (the so-called "computational mission") of the system. Clustering hardware
comes in three basic flavors: so-called "shared disk," "mirrored disk," and "shared nothing"
configurations.

Shared Disk Clusters


One approach to clustering utilizes central I/O devices accessible to all computers ("nodes")
within the cluster. We call these systems shared-disk clusters as the I/O involved is typically disk
storage for normal files and/or databases. Shared-disk cluster technologies include Oracle
Parallel Server (OPS) and IBM's HACMP.
Shared-disk clusters rely on a common I/O bus for disk access but do not require shared
memory. Because all nodes may concurrently write to or cache data from the central disks, a
synchronization mechanism must be used to preserve coherence of the system. An independent
piece of cluster software called the "distributed lock manager" assumes this role.
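
The lock manager's job can be illustrated with a deliberately simplified, single-process sketch in Python; the ToyLockManager class is a hypothetical stand-in that only shows how writes to a shared block are serialized, not how a real distributed lock manager coordinates nodes over the network.

```python
# Simplified stand-in for a distributed lock manager (DLM): before a node may
# write or cache a shared disk block, it must hold that block's lock.
import threading
from collections import defaultdict

class ToyLockManager:
    """Hypothetical in-process sketch; a real DLM coordinates locks over the network."""
    def __init__(self):
        self._locks = defaultdict(threading.Lock)   # one lock per resource name

    def acquire(self, resource):
        self._locks[resource].acquire()

    def release(self, resource):
        self._locks[resource].release()

dlm = ToyLockManager()
shared_block = []   # stands in for a block on the shared disk

def node_write(node, value):
    dlm.acquire("block-42")          # coherence: only one writer at a time
    try:
        shared_block.append((node, value))
    finally:
        dlm.release("block-42")

threads = [threading.Thread(target=node_write, args=(f"node-{i}", i)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared_block)
```
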
Shared-disk clusters support higher levels of system availability: if one node fails, other nodes
need not be affected. However, higher availability comes at a cost of somewhat reduced
performance in these systems because of overhead in using a lock manager and the potential
bottlenecks of shared hardware generally. Shared-disk clusters make up for this shortcoming
with relatively good scaling properties: OPS and HACMP support eight-node systems.

Shared Nothing Clusters


A second approach to clustering is dubbed shared-nothing because it does not involve concurrent
disk accesses from multiple nodes. (In other words, these clusters do not require a distributed
lock manager.) Shared-nothing cluster solutions include Microsoft Cluster Server (MSCS).
MSCS is an atypical example of a shared nothing cluster in several ways. MSCS clusters use a
shared SCSI connection between the nodes, which naturally leads some people to believe this is a
shared-disk solution. But only one server (the one that owns the quorum resource) needs the
disks at any given time, so no concurrent data access occurs. MSCS clusters also typically
include only two nodes, whereas shared nothing clusters in general can scale to hundreds of
nodes.

Mirrored Disk Clusters


Mirrored-disk cluster solutions include Legato's Vinca. Mirroring involves replicating all
application data from primary storage to a secondary backup (perhaps at a remote location) for
availability purposes. Replication occurs while the primary system is active, although the
mirrored backup system -- as in the case of Vinca -- typically does not perform any work outside
of its role as a passive standby. If a failure occurs in the primary system, a failover process
transfers control to the secondary system. Failover can take some time, and applications can lose
state information when they are reset, but mirroring enables a fairly fast recovery scheme
requiring little operator intervention. Mirrored-disk clusters typically include just two nodes.
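
The mirror-then-failover idea can be sketched as follows in Python; the file paths stand in for primary and secondary volumes and the failure flag is set by hand, whereas real products such as Vinca replicate at the block or volume level and detect failures automatically.

```python
# Sketch of mirrored writes with a passive standby: every write goes to the
# primary and to the mirror; if the primary fails, reads fall back to the mirror.
class MirroredStore:
    def __init__(self, primary_path, mirror_path):
        self.primary_path = primary_path    # hypothetical paths standing in for volumes
        self.mirror_path = mirror_path
        self.primary_alive = True

    def write(self, record):
        for path in (self.primary_path, self.mirror_path):
            with open(path, "a") as f:      # replicate the record to both copies
                f.write(record + "\n")

    def read_all(self):
        path = self.primary_path if self.primary_alive else self.mirror_path
        with open(path) as f:
            return f.read().splitlines()

store = MirroredStore("primary.log", "mirror.log")
store.write("balance=100")
store.primary_alive = False                 # simulate a primary failure
print(store.read_all())                     # failover: served from the mirror
```
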

Data sharing and communication


Data sharing
As the early computer clusters were appearing during the 1970s, so were supercomputers. One
of the elements that distinguished the two classes at that time was that the early supercomputers
relied on shared memory. To date, clusters do not typically use physically shared memory, while
many supercomputer architectures have also abandoned it.
However, the use of a clustered file system is essential in modern computer clusters.
Examples include the IBM General Parallel File System, Microsoft's Cluster Shared Volumes or
the Oracle Cluster File System.

Message passing and communication


Two widely used approaches for communication between cluster nodes are MPI, the Message
Passing Interface and PVM, the Parallel Virtual Machine.
PVM was developed at the Oak Ridge National Laboratory around 1989 before MPI was
available. PVM must be directly installed on every cluster node and provides a set of software
libraries that present the nodes as a single "parallel virtual machine". PVM provides a run-time
environment for message-passing, task and resource management, and fault notification. PVM
can be used by user programs written in C, C++, or Fortran, etc.

MPI emerged in the early 1990s out of discussions between 40 organizations. The initial effort
was supported by ARPA and the National Science Foundation. Rather than starting anew, the design
of MPI drew on various features available in commercial systems of the time. The MPI
specifications then gave rise to specific implementations. MPI implementations typically use
TCP/IP and socket connections. MPI is now a widely available communications model that
enables parallel programs to be written in languages such as C, Fortran, Python, etc. Thus, unlike
PVM which provides a concrete implementation, MPI is a specification which has been
implemented in systems such as MPICH and Open MPI.
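
A minimal MPI example in Python, using the mpi4py binding over an installed MPI implementation such as MPICH or Open MPI, looks like this (assuming it is launched with something like `mpiexec -n 4 python hello_mpi.py`):

```python
# Minimal point-to-point message passing with MPI via the mpi4py binding.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id within the cluster job
size = comm.Get_size()          # total number of MPI processes

if rank == 0:
    # Rank 0 acts as the sender; every other rank receives a greeting.
    for dest in range(1, size):
        comm.send(f"hello from rank 0 to rank {dest}", dest=dest, tag=0)
else:
    msg = comm.recv(source=0, tag=0)
    print(f"rank {rank} received: {msg}")
```
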

Cluster management
One of the challenges in the use of a computer cluster is the cost of administering it, which can at
times be as high as the cost of administering N independent machines if the cluster has N
nodes. In some cases this gives an advantage to shared-memory architectures, which have lower
administration costs. It has also made virtual machines popular, due to their ease of
administration.

Task scheduling
When a large multi-user cluster needs to access very large amounts of data, task scheduling
becomes a challenge. The MapReduce approach was proposed by Google in 2004, and frameworks
such as Hadoop have since implemented it.
However, given that in a complex application environment the performance of each job
depends on the characteristics of the underlying cluster, mapping tasks onto CPU cores and GPU
devices provides significant challenges.[17] This is an area of ongoing research and algorithms
that combine and extend MapReduce and Hadoop have been proposed and studied. [17]
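
The MapReduce model itself is easy to sketch: map each input record to key/value pairs, group the pairs by key, and reduce each group. The customary word-count illustration below is written in plain Python, with no cluster framework involved.

```python
# Minimal in-memory MapReduce-style word count: map, shuffle (group by key), reduce.
from collections import defaultdict

documents = ["the cat sat", "the cat ran", "a dog sat"]   # toy input split across "nodes"

# Map phase: emit (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'the': 2, 'cat': 2, 'sat': 2, 'ran': 1, 'a': 1, 'dog': 1}
```
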

Node failure management


When a node in a cluster fails, strategies such as "fencing" may be employed to keep the rest of
the system operational. Fencing is the process of isolating a node or protecting shared resources
when a node appears to be malfunctioning. There are two classes of fencing methods; one
disables a node itself, and the other disallows access to resources such as shared disks.
The STONITH method stands for "Shoot The Other Node In The Head", meaning that the
suspected node is disabled or powered off. For instance, power fencing uses a power controller to
turn off an inoperable node.
The resources fencing approach disallows access to resources without powering off the node.
This may include persistent reservation fencing via SCSI-3, Fibre Channel fencing to disable the
Fibre Channel port, or global network block device (GNBD) fencing to disable access to the
GNBD server.
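
The control flow of power fencing can be sketched as follows; the PowerController class and the ten-second timeout are hypothetical stand-ins for a real fencing agent and its configuration, so this is an illustration of the idea rather than production fencing logic.

```python
# Sketch of STONITH-style power fencing: if a node stops answering heartbeats,
# the cluster manager powers it off before its resources are taken over elsewhere.
import time

class PowerController:
    """Hypothetical stand-in for a networked power switch / fencing agent."""
    def power_off(self, node):
        print(f"fencing: powering off {node}")

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before fencing (assumed value)

def check_and_fence(last_heartbeat, node, controller):
    """Fence the node if its heartbeat is stale; report whether it was fenced."""
    if time.time() - last_heartbeat > HEARTBEAT_TIMEOUT:
        controller.power_off(node)      # disable the suspect node (STONITH)
        return True
    return False

controller = PowerController()
stale = time.time() - 30                # heartbeat last seen 30 s ago
print(check_and_fence(stale, "node-3", controller))   # True: node-3 gets fenced
```
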

Software development and administration


Parallel programming
Load balancing clusters such as web servers use cluster architectures to support a large
number of users and typically each user request is routed to a specific node, achieving task
parallelism without multi-node cooperation, given that the main goal of the system is providing
rapid user access to shared data. However, "computer clusters" which perform complex
computations for a small number of users need to take advantage of the parallel processing
capabilities of the cluster and partition "the same computation" among several nodes. [20]

Automatic parallelization of programs remains a technical challenge, but parallel programming
models can be used to achieve a higher degree of parallelism via the simultaneous execution of
separate portions of a program on different processors.[20][21]

Debugging and monitoring


The development and debugging of parallel programs on a cluster requires parallel language
primitives as well as suitable tools such as those discussed by the High Performance Debugging
Forum (HPDF) which resulted in the HPD specifications.[13][22] Tools such as TotalView were
then developed to debug parallel implementations on computer clusters which use MPI or PVM
for message passing.
The Berkeley NOW (Network of Workstations) system gathers cluster data and stores them in a
database, while a system such as PARMON, developed in India, allows for the visual
observation and management of large clusters.[13]
Application checkpointing can be used to restore a given state of the system when a node fails
during a long multi-node computation.[23] This is essential in large clusters, given that as the
number of nodes increases, so does the likelihood of node failure under heavy computational
loads. Checkpointing can restore the system to a stable state so that processing can resume
without having to recompute results.[23]
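
At the application level, a checkpoint can be as simple as periodically serializing the loop state to disk, as in the Python sketch below; the file name and checkpoint interval are illustrative, and coordinated multi-node checkpointing is considerably more involved.

```python
# Sketch of application checkpointing: periodically save loop state to disk so a
# restarted job can resume from the last checkpoint instead of recomputing.
import os
import pickle

CHECKPOINT = "state.pkl"   # assumed checkpoint file name

def load_state():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)          # resume from the saved (step, total)
    return 0, 0                            # fresh start

def save_state(step, total):
    with open(CHECKPOINT, "wb") as f:
        pickle.dump((step, total), f)

step, total = load_state()
while step < 1_000_000:
    total += step
    step += 1
    if step % 100_000 == 0:                # checkpoint every 100k iterations
        save_state(step, total)
print(total)
```
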

Price performance
Clustering can provide significant performance benefits versus price. The System X
supercomputer at Virginia Tech, the 28th most powerful supercomputer on Earth as of June
2006, is a 12.25 TFlops computer cluster of 1100 Apple XServe G5 2.3 GHz dual-processor
machines (4 GB RAM, 80 GB SATA HD) running Mac OS X and using InfiniBand
interconnect. The cluster initially consisted of Power Mac G5s; the rack-mountable XServes are
denser than desktop Macs, reducing the aggregate size of the cluster. The total cost of the
previous Power Mac system was $5.2 million, a tenth of the cost of slower mainframe
supercomputers. (The Power Mac G5s were sold off.)

A Cluster Computer and its Architecture


A cluster is a type of parallel or distributed processing system, which consists of a collection of
interconnected stand-alone computers working together as a single, integrated computing
resource.
A computer node can be a single or multiprocessor system (PCs, workstations, or SMPs) with
memory, I/O facilities, and an operating system. A cluster generally refers to two or more
computers (nodes) connected together. The nodes can exist in a single cabinet or be physically
separated and connected via a LAN. An inter-connected (LAN-based) cluster of computers can
appear as a single system to users and applications. Such a system can provide a cost-effective
way to gain features and benefits (fast and reliable services) that have historically been found
only on more expensive proprietary shared memory systems. The typical architecture of a cluster
is built from the components listed below.

The following are some prominent components of cluster computers:


Multiple High Performance Computers (PCs, Workstations, or SMPs)
State-of-the-art Operating Systems (Layered or Micro-kernel based)
High Performance Networks/Switches (such as Gigabit Ethernet and Myrinet)
Network Interface Cards (NICs)
Fast Communication Protocols and Services (such as Active and Fast Messages)
Cluster Middleware (Single System Image (SSI) and System Availability Infrastructure)
    o Hardware (such as Digital (DEC) Memory Channel, hardware DSM, and SMP techniques)
    o Operating System Kernel or Gluing Layer (such as Solaris MC and GLUnix)
    o Applications and Subsystems
        Applications (such as system management tools and electronic forms)
        Runtime Systems (such as software DSM and parallel file system)
        Resource Management and Scheduling software (such as LSF (Load Sharing Facility) and CODINE (COmputing in DIstributed Networked Environments))
Parallel Programming Environments and Tools (such as compilers, PVM (Parallel Virtual Machine), and MPI (Message Passing Interface))
Applications
    o Sequential
    o Parallel or Distributed
The network interface hardware acts as a communication processor and is responsible for
transmitting and receiving packets of data between cluster nodes via a network/switch.
Communication software offers a means of fast and reliable data communication among cluster
nodes and to the outside world. Often, clusters with a special network/switch like Myrinet use
communication protocols such as active messages for fast communication among their nodes.
These protocols potentially bypass the operating system and thus remove the critical
communication overheads, providing direct user-level access to the network interface.
The cluster nodes can work collectively, as an integrated computing resource, or they can operate
as individual computers. The cluster middleware is responsible for offering an illusion of a
unified system image (single system image) and availability out of a collection of independent
but interconnected computers.
Programming environments can offer portable, efficient, and easy-to-use tools for development of
applications. They include message passing libraries, debuggers, and profilers. It should not be
forgotten that clusters could be used for the execution of sequential or parallel applications.

Clusters Classifications
Clusters offer the following features at a relatively low cost:
High Performance
Expandability and Scalability
High Throughput
High Availability
Cluster technology permits organizations to boost their processing power using standard
technology (commodity hardware and software components) that can be acquired/purchased at a
relatively low cost. This provides expandability: an affordable upgrade path that lets organizations
increase their computing power while preserving their existing investment and without incurring
a lot of extra expense. The performance of applications also improves with the support of a
scalable software environment. Another benefit of clustering is a failover capability that allows
a backup computer to take over the tasks of a failed computer located in its cluster.
Clusters are classified into many categories based on various factors as indicated below.
1. Application Target - Computational science or mission-critical applications.
High Performance (HP) Clusters
High Availability (HA) Clusters
2. Node Ownership - Owned by an individual or dedicated as a cluster node.
Dedicated Clusters
Nondedicated Clusters
The distinction between these two cases is based on the ownership of the nodes in a cluster. In
the case of dedicated clusters, a particular individual does not own a workstation; the resources
are shared so that parallel computing can be performed across the entire cluster. The alternative
nondedicated case is where individuals own workstations and applications are executed by
stealing idle CPU cycles. The motivation for this scenario is based on the fact that most
workstation CPU cycles are unused, even during peak hours. Parallel computing on a
dynamically changing set of nondedicated workstations is called adaptive parallel computing.
In nondedicated clusters, a tension exists between the workstation owners and remote users who
need the workstations to run their application. The former expects fast interactive response from
their workstation, while the latter is only concerned with fast application turnaround by utilizing
any spare CPU cycles. This emphasis on sharing the processing resources erodes the concept of
node ownership and introduces the need for complexities such as process migration and load
balancing strategies. Such strategies allow clusters to deliver adequate interactive performance as
well as to provide shared resources to demanding sequential and parallel applications.
3. Node Hardware - PC, Workstation, or SMP.
Clusters of PCs (CoPs) or Piles of PCs (PoPs)
Clusters of Workstations (COWs)
Clusters of SMPs (CLUMPs)

4. Node Operating System - Linux, NT, Solaris, AIX, etc.


Linux Clusters (e.g., Beowulf)
Solaris Clusters (e.g., Berkeley NOW)
NT Clusters (e.g., HPVM)
AIX Clusters (e.g., IBM SP2)
Digital VMS Clusters
HP-UX clusters.
Microsoft Wolfpack clusters.
5. Node Configuration - Node architecture and type of OS it is loaded with.
Homogeneous Clusters: All nodes will have similar architectures and run the same OSs.
Heterogeneous Clusters: All nodes will have different architectures and run different OSs.
6. Levels of Clustering - Based on location of nodes and their count.
Group Clusters (#nodes: 2-99): Nodes are connected by SANs (System Area Networks)
like Myrinet and they are either stacked into a frame or exist within a center.
Departmental Clusters (#nodes: 10s to 100s)
Organizational Clusters (#nodes: many 100s)
National Metacomputers (WAN/Internet-based): (#nodes: many
departmental/organizational systems or clusters)
International Metacomputers (Internet-based): (#nodes: 1000s to many millions)
Individual clusters may be interconnected to form a larger system (clusters of clusters) and, in
fact, the Internet itself can be used as a computing cluster. The use of wide-area networks of
computer resources for high performance computing has led to the emergence of a new field
called Metacomputing.

Commodity Components for Clusters


The improvements in workstation and network performance, as well as the availability of
standardized programming APIs, are paving the way for the widespread usage of cluster-based
parallel systems. In this section, we discuss some of the hardware and software components
commonly used to build clusters and nodes.

Processors
Over the past two decades, phenomenal progress has taken place in microprocessor
architecture (for example RISC, CISC, VLIW, and Vector) and this is making the single-chip
CPUs almost as powerful as processors used in supercomputers. Most recently researchers have
been trying to integrate processor and memory or network interface into a single chip. The
Berkeley Intelligent RAM (IRAM) project is exploring the entire spectrum of issues involved in
designing general purpose computer systems that integrate a processor and DRAM onto a single
chip, from circuits, VLSI design, and architectures to compilers and operating systems. Digital,
with its Alpha 21364 processor, is trying to integrate processing, memory controller, and
network interface into a single chip.

Memory and Cache


Originally, the memory present within a PC was 640 KBytes, usually 'hardwired' onto the
motherboard. Typically, a PC today is delivered with between 32 and 64 MBytes installed in
slots with each slot holding a Standard Industry Memory Module (SIMM); the potential capacity
of a PC is now many hundreds of MBytes. Computer systems can use various types of memory,
and they include Extended Data Out (EDO) and fast page. EDO allows the next access to begin
while the previous data is still being read, and fast page allows multiple adjacent accesses to be
made more efficiently. The amount of memory needed for the cluster is likely to be determined
by the cluster target applications. Programs that are parallelized should be distributed
such that the memory, as well as the processing, is distributed between processors for scalability.
Thus, it is not necessary to have a RAM that can hold the entire problem in memory on each
system, but it should be enough to avoid the occurrence of too much swapping of memory blocks
(page-misses) to disk, since disk access has a large impact on performance.
Access to DRAM is extremely slow compared to the speed of the processor, taking up to orders
of magnitude more time than a CPU clock cycle. Caches are used to keep recently used blocks of
memory for very fast access if the CPU references a word from that block again. However, the
very fast memory used for cache is expensive and cache control circuitry becomes more complex
as the size of the cache grows. Because of these limitations, the total size of a cache is usually in
the range of 8KB to 2MB. Within Pentium-based machines it is not uncommon to have a 64-bit
wide memory bus as well as a chip set that supports 2 MBytes of external cache. These
improvements were necessary to exploit the full power of the Pentium and to make the memory
architecture very similar to that of UNIX workstations.

Disk and I/O


Improvements in disk access time have not kept pace with microprocessor performance, which
has been improving by 50 percent or more per year. Although magnetic media densities have
increased, reducing disk transfer times by approximately 60 to 80 percent per year, overall
improvement in disk access times, which rely upon advances in mechanical systems, has been
less than 10 percent per year. Grand challenge applications often need to process large amounts
of data and data sets. Amdahl's law implies that the speed-up obtained from faster processors is
limited by the slowest system component; therefore, it is necessary to improve I/O performance
such that it balances with CPU performance. One way of improving I/O performance is to carry
out I/O operations in parallel, which is supported by parallel file systems based on hardware or
software RAID. Since hardware RAIDs can be expensive, software RAIDs can be constructed by
using disks associated with each workstation in the cluster.
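
Amdahl's argument can be made concrete: if a fraction f of the run time is spent in I/O that is not improved, then no matter how much faster the processors become, the overall speed-up cannot exceed 1/f. The short Python sketch below evaluates this bound for an assumed 20 percent I/O fraction.

```python
# Amdahl's law: overall speed-up when only the compute fraction is accelerated.
def amdahl_speedup(io_fraction, cpu_speedup):
    return 1.0 / (io_fraction + (1.0 - io_fraction) / cpu_speedup)

# Example: 20% of the time in I/O; even a 100x faster CPU gives less than 5x overall.
print(amdahl_speedup(0.20, 100))   # about 4.8
```
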
System Bus
The initial PC bus (AT, now known as the ISA bus) was clocked at 5 MHz and was 8 bits
wide. When first introduced, its abilities were well matched to the rest of the system. PCs are
modular systems and until fairly recently only the processor and memory were located on the
motherboard, other components were typically found on daughter cards connected via a system
bus. The performance of PCs has increased by orders of magnitude since the ISA bus was first
used, and it has consequently become a bottleneck, which has limited the machine throughput.
The ISA bus was extended to be 16 bits wide and was clocked in excess of 13 MHz. This,
however, is still not sufficient to meet the demands of the latest CPUs, disk interfaces, and other
peripherals. A group of PC manufacturers introduced the VESA local bus, a 32-bit bus that
matched the system's clock speed. The VESA bus has largely been superseded by the Intel-created PCI bus, which allows 133 Mbytes/s transfers and is used inside Pentium-based PCs. PCI
has also been adopted for use in non-Intel based platforms such as the Digital AlphaServer range.
This has further blurred the distinction between PCs and workstations, as the I/O subsystem of a
workstation may be built from commodity interface and interconnect cards.

Cluster Applications
Clusters have been employed as an execution platform for a range of application classes, ranging
from supercomputing and mission-critical ones, through to e-commerce and database-based ones.
Clusters are being used as execution environments for Grand Challenge Applications (GCAs)
[57] such as weather modeling, automobile crash simulations, life sciences, computational fluid
dynamics, nuclear simulations, image processing, electromagnetics, data mining, aerodynamics
and astrophysics. These applications are generally considered intractable without the use of state-of-the-art parallel supercomputers. The scale of their resource requirements, such as processing
time, memory, and communication needs distinguishes GCAs from other applications. For
example, the execution of scientific applications used in predicting life-threatening situations
such as earthquakes or hurricanes requires enormous computational power and storage resources.
In the past, these applications would be run on vector or parallel supercomputers costing millions
of dollars in order to calculate predictions well in advance of the actual events. Such applications
can be migrated to run on commodity off-the-shelf-based clusters and deliver comparable
performance at a much lower cost.
In fact, in many situations expensive parallel supercomputers have been replaced by low-cost
commodity Linux clusters in order to reduce maintenance costs and increase overall
computational resources. Clusters are increasingly being used for running commercial
applications. In a business environment, for example in a bank, many of its activities are
automated. However, a problem will arise if the server that is handling customer transactions
fails. The bank's activities could come to a halt and customers would not be able to deposit or
withdraw money from their accounts. Such situations can cause a great deal of inconvenience and
result in loss of business and confidence in a bank. This is where clusters can be useful. A bank
could continue to operate even after the failure of a server by automatically isolating failed
components and migrating activities to alternative resources as a means of offering an
uninterrupted service.
With the increasing popularity of the Web, computer system availability is becoming critical,
especially for e-commerce applications. Clusters are used to host many new Internet service
sites. For example, free email sites like Hotmail and search sites like Hotbot use clusters.
Cluster-based systems can be used to execute many Internet applications:
Web servers
Search engines
Email
Security
Proxy
Database servers
In the commercial arena these servers can be consolidated to create what is known as an
enterprise server. The servers can be optimized, tuned, and managed for increased efficiency and
responsiveness depending on the workload through various load-balancing techniques. A large
number of low-end machines (PCs) can be clustered along with storage and applications for
scalability, high availability, and performance.
The Linux Virtual Server [66] is a cluster of servers, connected by a fast network. It provides a
viable platform for building a more scalable, cost-effective, and reliable Internet service than a
tightly coupled multi-processor system, since failed components can be easily isolated and the
system can continue to operate without any disruption. The Linux Virtual Server directs clients'
network connection requests to the different servers according to a scheduling algorithm and
makes the parallel services of the cluster appear as a single virtual service with a single IP
address. Prototypes of the Linux Virtual Server have already been used to build many sites that
cope with heavy loads. Client applications interact with the cluster as if it were a single server.
The clients are not affected by the interaction with the cluster and do not need modification.
Application performance and scalability are achieved by adding one or more nodes to the cluster,
while high availability is achieved by automatically detecting node or daemon failures and
reconfiguring the system appropriately.
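
One scheduling algorithm commonly used in this setting is weighted round-robin, in which servers with larger weights receive proportionally more connections. The Python sketch below uses hypothetical server names and a naive weight expansion to show the idea; the actual IPVS scheduler interleaves the rotation more smoothly.

```python
# Sketch of weighted round-robin dispatch: servers with larger weights receive
# proportionally more of the incoming connections behind the single virtual IP.
from itertools import cycle

servers = {"realserver-a": 3, "realserver-b": 1}   # hypothetical names and weights

# Expand each server according to its weight, then rotate through the expanded list.
rotation = cycle([name for name, weight in servers.items() for _ in range(weight)])

for conn_id in range(8):                            # eight example connections
    print(f"connection {conn_id} -> {next(rotation)}")
```
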
Clusters have proved themselves to be effective for a variety of data mining applications. The
data mining process involves both compute and data intensive operations. Clusters provide two
fundamental roles:
Data-clusters that provide the storage and data management services for the data sets being
mined.
Compute-clusters that provide the computational resources required by the data filtering,
preparation and mining tasks.
The Terabyte Challenge [69] is an open, distributed testbed for experimental work related to
managing, mining and modelling large, massive and distributed data sets.
The Terabyte Challenge is sponsored by the National Scalable Cluster Project (NSCP) [70], the
National Center for Data Mining (NCDM) [71], and the Data Mining Group (DMG) [72]. The
Terabyte Challenge's testbed is organized into workgroup clusters connected with a mixture of
traditional and high performance networks. They define a meta-cluster as a 100-node workgroup
of clusters connected via TCP/IP, and a supercluster as a cluster connected via a high
performance network such as Myrinet. The main applications of the Terabyte Challenge include:
A high energy physics data mining system called EventStore;
A distributed health care data mining system called MedStore;
A web documents data mining system called Alexa Crawler [73];
Other applications such as distributed BLAST search, textual data mining and economic data
mining.
An underlying technology of the NSCP is a distributed data mining system called Papyrus.
Papyrus has a layered infrastructure for high performance, wide area data mining and predictive
modelling. Papyrus is built over a data-warehousing layer, which can move data over both
commodity and proprietary networks. Papyrus is specifically designed to support various cluster
configurations; it is the first distributed data mining system to be designed with the flexibility of
moving data, moving predictive models, or moving the results of local computations.
