Course Timeline
Friday
10:00-12:00  History of Cloud Computing: time-sharing, virtual machines, datacenter architectures, utility computing
12:00-13:30  Lunch
13:30-15:00  Modern Cloud Computing: economics, elasticity, failures
15:00-15:30  Break
15:30-17:00  Cloud Computing Infrastructure: networking, storage, computation models
Monday
10:00-12:00 Cloud Computing research topics: scheduling, multiple datacenters, testbeds
Problem
- Rapid innovation in cluster computing frameworks
- No single framework is optimal for all applications
- Energy efficiency means maximizing cluster utilization
- Want to run multiple frameworks in a single cluster
Examples: Apache Hama, Pregel, Pig, Dryad
Solution
Nexus is an operating system for the cluster over which diverse frameworks can run
Goals
- Scalable
- Robust (i.e., simple enough to harden)
- Flexible enough for a variety of different cluster frameworks
- Extensible enough to encourage innovative future frameworks
- Data locality is compromised if a machine is held for a long time
- Hard to account for new frameworks and changing demands -> hurts utilization and interactivity
- Frameworks can take turns accessing the data on each node
- Frameworks' shares can be resized to balance utilization and interactivity
[Figure: cluster nodes running interleaved tasks from Hadoop 1, Hadoop 2, and Hadoop 3, illustrating fine-grained sharing]
- Requires encoding each framework's semantics in the specification language, which is complex and can lead to ambiguities
- Restricts frameworks whose needs the specification language did not anticipate
Outline
- Nexus Architecture
- Resource Allocation
- Multi-Resource Fairness
- Implementation
- Results
NEXUS ARCHITECTURE
Overview
[Figure: Nexus architecture overview — Hadoop (v19 and v20) jobs and MPI jobs are submitted to their framework schedulers, which register with the Nexus master; the master offers resources from Nexus slaves, where per-framework executors (Hadoop v19, Hadoop v20, MPI) run the frameworks' tasks]
Resource Offers
[Figure sequence: a Nexus slave reports free resources to the Nexus master; the master makes a resource offer to the MPI and Hadoop schedulers; the Hadoop scheduler accepts the offer and launches a task on that slave through a new Hadoop executor]
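Below is a minimal sketch of how a framework scheduler might react to such an offer: it packs as many of its pending tasks as fit into the offered resources and returns the launch list to the master. The class and method names (Offer, resource_offer) are illustrative assumptions, not the actual Nexus API.

```python
# Hypothetical sketch of a framework scheduler consuming a resource offer.
# Names and fields are illustrative; this is not the real Nexus interface.

class Offer:
    def __init__(self, slave_id, cpus, mem_gb):
        self.slave_id = slave_id
        self.cpus = cpus
        self.mem_gb = mem_gb

class FrameworkScheduler:
    TASK_CPUS = 1      # per-task demand (assumed)
    TASK_MEM_GB = 1

    def __init__(self, pending_tasks):
        self.pending_tasks = list(pending_tasks)  # queue of task descriptions

    def resource_offer(self, offer):
        """Called by the master with free resources on one slave.

        The framework decides which of its pending tasks (if any) to launch;
        resources it does not use are re-offered to other frameworks.
        """
        launched = []
        cpus, mem = offer.cpus, offer.mem_gb
        while (self.pending_tasks
               and cpus >= self.TASK_CPUS
               and mem >= self.TASK_MEM_GB):
            task = self.pending_tasks.pop(0)
            launched.append((task, offer.slave_id, self.TASK_CPUS, self.TASK_MEM_GB))
            cpus -= self.TASK_CPUS
            mem -= self.TASK_MEM_GB
        return launched  # the master forwards these launches to the slave's executor
```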
Framework-specific scheduling
- Filters let a framework tell Nexus in advance which offers it will reject
- Timeouts can be added to filters
- Frameworks can signal when to destroy filters, or when they want to receive offers again
- Example (delay scheduling): a framework waits for offers on nodes that have its data; if it has waited longer than a given delay, it starts launching non-local tasks (sketched below)
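A toy sketch of that delay-scheduling rule, assuming a hypothetical job object that knows which hosts hold its input data; the names and the 5-second wait are illustrative, not values from the system.

```python
import time

# Toy sketch of delay scheduling on top of resource offers. Illustrative only.

LOCALITY_WAIT_SECS = 5.0   # how long to hold out for a data-local offer (assumed)

class DelaySchedulingJob:
    def __init__(self, preferred_hosts):
        self.preferred_hosts = set(preferred_hosts)  # hosts holding this job's data
        self.waiting_since = time.time()

    def should_accept(self, offered_host):
        """Accept only data-local offers until the delay expires."""
        if offered_host in self.preferred_hosts:
            self.waiting_since = time.time()   # reset the clock after a local launch
            return True
        waited = time.time() - self.waiting_since
        return waited > LOCALITY_WAIT_SECS     # give up on locality after the delay
```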
Framework Isolation
- The isolation mechanism is pluggable, due to the inherent performance/isolation trade-off
- The current implementation supports Solaris projects and Linux containers
- Both isolate CPU, memory, and network bandwidth
- Linux developers are working on disk I/O isolation
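As a rough illustration of container-based isolation, the sketch below caps an executor's CPU weight and memory using the Linux cgroup-v1 filesystem interface. This is an assumption-laden example (it must run as root on a cgroup-v1 system, and it is not the actual Nexus isolation module).

```python
import os

# Rough sketch: confine an executor process with Linux cgroups (v1 interface).
# Illustrative only; not the Nexus isolation module.

def create_container(name, cpu_shares, mem_limit_bytes, pid):
    cpu_dir = f"/sys/fs/cgroup/cpu/{name}"
    mem_dir = f"/sys/fs/cgroup/memory/{name}"
    os.makedirs(cpu_dir, exist_ok=True)
    os.makedirs(mem_dir, exist_ok=True)

    # Relative CPU weight for this executor's tasks.
    with open(os.path.join(cpu_dir, "cpu.shares"), "w") as f:
        f.write(str(cpu_shares))

    # Hard memory cap for the executor.
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_limit_bytes))

    # Move the executor process into both cgroups.
    for d in (cpu_dir, mem_dir):
        with open(os.path.join(d, "tasks"), "w") as f:
            f.write(str(pid))

# Example (hypothetical values): half the default CPU weight and a 2 GB memory cap.
# create_container("hadoop-executor-1", cpu_shares=512,
#                  mem_limit_bytes=2 * 1024**3, pid=1234)
```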
RESOURCE ALLOCATION
Allocation Policies
- Nexus picks which framework to offer resources to, and hence controls how many resources each framework can get (but not which ones)
- Allocation policies are pluggable, via allocation modules, to suit an organization's needs
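A minimal sketch of what such an allocation module could look like, implementing simple fair sharing: whenever resources free up, offer them to the framework currently using the smallest fraction of the cluster. The interface below is an assumption, not the actual Nexus module API.

```python
# Sketch of a pluggable allocation module: simple fair sharing by CPU count.
# Illustrative interface; not the real Nexus allocation-module API.

class FairShareAllocator:
    def __init__(self, cluster_cpus):
        self.cluster_cpus = cluster_cpus
        self.usage = {}                      # framework id -> CPUs currently held

    def register(self, framework_id):
        self.usage.setdefault(framework_id, 0)

    def framework_to_offer(self):
        """Pick the framework with the lowest current share of the cluster."""
        if not self.usage:
            return None
        return min(self.usage, key=lambda f: self.usage[f] / self.cluster_cpus)

    def resources_allocated(self, framework_id, cpus):
        self.usage[framework_id] += cpus

    def resources_released(self, framework_id, cpus):
        self.usage[framework_id] -= cpus
```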
[Figure: hierarchical fair sharing example at Facebook.com — the cluster's resources are split between the Spam and Ads organizations, each organization's share is divided among its users (User 1, User 2), and each user's share is divided among jobs (Job 1-Job 4), with shares changing over time]
Revocation
- Killing tasks to make room for other users
- Not the normal case, because fine-grained tasks enable quick reallocation of resources
- Sometimes necessary:
  - Long-running tasks that never relinquish resources
  - A buggy job running forever
  - A greedy user who decides to make their tasks long
Revocation Mechanism
- Revoke only when some user is below its safe share and is interested in offers
- Revoke tasks from the users farthest above their safe share (sketched below)
- A framework is warned before its task is killed
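A hedged sketch of that selection rule: reclaim tasks from the users farthest above their safe shares, warning each framework before the kill, until enough resources are recovered. The object shapes (.share, .safe_share, .tasks, .framework) are illustrative assumptions.

```python
# Sketch of the revocation policy described above. Illustrative structures only.

def pick_tasks_to_revoke(users, needed_cpus):
    """users: objects with .share, .safe_share, and .tasks; each task has
    .cpus and .framework. Returns the tasks chosen for revocation."""
    victims, reclaimed = [], 0
    # Consider users farthest above their safe share first.
    for user in sorted(users, key=lambda u: u.share - u.safe_share, reverse=True):
        surplus = user.share - user.safe_share   # CPUs we may take from this user
        for task in user.tasks:
            if reclaimed >= needed_cpus or surplus <= 0:
                break
            task.framework.warn_revocation(task)  # warn before the task is killed
            victims.append(task)
            reclaimed += task.cpus
            surplus -= task.cpus
        if reclaimed >= needed_cpus:
            break
    return victims
```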
- Giving each user a small safe share may not be enough if jobs need many machines
- Can run a traditional grid or HPC scheduler (e.g., Torque) as a user with a larger safe share of the cluster, and have MPI jobs queue up on it
[Figure: hierarchical shares with Torque running as a user — the Spam and Ads organizations divide their shares among User 1, User 2, and a Torque scheduler, with jobs (Job 1, Job 2, Job 4) and queued MPI work beneath them]
MULTI-RESOURCE FAIRNESS
What is Fair?
- Goal: define a fair allocation of resources in the cluster between multiple users
- Example: suppose we have:
  - 30 CPUs and 30 GB RAM
  - Two users with equal shares
  - User 1 needs <1 CPU, 1 GB RAM> per task
  - User 2 needs <1 CPU, 3 GB RAM> per task
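One of the policies compared below is Dominant Resource Fairness (DRF), which equalizes each user's share of its dominant resource (the resource it uses the largest fraction of). The sketch below applies DRF by progressive filling to the example above; the code and its names are an illustration, not the Nexus implementation. For these demands it gives User 1 fifteen tasks and User 2 five tasks, so both end up holding a 50% dominant share.

```python
# Dominant Resource Fairness (DRF) by progressive filling, applied to the
# example above (30 CPUs, 30 GB RAM). Illustrative sketch only.

capacity = {"cpu": 30.0, "mem": 30.0}
demands = {
    "user1": {"cpu": 1.0, "mem": 1.0},   # per-task demand
    "user2": {"cpu": 1.0, "mem": 3.0},
}

used = {"cpu": 0.0, "mem": 0.0}
tasks = {u: 0 for u in demands}

def dominant_share(user):
    # Largest fraction of any single resource this user currently occupies.
    d = demands[user]
    return max(tasks[user] * d[r] / capacity[r] for r in capacity)

def fits(user):
    d = demands[user]
    return all(used[r] + d[r] <= capacity[r] for r in capacity)

while True:
    runnable = [u for u in demands if fits(u)]
    if not runnable:
        break
    # Grant the next task to the user with the lowest dominant share.
    user = min(runnable, key=dominant_share)
    for r in capacity:
        used[r] += demands[user][r]
    tasks[user] += 1

print(tasks)  # {'user1': 15, 'user2': 5}: both users reach a 50% dominant share
```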
Fairness Properties
[Table: fairness properties (Pareto efficiency, single-resource fairness, bottleneck fairness, share guarantee, population monotonicity, envy-freedom, resource monotonicity) compared across the Asset, Dynamic, CEEI, and DRF schedulers; DRF satisfies the most properties]
IMPLEMENTATION
Implementation Stats
Frameworks
Ported frameworks:
- Hadoop (900-line patch)
- MPI (160-line wrapper scripts)

New frameworks:
- Spark, a Scala framework for iterative jobs (1,300 lines)
- Apache + haproxy, an elastic web server farm (200 lines)
RESULTS
Overhead
[Figure: dynamic fine-grained sharing — tasks from Hadoop 1, Hadoop 2, and Hadoop 3 interleaved across the cluster's nodes, with each framework's share growing and shrinking over time]
[Figure: elastic web farm experiment — the Nexus master receives status updates from Nexus slaves, each running load-generator executors and web (Apache) executors with their tasks]
Future Work
- Experiment with parallel programming models
- Further explore low-latency services on Nexus (web applications, etc.)
- Shared services (e.g., BigTable, GFS)
- Deploy to users and open source
Cloud software stacks at Amazon, Microsoft, and Google:
- Applications
- Application frameworks: MapReduce, Sawzall, Google App Engine, Protocol Buffers (Google)
- Software infrastructure:
  - VM management: EC2 (Amazon), Fabric Controller (Microsoft), Borg (Google)
  - Job scheduling: Fabric Controller (Microsoft), Borg (Google)
  - Storage management: S3, EBS (Amazon), Fabric Controller (Microsoft), GFS, BigTable (Google)
  - Monitoring: Fabric Controller (Microsoft), Borg (Google)
- Hardware infrastructure
Open-source systems at each layer:
- Application frameworks
- VM management: Eucalyptus, Enomalism, Tashi, Reservoir, Nimbus, oVirt
- Storage management: HDFS, KFS, Gluster, Lustre, PVFS, MooseFS, HBase, Hypertable
- Monitoring: Ganglia, Nagios, Zenoss, MON, Moara
- Hardware infrastructure: PRS, Emulab, Cobbler, xCat
Shared:
- Global services: sign-on, monitoring, storage
- Open-source stack (PRS, Tashi, Hadoop)
- 9 sites currently, with a target of around 20 in the next two years

Each site:
- Runs its own research and technical teams
- Contributes individual technologies
- Operates some of the global services (the HP site supports the portal and PRS, the Intel site develops and supports Tashi, Yahoo! contributes to Hadoop)
[Figure: cluster network topology — racks connected through 24-48 Gb/s switches over 1 Gb/s (x4/x8) uplinks, with 1 Gb/s point-to-point links to the nodes]
- Blade rack, 40 nodes (1 Gb/s x4x4 p2p, 48 Gb/s switch):
  - 20 nodes: 1x Xeon (1-core) [Irwindale/Pentium 4], 6 GB DRAM, 366 GB disk (36 + 300 GB)
  - 10 nodes: 2x Xeon 5160 (2-core) [Woodcrest/Core], 4 GB RAM, 2x 75 GB disks
  - 10 nodes: 2x Xeon E5345 (4-core) [Clovertown/Core], 8 GB DRAM, 2x 150 GB disks
- Blade rack, 40 nodes (1 Gb/s x4x4 p2p, 48 Gb/s switch): 2x Xeon E5345 (quad-core) [Clovertown/Core], 8 GB DRAM, 2x 150 GB disks
- 1U rack, 15 nodes (1 Gb/s x15 p2p, 48 Gb/s switch): 2x Xeon E5420 (quad-core) [Harpertown/Core 2], 8 GB DRAM, 2x 1 TB disks
- 2U rack, 15 nodes (1 Gb/s x15 p2p, 48 Gb/s switch): 2x Xeon E5440 (quad-core) [Harpertown/Core 2], 8 GB DRAM, 6x 1 TB disks
- 2U rack, 15 nodes (1 Gb/s x15 p2p, 48 Gb/s switch): 2x Xeon E5520 (quad-core) [Nehalem-EP/Core i7], 16 GB DRAM, 6x 1 TB disks
[Figure: physical rack layout (r1r1-r3r3) with PDUs providing per-port power monitoring and control; the key lists per-rack node and core counts]
Open Cirrus site characteristics (cores / nodes / public partition / memory / storage / spindles / network):
- IDA: 2,400 cores, 300 nodes, 100 public, 4.8 TB memory, 600 spindles
- Intel: 1,364 cores, 198 nodes, 145 public, 1.8 TB memory, 1 Gb/s network
- KIT: 2,048 cores, 256 nodes, 128 public, 10 TB memory, 1 Gb/s network
- UIUC: 1,024 cores, 128 nodes, 64 public, 2 TB memory, ~500 TB storage, 288 spindles, 1 Gb/s network
- CMU: 1,024 cores, 128 nodes, 64 public, 2 TB memory, 1 Gb/s network
- Yahoo! (M45): 3,200 cores, 480 nodes, 400 public, 2.4 TB memory, 1.2 PB storage, 1,600 spindles, 1 Gb/s network, Hadoop on demand
- Totals across all sites: 26.3 TB memory, 4 PB storage
Testbed Comparison
- Open Cirrus — Type of research: systems and services; Approach: federation of heterogeneous data centers; Participants: HP, Intel, IDA, KIT, UIUC, Yahoo!, CMU; Distribution: 7 (9) sites, 1,746 nodes, 12,074 cores
- IBM/Google — Type of research: data-intensive applications; Approach: a cluster supported by Google and IBM; Participants: IBM, Google, Stanford, U. Washington, MIT; Distribution: 1 site
- TeraGrid — Approach: multi-site heterogeneous clusters; Participants: many schools and organizations; Distribution: 11 partners in the US
- PlanetLab — Type of research: systems; Approach: a few hundred nodes hosted by research institutions; Participants: many schools and organizations; Distribution: >700 nodes world-wide
- EmuLab — Type of research: systems; Approach: a single-site cluster with flexible control; Participants: University of Utah; Distribution: >300 nodes at the University of Utah
- Open Cloud Consortium — Type of research: interoperability across clouds using open APIs; Approach: multi-site clusters with a focus on the network; Participants: 4 centers; Distribution: 480 cores distributed over four locations
- Amazon EC2 — Type of research: commercial use; Approach: raw access to virtual machines; Participants: Amazon
- LANL/NSF cluster — Type of research: systems; Approach: re-use of LANL's retiring clusters; Participants: CMU, LANL, NSF; Distribution: 1 site
[Figure, built up across several slides: the Open Cirrus software stack — research workloads run on Tashi, which runs on the Zoni service, alongside platform services and an experiment save/restore facility; an application can run on Hadoop, on a Tashi virtual cluster, on a PRS, or directly on real hardware]
System Organization
- Compute nodes are divided into dynamically allocated, VLAN-isolated PRS subdomains
- Example subdomains: open service research, Tashi development, production storage service, proprietary service research, open workload monitoring and trace collection
- Zoni code from HP is being merged into the Tashi Apache project and extended by Intel
- Running on the HP site; being ported to the Intel site; will eventually run on all sites
Research focus:
Location-aware co-scheduling of VMs, storage, and power. Seamless physical/virtual migration.
Joint with Greg Ganger (CMU), Mor Harchol-Balter (CMU), Milan Milenkovic (CTG)
[Figure: Tashi architecture — a scheduler and a cluster manager coordinating the nodes]
The storage service aggregates the capacity of the commodity nodes to house Big Data repositories.
[Figure: random placement vs. location-aware placement of computation and data across nodes]
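The toy sketch below contrasts the two placements in the figure: random placement scatters a task onto any node, while location-aware placement prefers a node that already stores the task's data block. The data structures and names are illustrative assumptions.

```python
import random

# Toy sketch of the two placement policies in the figure above. Illustrative only.
# block_locations maps a data block to the set of nodes holding a replica.

def random_placement(nodes, block, block_locations):
    return random.choice(nodes)

def location_aware_placement(nodes, block, block_locations):
    local_nodes = [n for n in nodes if n in block_locations.get(block, set())]
    # Prefer a node that already stores the block; otherwise fall back to random.
    return random.choice(local_nodes) if local_nodes else random.choice(nodes)

# Example (hypothetical data):
# nodes = ["n1", "n2", "n3", "n4"]
# block_locations = {"blk_7": {"n2", "n4"}}
# location_aware_placement(nodes, "blk_7", block_locations)  # -> "n2" or "n4"
```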
http://wiki.apache.org/hadoop/ProjectDescr
Provides a parallel programming model (MapReduce), a distributed file system (HDFS), and a parallel database (HBase)
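As a minimal illustration of the MapReduce programming model, the canonical word-count example is sketched below in plain Python (a conceptual sketch, not an actual Hadoop job).

```python
from collections import defaultdict

# Word count in the MapReduce style: map emits (word, 1) pairs, the framework
# groups pairs by key, and reduce sums the counts per word. Conceptual sketch.

def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                       # "map" phase
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # "reduce" phase

print(run_mapreduce(["the cloud runs the jobs", "the cloud scales"]))
# {'the': 3, 'cloud': 2, 'runs': 1, 'jobs': 1, 'scales': 1}
```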
What kinds of research projects are Open Cirrus sites looking for?
Open Cirrus is seeking research in the following areas (different centers will weight these differently):
- Datacenter federation
- Datacenter management
- Web services
- Data-intensive applications and systems
Contact names, email addresses, and web links for applications to each site will be available on the Open Cirrus web site (which goes live in Q2 2009)
http://opencirrus.org
Each Open Cirrus site decides which users and projects get access to its site.
- Explore location-aware and power-aware workload scheduling
- Develop integrated physical/virtual allocations to combat cluster squatting
- Design cloud storage models
Isolation Research
- Predictable performance (low variance) matters more than raw performance
- Some resources that people have run into isolation problems with:
  - Power, disk space, disk I/O rate (drive, bus), memory space (user/kernel), memory bus, caches at all levels (TLB, etc.), hyperthreading, CPU rate, interrupts
  - Network: NIC (Rx/Tx), switch, cross-datacenter, cross-country
  - OS resources: file descriptors, ports
Datacenter Energy
EPA, 8/2007:
- 1.5% of total U.S. energy consumption
- Growing from 60 to 100 billion kWh over 5 years
- 48% of a typical IT budget is spent on energy

- 75 MW of new datacenter deployments in PG&E's service area that they know about (expect another 2x)
- Microsoft: $500M new Chicago facility
  - Three substations with a capacity of 198 MW
  - 200+ shipping containers with 2,000 servers each
Power/Cooling Issues
Within datacenter racks, network equipment is often the hottest component in the hot spot
M. K. Patterson, A. Pratt, P. Kumar, "From UPS to Silicon: an end-to-end evaluation of datacenter efficiency", Intel Corporation
Hard state (proxying), soft state (caching), and protocol/data streamlining for power as well as bandwidth reduction
Summary
Many areas for research into Cloud Computing!
Datacenter design, languages, scheduling, isolation, energy efficiency (at all levels)
UC Berkeley
Thank you!
adj@eecs.berkeley.edu
http://abovetheclouds.cs.berkeley.edu/