
APP-CAP1426

The Benefits of Virtualization for Middleware

Jeff Battisti, Cardinal Health; Emad Benjamin, VMware, Inc.

#vmworldapps

Disclaimer

This session may contain product features that are currently under development.

This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery. Pricing and packaging for any new technologies or features discussed or presented have not been determined.

About the speaker

I have been with VMware for the last 7 years, working on Java and vSphere

20 years of experience as a Software Engineer/Architect, with the last 15 years focused on Java development

Open source contributions

Prior work with Cisco, Oracle, and Banking/Trading Systems

Authored the book Enterprise Java Applications Architecture on VMware


Agenda

Conventional Middleware Platforms

Middleware Platform Architecture on VMware vSphere

Design and Sizing Middleware Platforms

Performance Benefits of Virtualizing Middleware

Customer Success Stories

Questions

Conventional Middleware Platform

Enterprise Java applications are multitier (Client-Server); the "-" in Client-Server is essentially the Middleware.

Load Balancer Tier - Load Balancers (IT Operations Network Team)

Web Server Tier - Web Servers (IT Operations Server Team)

Java App Tier - Java Applications (IT Apps Java Dev Team)

DB Server Tier - DB Servers (IT Ops & Apps Dev Team)

Each tier maps to an organizational key stakeholder department.

Middleware Platform Architecture on vSphere


High Uptime, Scalable, and Dynamic Enterprise Java Applications: Load Balancers, Web Servers, Java Application Servers, and DB Servers all run as VMs on VMware vSphere

APPLICATION SERVICES: Capacity On Demand, Dynamic, High Availability

SHARED INFRASTRUCTURE SERVICES: shared, always-on infrastructure

VMware vFabric

Programming Model: Rich Web, Social and Mobile, Data Access, Integration Patterns, Batch Framework; tooling via Spring Tool Suite and WaveMaker; deployment to Cloud Foundry

Runtime: Java Runtime (tc Server), Web Runtime (ERS), Data Director, Messaging (RabbitMQ), Global Data (GemFire), In-mem SQL (SQLFire), App Monitoring (Spring Insight), Performance Mgmt (Hyperic), Java Optimizations (EM4J, ...)

Virtual Datacenter: Cloud Infrastructure and Management, App Director

Step 2 - Establish Benchmark

Scale Up Test - ESTABLISH BUILDING BLOCK VM

Establish vertical scalability: how many JVMs on a VM, and how large a VM should be in terms of vCPU and memory

Investigate the bottlenecked layer: Network, Storage, Application Configuration, and vSphere

If there is a building-block app/VM configuration problem, adjust and iterate; once the Response Time SLA is OK, the test is complete

Scale Out Test - DETERMINE HOW MANY VMs

Establish horizontal scalability: how many building-block VMs do you need to meet your Response Time SLAs without reaching 70%-80% CPU saturation?

Establish your horizontal scalability factor before bottlenecks appear in your application

If the scale-out bottlenecked layer is removed, iterate the scale-out test
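The scale-out iteration can be sketched as simple control flow (a sketch only; `vmsNeeded` and the SLA predicate are hypothetical stand-ins for a real load-test harness, not VMware tooling):

```java
import java.util.function.IntPredicate;

public class ScaleOutTest {
    // Add building-block VMs until the response-time SLA is met at
    // <70-80% CPU saturation; the SLA check is a caller-supplied hook.
    static int vmsNeeded(int maxVms, IntPredicate meetsSlaAtVmCount) {
        for (int vms = 1; vms <= maxVms; vms++) {
            if (meetsSlaAtVmCount.test(vms)) {
                return vms;  // SLA OK -> test complete
            }
            // else: investigate the bottlenecked layer (network, storage,
            // app config, vSphere), adjust the building block, iterate
        }
        return -1;           // SLA never met within the VM budget
    }

    public static void main(String[] args) {
        // Toy SLA check: pretend 4 building-block VMs are enough.
        System.out.println(vmsNeeded(10, vms -> vms >= 4)); // prints 4
    }
}
```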

Design and Sizing HotSpot JVMs on vSphere

VM Memory = Guest OS Memory + JVM Memory

JVM Memory includes:
JVM Max Heap (-Xmx) / Initial Heap (-Xms)
Perm Gen (-XX:MaxPermSize)
Java Stack (-Xss per thread)
Other mem: direct native memory, non-direct memory, and the rest of the virtual address space

10

Design and Sizing of HotSpot JVMs on vSphere

Guest OS Memory (depends on the OS and other processes): approx 0.5-1G

VM Memory = Guest OS Memory + JVM Memory
JVM Memory = JVM Max Heap (-Xmx value) + JVM Perm Size (-XX:MaxPermSize) + NumberOfConcurrentThreads * (-Xss) + other mem

Perm Size is an area additional to the -Xmx (Max Heap) value; it is not GC-ed because it contains class-level information. "Other mem" is additional memory required for NIO buffers, JIT code cache, classloaders, Socket Buffers (receive/send), JNI, and GC internal info.

If you have multiple JVMs (N JVMs) on a VM then: VM Memory = Guest OS memory + N * JVM Memory
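The formula can be expressed as a small helper (a sketch only; the method names are ours, and the inputs below are example values in the style of the sizing examples that follow):

```java
public class JvmMemory {
    // JVM Memory = -Xmx + -XX:MaxPermSize + threads * -Xss + other mem
    static long jvmMemoryMb(long xmxMb, long permMb, int threads, double xssMb, long otherMb) {
        return xmxMb + permMb + Math.round(threads * xssMb) + otherMb;
    }

    // VM Memory = Guest OS memory + N * JVM Memory
    static long vmMemoryMb(long guestOsMb, int jvmCount, long jvmMemMb) {
        return guestOsMb + (long) jvmCount * jvmMemMb;
    }

    public static void main(String[] args) {
        // 4096m heap, 256m perm, 100 threads at 256k (0.25m) stack, ~217m other
        long jvm = jvmMemoryMb(4096, 256, 100, 0.25, 217); // 4594
        System.out.println(vmMemoryMb(500, 1, jvm));       // prints 5094
    }
}
```

This lands in the same ~5GB-reservation ballpark as the sizing example on the next slide.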

11

Sizing Example

Set the memory reservation to 5088m:

VM Memory (5088m) = Guest OS Memory (500m used by OS) + JVM Memory (4588m)
JVM Memory = JVM Max Heap -Xmx (4096m) + Perm Gen -XX:MaxPermSize (256m) + Java Stack -Xss per thread (256k*100) + Other mem (=217m)
Initial Heap -Xms (4096m)

12

Larger JVMs for In-Memory Data Management Systems

Set the memory reservation to 34g:

VM Memory for SQLFire (34g) = Guest OS Memory (0.5-1g used by OS) + JVM Memory for SQLFire (32g)
JVM Memory = JVM Max Heap -Xmx (30g) + Perm Gen -XX:MaxPermSize (0.5g) + Java Stack -Xss per thread (1M*500) + Other mem (=1g)
Initial Heap -Xms (30g)

13

Middleware ESXi Dedicated Cluster


Locator/heartbeat middleware components - DO NOT vMotion them

Example host: 96GB RAM, 2 sockets, 8 pCPU per socket
Memory available for all VMs = 96*0.99 - 1GB => 94GB
Per NUMA node memory => 94/2 = 47GB

14

Size VMs at 8 vCPU with less than 47GB RAM each; each NUMA node has 47GB, so the ESX scheduler can keep each VM within a single NUMA node.

15

Performance Perspective

See "Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server" at http://www.vmware.com/resources/techresources/10158

[Chart: response time (R/T) and % CPU vs. load, with an 80% CPU threshold marked]

16

If given 4 vCPUs, which VM configuration is better?

1 VM: 4 vCPU, 1 JVM (4GB)
2 VMs: 2 vCPU each, 2 JVMs (2.5GB each)
4 VMs: 1 vCPU each, 4 JVMs (2GB each)

17

Most Common VM Size for Java Workloads

2 vCPU VM with 1 JVM, for tier-1 production workloads
Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
Scale-out is preferred over scale-up, but both can work
You can diverge from this ratio for less critical workloads

2 vCPU VM, 1 JVM (-Xmx 4096m), approx 5GB RAM reservation

18

However, for Large JVMs + CMS

Start with a 4+ vCPU VM with 1 JVM, for tier-1 in-memory data management systems type of production workloads
For large JVMs: 4+ vCPU VM, 1 JVM (8-128GB)
Likely increase the JVM size, instead of launching a second JVM instance
4+ vCPU VMs allow ParallelGCThreads to be allocated 50% of the available vCPUs to the JVM, i.e. 2+ GC threads
The ability to increase ParallelGCThreads is critical to YoungGen scalability for large JVMs
ParallelGCThreads should be allocated 50% of the available vCPUs to the JVM and not more; you want to ensure there are other vCPUs available for other transactions
19

Most Common Sizing and Configuration Question

Option 1 - Scale out VM and JVM (best)
Option 2 - Scale up JVM heap size (2nd best)
Option 3 - Scale out JVMs only, stacking multiple JVMs per VM (3rd best)


20

What else to consider when sizing?

Mixed workloads (Job Scheduler vs. Web app) require different GC tuning
Job Schedulers care about throughput
Web apps care about minimizing latency and response time
You can't have both reduced response time and increased throughput without compromise
Separate the concerns for optimal tuning: split Web and Job workloads into their own JVMs, scaled vertically or horizontally

21

Which GC?

ESX doesn't care which GC you select, because of the degree of independence between Java and the OS, and between the OS and the hypervisor.

22

Tuning GC - Art Meets Science!

Either you tune for throughput or for latency, one at the cost of the other.

Tuning decisions:
Reduce latency: improved R/T, with a slightly reduced throughput
Increase throughput: improved throughput, with longer R/T (increased latency impact)
23

Parallel Young Gen and CMS Old Gen

Young Generation (-Xmn), Minor GC: parallel GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads
Old Generation (Xmx minus Xmn), Major GC: concurrent GC in OldGen using -XX:+UseConcMarkSweepGC

[Diagram: survivor spaces S0/S1; minor GC threads, application threads, and concurrent mark and sweep GC run side by side]

24

High Level GC Tuning Recipe

Step A - Young Gen Tuning: measure Minor GC duration and frequency; adjust -Xmn (Young Gen size) and/or ParallelGCThreads
Step B - Old Gen Tuning: measure Major GC duration and frequency; adjust heap space (-Xmx)
Step C - Survivor Spaces Tuning: adjust -Xmn and/or the survivor spaces


25

CMS Collector Example

java -Xms30g -Xmx30g -Xmn10g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+ScavengeBeforeFullGC -XX:TargetSurvivorRatio=80 -XX:SurvivorRatio=8 -XX:+UseBiasedLocking -XX:MaxTenuringThreshold=15 -XX:ParallelGCThreads=4 -XX:+UseCompressedOops -XX:+OptimizeStringConcat -XX:+UseCompressedStrings -XX:+UseStringCache

This JVM configuration scales up and down effectively: -Xmx = -Xms, and -Xmn is about 33% of -Xmx. -XX:ParallelGCThreads = minimum 2, but less than 50% of the vCPUs available to the JVM. NOTE: ideally use this for VMs with 4+ vCPUs; if used on 2-vCPU VMs, drop the -XX:ParallelGCThreads option and let Java select it.
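The ParallelGCThreads guidance (minimum 2, at most 50% of the JVM's vCPUs, and omitted entirely on 2-vCPU VMs) can be captured as a small sketch; the method name is ours, not a JVM API:

```java
import java.util.OptionalInt;

public class GcThreadSizing {
    // On 2-vCPU VMs, drop -XX:ParallelGCThreads and let the JVM choose;
    // otherwise use at least 2 threads but no more than 50% of the vCPUs.
    static OptionalInt parallelGcThreads(int vcpus) {
        if (vcpus <= 2) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Math.max(2, vcpus / 2));
    }

    public static void main(String[] args) {
        System.out.println(parallelGcThreads(8)); // prints OptionalInt[4]
        System.out.println(parallelGcThreads(2)); // prints OptionalInt.empty
    }
}
```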
26

Middleware on VMware Best Practices


Enterprise Java Applications on VMware Best Practices Guide: http://www.vmware.com/resources/techresources/1087

Best Practices for Performance Tuning of Latency-Sensitive Workloads in vSphere VMs: http://www.vmware.com/resources/techresources/10220

High Performance Data with VMware vFabric GemFire Best Practices Guide: http://www.vmware.com/resources/techresources/10231

27

Middleware on VMware Best Practices Summary

Follow the design and sizing examples we discussed thus far
Set an appropriate memory reservation
Leave HT enabled; size based on vCPU = 1.25 pCPU if needed
RHEL 6 and SLES 11 SP1 have a tickless kernel that does not rely on a high-frequency interrupt-based timer, and is therefore much friendlier to virtualized latency-sensitive workloads
Do not overcommit memory
Locator/heartbeat processes should not be vMotion migrated; doing so can lead to network split-brain problems
vMotion over 10Gbps when doing scheduled maintenance
Use Affinity and Anti-Affinity rules to avoid redundant copies on the same VMware ESX/ESXi host

28

Middleware on VMware Best Practices

Disable NIC interrupt coalescing on the physical and virtual NICs - extremely helpful in reducing latency for latency-sensitive virtual machines
Disable virtual interrupt coalescing for VMXNET3
These changes can lead to performance penalties for other virtual machines on the ESXi host, as well as higher CPU utilization to deal with the higher rate of interrupts from the physical NIC
This implies it is best to use a dedicated ESX cluster for middleware platforms: all hosts are configured the same way for latency sensitivity, and this ensures non-middleware workloads, such as other enterprise applications, are not negatively impacted

29

SQLFire vs. Traditional RDBMS

[Charts: SQLFire vs. RDBMS scalability and response-time comparisons]

30


SQLFire vs. Traditional RDBMS

SQLFire scaled 4x compared to the RDBMS
Response times of SQLFire are 5x to 30x faster than the RDBMS
Response times of SQLFire remain stable and constant with increased load, while RDBMS response times increase with load

32

Middleware on VMware Benefits

Flexibility to change compute resources, VM sizes, and add more hosts
Ability to apply hardware and OS patches while minimizing downtime
Create a more manageable system through reduced middleware sprawl
Ability to tune the entire stack within one platform
Ability to monitor the entire stack within one platform
Ability to handle seasonal workloads: commit resources when they are needed, and remove them when they are not

33

Cardinal Health Java on WebSphere

Jeff Battisti
Sr. Enterprise Architect August 30, 2012

Copyright 2011, Cardinal Health, Inc. or one of its subsidiaries. All rights reserved.

About Cardinal Health


Founded in 1971
Leading provider of products and services across the healthcare supply chain; extensive footprint across multiple channels
Serving >50,000 customers at >60,000 healthcare sites across North America daily
Approximately one-third of all distributed pharmaceutical, laboratory and medical products in the U.S. and Puerto Rico flow through the Cardinal Health supply chain
More than 30,000 employees; direct operations in 10 countries
Number 19 on Fortune magazine's list of the 500 largest U.S. corporations

The business behind healthcare with the broadest view of the supply chain


Agenda

Virtualization journey
Why virtualize WebSphere on VMware
Factors impacting migration/expansion
Virtualization questions to answer: performance and scalability, high availability, licensing

Summary

36


Virtualization Journey

Theme: from Centralized IT Shared Service (capital intensive, high response) to Variable Cost Subscription Services

2005-2008 - Consolidation:
<40% virtual, <2,000 VMs, <2,355 physical
Data Center Optimization: 30 DCs to 2 DCs
HW: transition to blades, <10% utilization, <10:1 VM/physical
SW: low-criticality systems, 8x5 applications

2009-2012 - Internal cloud:
>81% virtual, >4,054 VMs, <1,147 physical
Power remediation, P2Vs on refresh
HW: commoditization, 25% utilization, 60:1 VM/physical
SW: business-critical systems - WebSphere ~490, Unix to Linux ~655, SAP ~550

2013-2016 - Cloud resources:
>95% virtual, >10,000 VMs, <800 physical
Optimizing DCs, internal disaster recovery, metered service offerings (SaaS, PaaS, IaaS)
HW: shrinking footprint, >50% utilization, >100:1 VM/physical
SW: cloud computing, virtualized databases, open source migrations

37


Virtualization - Why Virtualize WebSphere on VMware

DC strategy alignment:
Pooled resources capacity (~15% utilization)
Elasticity for changing workloads
Unix to Linux (>$36 million savings)
Disaster recovery without license implications

Simplification and manageability:
High availability for thousands, instead of thousands of high-availability solutions
Network & system management in the DMZ

Five-year cost savings ~ $6 million:
Hardware savings ~ $660K
WAS licensing ~ $862K
Unix to Linux ~ $3.7M
DMZ ports ~ >$1M

38


Virtualization

Questions to Answer

Alignment across teams and consistent messaging to stakeholders was critical.

Can the system perform & scale?
JVM stacking & vertical scaling
Massive OLTP performance
Memory: overcommitment, DRS, and large JVMs
IO impacts

Will VMware provide challenges to the WebSphere HA model?
Affinity/anti-affinity
Complexity must remain low
Storage, VMware, servers, chassis, and switches cannot be a single point of failure
User error should not take down both sides of critical applications

Can we maintain or reduce WebSphere license costs?
JVM stacking
Server affinity
Overcommitment in non-production systems

39


Factors Impacting Migration/Expansion

150% user growth, and 90% of traffic falls in a 3-hour time frame
Moving to a new data center
New DMZs in both data centers
AIX to SUSE Linux migration
Virtualizing on VMware

Upgrading the WebSphere stack:
WebSphere Commerce 6.0 to 7.0
WebSphere Portal 6.1.5 to 7.0
WebSphere Content Mgr 6.1.5 to 7.0
WAS services layer: WAS 6.1 to WAS 7.0
DB2 platform 9.5 to 9.7
J2EE 1.4 | 32-bit to JEE 6 | 64-bit
SiteMinder upgrade

Non-functional enhancements:
Central logging framework
Remote portlet rendering
Positioning for future cross-datacenter failover
64-bit JVM and JEE 6 support

Systems: 111 DB instances, 31 Portal servers, 19 Commerce servers, 15 WAS servers, 7 WCM servers, 3 BODL servers, 11 Business Objects servers, 15 Endeca servers, 9 PlanetPress servers

40


Performance & Scalability

WebSphere VM Performance

The sweet spot is between 2-4 cores
JVM stacking/vertical scaling was reduced, but is still in place for licensing reasons
~20% performance difference between 2 and 4 cores

Source: http://www.vmware.com/resources/techresources/10095

Performance & Scalability

Intel vs. Power 6 Java Performance

Intel systems outperformed Power systems by over 300%

[Chart: SpecJbb scores - 2,478,929; 1,433,000; 509,962; 350,642 - across HP BL490 Nehalem 2.93 GHz (8 cores), HP DL580 G7 (32 cores*), IBM Power7 750 3.55 GHz (32 cores), and IBM Power6 550 4.2 GHz (8 cores)]


Order Express Performance

[Chart: Version 1 vs. Version 2 response times and % difference for View Dashboard, Order Submit, Buy Now, View Product Details, Quick Add, Search Results, and View Cart - differences ranging from 13% to 454% - under 150% customer growth]


High Availability

Low complexity: high availability for thousands, instead of thousands of high-availability solutions

Challenges: affinity/anti-affinity, complexity, no SPOFs, user error


44


Licensing

WebSphere Non-Production Model

Using server affinity with failover and CPU over-allocation, we save millions in licensing

45


Summary

Culture shift from "virtualization to save costs" to "virtualization for superior capabilities"
We are continually improving performance, resiliency, availability, and manageability
The simplicity of this design enables us to effectively and efficiently manage these systems


Thank you - any questions?

Emad Benjamin, ebenjamin@vmware.com
You can get my book here: https://www.createspace.com/3632131

47

Backup slides

48

ESX Memory Management

Ballooning makes the Guest OS aware that the ESX host is short on memory. Due to VM isolation, the Guest OS does not know that it is in a VM, and is not aware of the state of other VMs or of the ESX host's memory situation.

The balloon driver is loaded into the Guest OS and communicates with ESX via a private channel; ESX instructs it to inflate (allocate Guest OS physical pages) by the amount of memory it needs to reclaim.

49

Why did we use Memory Reservation?

Memory Reservation

We used 5088MB as the memory reservation for the VM, the sum of all the various memory segments.

This is the physical ESX host memory guaranteed to be available to the VM upon startup, such that memory overcommitment is avoided.

Hence, ESX memory management techniques such as ballooning and swapping are avoided, in order to preserve performance.

50

ESX Memory Management

Guest virtual memory is mapped to Guest physical memory, which is in turn mapped to ESX physical (machine) memory.

51

ESX Memory Management

Consider an overcommitted memory situation: only 4GB of physical memory is available, but 6GB has been allocated to the VMs.

ESX uses Transparent Page Sharing (TPS), ballooning, and host swapping to support memory reclamation in an overcommitted situation.

52

GC Policy Types

Concurrent GC (CMS): Concurrent Mark and Sweep, no compaction. "Concurrent" implies that when GC is running it doesn't pause your application threads - this is the key difference from throughput/parallel GC. Suited for applications that care more about response time than throughput. CMS uses more heap when compared to throughput/ParallelGC. CMS works on the Old gen concurrently, but the young generation is collected using ParNewGC, a parallel version of the throughput collector. It has multiple phases: initial mark (short pause), concurrent mark (no pause), pre-cleaning (no pause), re-mark (short pause), and concurrent sweeping (no pause).

G1: Only in Java 7 and mostly experimental; equivalent to CMS + compacting.

53

ESX Memory Management

ESX host swapping is a last resort, and things are quite bad at this stage: if TPS and ballooning didn't reclaim enough, ESX will swap VM memory to the swap file.

54

Impact of Reducing Young Generation (-Xmn)


Young Gen Minor GC: more frequent Minor GCs, but of shorter duration

Old Gen Major GC: potentially increased Major GC duration

You can mitigate the increase in Major GC duration by decreasing -Xmx

55

Increasing Survivor Ratio Impact on Old Generation


Young Gen Minor GC / Old Gen Major GC: increased tenuring/promotion to the Old Gen, hence increased Major GC work

56

Why are the Duration and Frequency of GC Important?

We want to ensure regular application user threads get a chance to execute in between GC activity.

Measure both the duration and the frequency of Young Gen minor GC and of Old Gen GC.

57

Parallel Young Gen and CMS Old Gen

Young Gen Minor GC: parallel/throughput GC in YoungGen using -XX:+UseParNewGC and -XX:ParallelGCThreads
Old Gen Major GC: concurrent mark and sweep using -XX:+UseConcMarkSweepGC

[Diagram: minor GC threads and application user threads; concurrent mark and sweep runs alongside the application]

58


CMS Collector Example

-Xmn16g: Fixed-size Young Generation.

-XX:+UseConcMarkSweepGC: The concurrent collector is used to collect the tenured generation; it does most of the collection concurrently with the execution of the application. The application is paused for short periods during the collection.

-XX:+UseParNewGC: A parallel version of the young generation copying collector is used with the concurrent collector. This sets whether to use multiple threads in the young generation (with CMS only). By default, this is enabled in Java 6u13 (probably any Java 6) when the machine has multiple processor cores.

-XX:CMSInitiatingOccupancyFraction=51: Sets the percentage of the heap that must be full before the JVM starts a concurrent collection in the tenured generation. The default is somewhere around 92 in Java 6, but that can lead to significant problems. Setting this lower allows CMS to run more often (all the time, sometimes), but it often clears more quickly and helps avoid fragmentation.

61

CMS Collector Example (continued)

-XX:+UseCMSInitiatingOccupancyOnly: Indicates that all concurrent CMS cycles should start based on -XX:CMSInitiatingOccupancyFraction=51.
-XX:+ScavengeBeforeFullGC: Do a young generation GC prior to a full GC.
-XX:TargetSurvivorRatio=80: Desired percentage of survivor space used after scavenge.
-XX:SurvivorRatio=8: Ratio of eden/survivor space size.

62

CMS Collector Example (continued)

-XX:+UseBiasedLocking: Enables a technique for improving the performance of uncontended synchronization. An object is "biased" toward the thread which first acquires its monitor via a monitorenter bytecode or synchronized method invocation; subsequent monitor-related operations performed by that thread are relatively much faster on multiprocessor machines. Some applications with significant amounts of uncontended synchronization may attain significant speedups with this flag enabled; some applications with certain patterns of locking may see slowdowns, though attempts have been made to minimize the negative impact.

-XX:MaxTenuringThreshold=15: Sets the maximum tenuring threshold for use in adaptive GC sizing. The current largest value is 15. The default value is 15 for the parallel collector and 4 for CMS.

63

CMS Collector Example (continued)

-XX:ParallelGCThreads=6: Sets the number of garbage collection threads in the young and old parallel garbage collectors. The default value varies with the platform on which the JVM is running.

-XX:+UseCompressedOops: Enables the use of compressed pointers (object references represented as 32-bit offsets instead of 64-bit pointers) for optimized 64-bit performance with Java heap sizes less than 32GB.

-XX:+OptimizeStringConcat: Optimize String concatenation operations where possible. (Introduced in Java 6 Update 20.)

-XX:+UseCompressedStrings: Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release.)

-XX:+UseStringCache: Enables caching of commonly allocated strings.

64

IBM JVM - GC choice

-Xgcpolicy:optthruput (default): Performs the mark and sweep operations during garbage collection when the application is paused, to maximize application throughput. Mostly not suitable for multi-CPU machines. Example: apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause.

-Xgcpolicy:optavgpause: Performs the mark and sweep concurrently while the application is running, to minimize pause times; this provides the best application response times. There is still a stop-the-world GC, but the pause is significantly shorter. After GC, the app threads help out and sweep objects (concurrent sweep). Example: apps sensitive to long latencies; transaction-based systems where response times are expected to be stable.

-Xgcpolicy:gencon: Treats short-lived and long-lived objects differently, to provide a combination of lower pause times and high application throughput. Before the heap fills up, each app thread helps out and marks objects (concurrent mark). Example: latency-sensitive apps where objects in the transaction don't survive beyond the transaction commit.

65

jRockit JVM - GC choice

-Xgc:throughput (default), -Xgc:genpar, -Xgc:singlepar (non-gen), -Xgc:parallel (non-gen): Optimizes for maximum throughput. Example: apps that demand high throughput but are not very sensitive to the occasional long garbage collection pause.

-Xgc:pausetime, -Xgc:gencon, -Xgc:singlecon (non-gen): Optimizes for short and even pause times; the default pause target is 500ms. Can use -XpauseTarget:time. The pause target affects application throughput: a lower pause target inflicts more overhead on the memory management system. Example: apps sensitive to long latencies; transaction-based systems where response times are expected to be stable.

-Xgc:deterministic: Optimizes for very short and deterministic pause times. Can use -XpauseTarget:time. Example: apps with deterministic latencies; transaction-based applications such as brokerage.

66

What is the practical limit for JVM memory sizing? (not to scale)

The most limiting practical sizing factor is the per-NUMA-node RAM.

From smallest to largest: per-NUMA-node RAM < physical server limit (~256GB, <1TB) < ESXi 5 limit (32 vCPU, 1TB RAM) < Guest OS limit (1 to 16TB) < 64-bit Java theoretical limit (16 exabytes)

67

What are the practical and theoretical limits of JVM sizes


Java is 64-bit, hence the theoretical limit is 16 exabytes. Guest OS limits: Windows 2008 (64-bit) is 2TB of RAM; RHEL 5 is 1TB; RHEL 6 is 2TB; SUSE 11 is 16TB. ESXi 5 limits are 32 vCPU and 1TB of RAM. The practical NUMA localization limit depends on the amount of RAM available in the server hardware you select, divided by the number of CPU sockets on the server.

NUMA Local Memory = Total RAM on Server/Number of Processors
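For example, applying the formula above to the 96GB, 2-socket host used earlier in the deck (a direct restatement in code; the method name is ours):

```java
public class NumaLocalMemory {
    // NUMA Local Memory = Total RAM on Server / Number of Processor sockets
    static double numaLocalGb(double totalRamGb, int sockets) {
        return totalRamGb / sockets;
    }

    public static void main(String[] args) {
        // 96GB host, 2 sockets -> 48GB per NUMA node (before ESX overhead)
        System.out.println(numaLocalGb(96, 2)); // prints 48.0
    }
}
```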

GC tuning knowledge tiers: <4GB, <12GB, <32GB, and >32GB


68

Next limitation is GC tuning knowledge for each JVM size:

<4GB: enterprise web apps (internal/external)
4GB to 12GB: large enterprise apps
12GB to 32GB: large monitoring systems, other large web-scale public apps
32GB to 128GB: large distributed data platforms (trading systems)

69

Less than 4GB JVMs

90% of enterprise web apps will fit into this category
The ease of scale-out in Java allows you to keep your JVM size small; you can add more VMs in horizontal scale-out fashion to service more traffic
The throughput collector (-XX:+UseParallelOldGC) should be good enough
Benefit from automatic 32-bit address compression even though you are on a 64-bit JVM; most JVMs do this automatically
70

4GB to 12GB JVMs

Large enterprise applications
Could be applications that don't have good horizontal scalability built in, or that just need the larger heap
Medium amount of GC tuning needed: Minor GC frequency, -Xmn, and Full GC duration adjustments start to become a consideration
The throughput collector -XX:+UseParallelOldGC should be good enough
Use Large Pages if the application consumes a lot of data within a single thread of allocation
You will need to turn on HotSpot -XX:+UseCompressedOops for heaps up to 32GB; in Java 6 update 18 and later this is enabled by default

71

12GB to 32GB JVMs

Large enterprise systems and monitoring tools, for example; other large web-scale public apps
Could be applications that don't have good horizontal scalability built in, or that just need the larger heap
Medium amount of GC tuning needed: Minor GC frequency, -Xmn, and Full GC duration adjustments start to become a consideration
Use Large Pages
CMS will likely be used if latency and response-time targets are not met due to GC pauses
You will need to turn on HotSpot -XX:+UseCompressedOops for heaps up to 32GB; in Java 6 update 18 and later this is enabled by default

72

Greater than 32GB to 128GB JVMs

Large distributed data platforms (trading systems)
Could be applications that don't have good horizontal scalability built in, or that just need the larger heap
Extensive amount of GC tuning needed: Minor GC frequency, -Xmn, Full GC duration adjustments, survivor space sizing, etc.
ParallelGCThreads set to 50% of available cores
All about CMS, i.e. -XX:+UseConcMarkSweepGC; use Large Pages if needed
-XX:+UseCompressedOops is no longer applicable above a 32GB heap
The UseNUMA flag doesn't work with CMS, so you may have to rely on numactl and/or ESX
128GB is the extent of most financially feasible servers, assuming a 256GB server with 2 sockets

73

Sizing Large JVMs

Set the memory reservation to 31955m:

VM Memory for GemFire (31955m) = Guest OS Memory (500m used by OS) + JVM Memory for GemFire (31455m)
JVM Memory = JVM Max Heap -Xmx (29696m) + Perm Gen -XX:MaxPermSize (256m) + Java Stack -Xss per thread (192k*100) + Other mem (=1484m)
Initial Heap -Xms (29696m)

74

Sizing large JVMs

-XX:+UseNUMA in the HotSpot JVM:
Only available since Java 6 update 2
Only available with -XX:+UseParallelOldGC and -XX:+UseParallelGC
Only resort to NUMA tuning for a large JVM that spans multiple memory/processor nodes
You can check N%L in esxtop; if it is not 100%, the workload is not NUMA-local

ESXi 5 NUMA optimizations are effective for the majority of cases.

In Linux, for singular JVMs use numactl interleave; for multiple JVMs you could use cpubind=<nodenumber> and memnode=<nodenumber>. We typically don't see the need to do this; the ESX scheduler does a pretty good job.

75

Sizing large JVMs and the Java Stack (-Xss)

Increasing Java stack may improve memory intensive workloads


In regular, small heap spaces <4GB found in typical webpps built with Java we
would decreased the default stack size in order to increase scalability

However, in memory-intensive workloads, if sufficient objects are created within


one thread and do NOT escape to another thread, you can: increase -Xss into the 1MB to 2MB range, but it should stay within the L1 and L2 cache range. You may still get an improvement in execution speed if you slightly exceed the
available L1 cache

This limits the number of horizontally scaled-out or concurrent threads you can fit
within the memory space

This is more suited to large JVMs and data-intensive in-memory data management
systems, e.g. distributed caches

We are trying to localize as much of the execution as possible within L1, L2, L3, and the
NUMA node, in that order of priority
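A sketch of the two -Xss settings discussed above; both values are starting points to measure against, and app.jar is a placeholder:

```shell
# Typical web app, small heap: shrink stacks to fit more threads
java -Xss256k -Xmx2g -jar app.jar

# Memory-intensive, thread-local workload: larger stacks, fewer threads
java -Xss2m -Xmx64g -jar app.jar
```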

76

Most Common VM Size for Java workloads

2 vCPU VM with 1 JVM, for tier-1 production workloads
Maintain this ratio as you scale out or scale up, i.e. 1 JVM : 2 vCPU
Scale-out preferred over scale-up, but both can work
You can diverge from this ratio for less critical workloads

2 vCPU VM, 1 JVM (-Xmx4096m), approx. 5GB RAM reservation
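The building block above, as a hedged sketch (app.jar is a placeholder); note the ~5GB memory reservation is set on the vSphere side, not on the java command line:

```shell
# 1 JVM per 2 vCPU VM: 4GB heap, leaving ~1GB for the guest OS and
# off-heap JVM overhead within the ~5GB VM memory reservation
java -Xms4096m -Xmx4096m -jar app.jar
```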

77

However for Large JVMs + CMS


Start with a 4+ vCPU VM with 1 JVM for tier-1
in-memory data management system type production workloads. For large JVMs: 4+ vCPU VM, 1 JVM (8-128GB)

Likely increase the JVM size instead of


launching a second JVM instance

A 4+ vCPU VM allows


ParallelGCThreads to be allocated 50% of the available vCPUs to the JVM, i.e. 2+ GC threads

The ability to increase ParallelGCThreads is


critical to YoungGen scalability for large JVMs

ParallelGCThreads should be allocated 50%


of the available vCPUs to the JVM and not more. You want to ensure there are other vCPUs available for other transactions
78

Large RAM NUMA Nodes on ESX Hosts

Most ESX hosts have memory in the 48-144GB


range; it can be more in some cases, e.g. 384GB

In order to consume this amount of RAM,


designers are forced to increase the traditional JVM size

Take an example of 128GB with 2 sockets,


8 cores each

If we assume 2 vCPUs per JVM, then we


can have 8 JVMs/VMs

ESX memory overhead is about 1%: 128 * 0.99 / 8 => 15.84GB per VM. The Java process typically needs
about 25% in addition to the Java heap, hence the JVM heap can be 11.88GB
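The host-level arithmetic above can be sketched as follows; the class and method names are ours, and the 25% overhead figure is the slide's rule of thumb, not a measured value:

```java
// Sketch: divide a host's RAM among VMs, then size each JVM heap.
public class NumaHeapSizing {
    // RAM left per VM after ESX overhead (e.g. 1%) is taken off the top.
    public static double perVmGb(double hostGb, double esxOverhead, int vmCount) {
        return hostGb * (1.0 - esxOverhead) / vmCount;
    }

    // The slide budgets ~25% of VM memory for non-heap JVM needs,
    // leaving ~75% for the Java heap.
    public static double maxHeapGb(double vmGb) {
        return vmGb * 0.75;
    }

    public static void main(String[] args) {
        double vmGb = perVmGb(128, 0.01, 8);   // 15.84 GB per VM
        System.out.printf("Per VM: %.2fGB, max heap: %.2fGB%n",
                          vmGb, maxHeapGb(vmGb));
    }
}
```

With the slide's inputs this reproduces 15.84GB per VM and an 11.88GB heap.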
79

An Example ESX Host BIOS Configuration

80

Performance Perspective

See Performance of Enterprise Java Applications on VMware vSphere 4.1 and SpringSource tc Server at http://www.vmware.com/resources/techresources/10158

81

Performance Perspective

The 90th percentile response-time curves and CPU utilizations for


the 2 CPU native and virtualized cases.

Below 80% CPU utilization in the VM, the native and virtual
configurations have essentially identical performance, with only minimal absolute differences in response-times.

[Chart: 90th percentile R/T and % CPU vs. load, with the 80% CPU utilization threshold marked]

82

Performance Perspective

The 90th percentile response-time curves for increasing load in the


4 CPU native and virtualized cases.

The native and virtual configurations have essentially identical


performance across all loads, with only minimal differences in response-times

[Chart: 90th percentile R/T and % CPU vs. load for the 4 CPU cases, with the 80% CPU utilization threshold marked]

83

Performance Perspective
Shows the peak throughput for a single instance of Olio running on tc Server, both natively and in a VM, with 1, 2, and 4 CPUs

84

If given 4 vCPUs, which VM configuration is better?



1 VM
4vCPU 1 JVM 4GB

2 VMs
2vCPU on each VM 2 JVMs 2.5 GB each

4 VMs
1vCPU on each VM 4 JVMs 2 GB each

85

Performance Perspective

Lowest CPU %, best R/T:

Number of vCPUs per VM       1      2      4
Number of VMs                4      2      1
Per-VM maximum heap size     2GB    2.5GB  4GB
Total heap for 4 vCPU case   8GB    5GB    4GB

Best case: the 2 vCPU / 2 VM configuration

86

If given 4 vCPUs, which VM Configuration is better?



1 VM
4vCPU 1 JVM 4GB

2 VMs
2vCPU on each VM 2 JVMs 2.5 GB each

4 VMs
1vCPU on each VM 4 JVMs 2 GB each

87

Most Common Sizing and Configuration Question

Option 1: JVM/VM Horizontal Scalability


(best option)
4 x 2 vCPU VMs
1 JVM on each VM
4GB heap on each JVM


Option 2: JVM Vertical Scalability


(second best option)
2 x 2 vCPU VMs
1 JVM on each VM
8GB heap on each JVM
Reduces JVM/VM sprawl

88

Most Common Sizing and Configuration Question

Option 3: JVM Stacking (third best option)
1 x 4 vCPU VM
2 JVMs on the VM
4GB heap on each JVM
Reduces JVM/VM sprawl and reduces OS count
Increased number of JVM instances on a single OS
While the JVM-to-vCPU ratio is still 1 JVM : 2 vCPU, it is still not as prudent as Option 1

It would likely not perform well if you configured a 2 vCPU VM: in busy JVMs, GC would require 1 vCPU while other user
transactions would take the additional available vCPUs

89

Performance Perspective
2 VMs with 2 vCPUs each and 2.5GB RAM on each VM showed the best throughput for the amount of memory used and the R/T achieved

90

Performance Perspective
Scaling of Peak Throughput

2 vCPU VMs show the best scalability when horizontally scaled.

91

Fill out a survey: every completed survey is entered into a drawing for a $25 VMware company store gift certificate
