Table of contents

Executive summary
ccNUMA: an introduction
ccNUMA infrastructure
HP-UX 11i v3 and LORA mode
Virtualization, CLM, and LORA mode
Oracle's NUMA optimizations
Enabling or disabling the Oracle NUMA optimizations
Determining the ccNUMA configuration
How the optimizations work
Configuring the server and Oracle to match one another
A clear case for NUMA optimization: small vPar in large nPar
Dynamic reconfiguration considerations with ccNUMA optimizations
Summary
Appendix: Oracle versions, default NUMA enablement
For more information
Executive summary
Many HP servers are based on a ccNUMA (cache-coherent Non-Uniform Memory Access) architecture; these servers provide optional features which allow programmers to optimize application performance by tailoring their applications to the non-uniform nature of the underlying host server. Depending on the application and workload, taking advantage of these features to reduce higher-latency memory references can yield modest performance gains. Starting with release 10g, the Oracle database includes such NUMA-optimization features; these can be enabled or disabled at database startup time, and they are enabled by default on some Oracle versions and disabled on others. To be effective, these optimizations require that the server's memory is allocated mainly to the individual locality domains (cell-local memory or socket-local memory) rather than to the system as a whole (interleaved memory); newer HP ccNUMA servers default to such a configuration, but older ccNUMA servers do not. For performance reasons, it is very important that the setting of Oracle's NUMA optimizations matches the configuration of the server: never rely on the default settings and configurations to produce an optimal result. Furthermore, dynamic resource reconfiguration will have significant consequences for any running NUMA-optimized Oracle Database instances. Because the optimizations are configured by Oracle only at the time the instance starts up, subsequent changes to the structure (number of locality domains) of the underlying host will, at best, result in the instance being optimized for the wrong structure (which will incur sub-optimal performance); at worst, they may cause the database instance to fail.
Thus, if dynamic reconfiguration is to be undertaken, either disable Oracle's NUMA optimizations, or manage any dynamic reconfiguration very carefully, following the recommendations spelled out in a companion white paper, "Dynamic server resource allocation with Oracle Database 10g or 11g on an HP-UX ccNUMA-based server".
Table 1. Summary of rules governing Oracle NUMA optimization with dynamic system reconfiguration

Oracle NUMA optimization disabled or off (ensure adequate interleaved memory for Oracle SGA): OK
Oracle NUMA optimization enabled (ensure adequate cell-local memory in each domain for Oracle SGA)*: OK with restrictions (see companion white paper), but generally not recommended
Applicable servers: All ccNUMA-based HP-UX servers (all cell-based servers and all servers based on the Intel Itanium 9300 processor) which are running Oracle Database 10g or later.
Target audience: Oracle DBAs and those who administer ccNUMA HP-UX systems which will host Oracle databases.
Support: This white paper makes recommendations based on HP's best knowledge of the stated configuration options at this time (June 2010). We do not and cannot imply Oracle support for any of these configuration options; statements of Oracle support can only be made by Oracle.
Figure 1. Traditional Uniform-Memory-Access (UMA) design. All processors, and all processes (P1, P2, etc.), have equal latency to their objects in any part of memory. The server is scaled up by connecting more processors to the bus (which eventually causes the bus to become a bottleneck and reduces scalability).
ccNUMA: an introduction
Cache-coherent non-uniform memory access (ccNUMA) is an architectural technique which has been used successfully to build large, multi-processor servers in a way that avoids the bus bottlenecks which characterize traditional uniform server design (see Figure 1). HP's first generation of ccNUMA servers is cell-based: these servers are composed of multiple smaller cells of CPUs, memory, and I/O cards linked together by at least one low-latency crossbar switch. These individual cells can be operated independently or in groups of two or more cells; each such operating unit is known as an nPar or hardware partition. When multiple cells are incorporated into the same hardware partition, processes running in any cell can access memory owned by any other cell (Figure 2). Accessing memory from a remote cell results in slightly greater latency than an access to memory in a process's own cell. Since memory-access time is a component of CPU-busy time, an increase in memory latency equals an increase in CPU-busy time.
Table 2. HP servers based on ccNUMA architecture
Cell-based ccNUMA servers rp7410, rp7420, rp7440, rx7620, rx7640, rp8400, rp8420, rp8440, rx8620, rx8640, HP 9000 Superdome, HP Integrity Superdome
Itanium 9300-based ccNUMA servers rx2800 i2, BL860c i2, BL870c i2, BL890c i2, Superdome 2
In 2010, HP introduced new Integrity servers based on the latest Intel Itanium processor (the four-core/eight-hyper-thread Itanium 9300 series). Like their predecessors, these servers are modular in nature: the blade-based two-socket BL860c i2 can be used by itself, doubled to form a four-socket server (BL870c i2), or quadrupled to form an eight-socket server (BL890c i2). A non-modular rack-mounted version of the two-socket server (rx2800 i2) is also available.

The Itanium 9300 is designed so that each processor socket has its own memory; all memory in these servers is allocated (more or less) evenly among the sockets. In a two-socket server, a process will have faster access to objects in the memory on the same socket as the core where the process is running than it would to objects in the memory of the other socket. When multiple two-socket blades are lashed together to form larger servers, in addition to the latency differential between the sockets in the same blade, there are further increases in latency when accessing memory objects on the other blades which comprise the server. So while previous ccNUMA servers featured locality domains based on the cell concept, these newest ccNUMA servers' locality domains are based on their socket layout.

ccNUMA-based servers provide excellent performance and scalability without requiring that workloads be adapted in any way for the ccNUMA architecture. However, certain application workloads may realize small or even modest performance gains if they are made aware of the server's hierarchical memory layout and are able to adapt themselves to minimize interactions between the locality domains (the cells or sockets) that comprise the server. By restricting a process to a certain domain and assigning its memory objects to the memory of that same domain, a process will minimize memory latency. Processes which do large amounts of data manipulation in memory are thus candidates for ccNUMA optimization.
Figure 2. Non-Uniform Memory Access. Small UMA servers (cells) are connected together through one or more high-speed switches to form a larger server. Any process has access to any region of memory, but latency times are lower if the object is closer to the accessing process. In this example, P1's access to its object is faster than P2's, which is faster than P3's.
ccNUMA infrastructure
While the memory in a ccNUMA server is organized in a hierarchical manner, applications are traditionally designed with the assumption that all parts of memory are equal. Thus, some or all memory in a ccNUMA server can be designated as interleaved memory (ILM), which is designed to emulate, as much as possible, a uniform architecture. Interleaved memory is managed as a single unit across all locality domains in the server or partition; an object placed in ILM will be striped across all locality domains, and access times, on average, will be the same regardless of where (in which locality domain) the accessing process is located.

The alternative to configuring memory as ILM is to configure it in its native ccNUMA state, which is called Cell-Local Memory (CLM), or sometimes Socket-Local Memory (SLM). CLM/SLM is configurable on a per-cell/per-socket basis; an object placed in CLM (SLM) in a cell or socket will be completely contained within that cell or socket. In general, one would only place objects in CLM if the applications which access those objects have been optimized for ccNUMA (i.e., if the applications' processes can be localized to a single locality domain); for non-optimized applications, ILM would usually be the best choice.

The memory in HP ccNUMA servers is configurable as ILM or CLM; this configuration is set at power-up and is fixed while the server is running. The server management interface can be used to view and modify the ILM/CLM configuration, with any changes taking effect at the next reboot. The default configuration depends on the particular server model: for Itanium 9300-based servers, the default is 87.5% CLM, but for all other servers, the default is 100% ILM. It is important to know the ILM/CLM configuration of your server.
A note about terminology: the earliest ccNUMA servers were all cell-based, so it's not unusual to refer to the individual locality domains as cells, which is why we have Cell-Local Memory as opposed to Domain-Local Memory (though the term Socket-Local Memory is sometimes used). Today, we use the term locality domain (or LDOM, or sometimes just locality or domain) to mean cells or sockets: any set of resources which have equal latency to a region of memory. Furthermore, we might refer specifically to a locality domain by its number: LDOM 0 through LDOM n.
Figure 3. Interleaved Memory (ILM) is spread across the Locality Domains (LDOMs); Cell-Local Memory (CLM) is contained within each LDOM. An object placed in ILM will thus be striped across the LDOMs; a process accessing it will find some accesses to be local (same LDOM), others remote. An object placed in CLM will be local for all processes in that same LDOM (but all accesses from remote LDOMs will have higher latency). In this example, P1 is accessing an object in ILM (with varying access times since one part of the object is in the same LDOM, but most of it is in remote LDOMs) and an object in the CLM in its own LDOM (where all accesses are local). When P2 accesses the object in LDOM 0, all accesses are remote.
The purpose of LORA_MODE is to provide a way of indicating to any optimized applications that the server has sufficient CLM to support those optimizations. (An application that is optimized for ccNUMA requires sufficient CLM in which to place its memory objects; without enough CLM, the optimizations would have no effect, or perhaps even a detrimental effect, on performance.) To determine the value of LORA_MODE, use the HP-UX getconf command:

getconf LORA_MODE

As will be discussed in more detail below, Oracle's most recent database version, 11gR2, checks LORA_MODE before engaging its optimizations.

An additional HP-UX 11i v3 parameter which affects ccNUMA behavior is numa_policy, which governs the allocation of memory (in CLM or ILM). The valid settings of numa_policy are:

0 (default): autosense the right policy based on the (LORA) mode in which HP-UX is operating
1 (default in LORA mode 1): honor requests from the application; otherwise place the object in the CLM of the domain where the process is most likely to run
2: override requests; always allocate in ILM if possible
3: honor requests by the application; otherwise allocate in the CLM of the closest domain, except for text/library objects, which will be placed in ILM if possible
4 (default for LORA mode 0): allocate non-LORA-intelligently (i.e., favor ILM for shared objects and CLM for private objects, but honor requests made by the application)
In short, if the application specifies the location of a newly created object, that location will generally be used unless numa_policy=2, or unless there's not enough CLM or ILM available, in which case the object will be placed in the next-best location. If the application does not specify the location of the object, it will default according to the setting of numa_policy. The HP-UX command kctune can be used to determine the current setting of either numa_mode or numa_policy:

kctune numa_mode
kctune numa_policy
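The placement rules above can be summarized in a small sketch. This is purely illustrative, not HP-UX code: the function name place_object and its arguments are hypothetical, and policy 0 is shown as if the autosensed result were policy 4.

```shell
# Hypothetical sketch of the numa_policy placement rules described above.
# policy: the numa_policy value; requested: location asked for by the
# application (ilm, clm, or none); kind: private, shared, or text.
place_object() {
  policy=$1; requested=$2; kind=$3
  case "$policy" in
    2) echo ilm ;;                       # override requests; always ILM
    1) if [ "$requested" != none ]; then echo "$requested"; else echo clm; fi ;;
    3) if [ "$requested" != none ]; then echo "$requested"
       elif [ "$kind" = text ]; then echo ilm   # text/library objects go to ILM
       else echo clm; fi ;;
    0|4) if [ "$requested" != none ]; then echo "$requested"
         elif [ "$kind" = private ]; then echo clm   # private objects favor CLM
         else echo ilm; fi ;;                        # shared objects favor ILM
  esac
}

place_object 4 none shared    # prints: ilm
place_object 1 none shared    # prints: clm
```

The sketch also shows the caveat from the text: an explicit application request wins everywhere except under policy 2.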
memory is configured as CLM. When you start the vPar for the first time, verify your CLM/ILM amounts and (for HP-UX 11i v3) the state of the LORA_MODE variable (using getconf as described above). The HP-UX machinfo command will tell you how much total memory and interleaved memory HP-UX sees:

machinfo -m

PSETs will also affect the ccNUMA characteristics seen by any applications running within them. A PSET that is wholly within a single locality domain will cause any applications running within it to behave as if they were running on a non-ccNUMA server. Likewise, if an nPar or vPar does not span multiple domains, then it will be considered a non-ccNUMA server, even if that partition is part of a larger server that itself comprises multiple domains. HP Integrity Virtual Machines are the lone exception: VM guests will always appear as a single locality domain, regardless of the layout of the underlying physical infrastructure.

One last fact about CLM and ILM with vPars: while CLM for each locality in the vPar is obviously contained within those localities, ILM will be interleaved across ALL the localities that constitute the server (or nPar) of which the vPar is a part. An example will help illustrate this fact: imagine an nPar of eight cells that contains a two-cell vPar (see Figure 4). You might expect that the ILM for this vPar would only span cells 0 and 1, but actually it will span all eight cells, so when a process in our vPar needs to access an ILM-based object, it will wind up accessing memory in cells which are not even part of our vPar. (And as you can see, some of the memory in cells 0 and 1 is allocated to ILM that will be available to objects belonging to other vPars, which means that we will have to share some of our memory bandwidth with processes from outside our own vPar.)
This last fact may surprise you: if you thought that by setting up a vPar your computing activity would be completely isolated within the vPar, you've just learned that you were wrong. Activity on every vPar DOES affect performance on all vPars because of this ILM characteristic. The performance advantages of using NUMA optimization to reduce the use of ILM (or, at the very least, of forcing objects to be placed in CLM instead of ILM) should thus be very clear. We will discuss this issue and how Oracle is affected once we've discussed Oracle's NUMA optimizations in general.
Figure 4. An example of an eight-cell nPar (or server) with a vPar consisting of cells 0 and 1. Note that while the vPar's CLM is contained completely within cells 0 and 1, the vPar's ILM is a portion of the overall ILM allocated to the entire nPar/server: it is interleaved across all the cells.
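The configuration checks mentioned above (machinfo, getconf, kctune) can be gathered into one small script. This is a convenience sketch of our own, not an HP tool; on a non-HP-UX system each command simply reports as unavailable.

```shell
# Run each HP-UX configuration query, labeling its output; if a command is
# missing or fails (e.g., on a non-HP-UX system), say so instead of aborting.
check() {
  echo "== $* =="
  "$@" 2>/dev/null || echo "(not available on this system)"
}
check machinfo -m          # total and interleaved memory seen by HP-UX
check getconf LORA_MODE    # 1 = CLM-heavy (LORA) configuration, 0 = ILM-heavy
check kctune numa_mode
check kctune numa_policy
```

Run once after the first boot of a new vPar or nPar to record its CLM/ILM state alongside the LORA_MODE and numa_policy settings.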
Oracle version 11gR2 uses a different underbar parameter: _enable_NUMA_support. This parameter is FULLY supported by Oracle (even though it starts with an underbar!). NOTE: when upgrading to 11gR2 from a previous release of Oracle, make sure to remove any references to the old parameter, _enable_NUMA_optimization. This parameter is unfortunately not ignored by Oracle 11gR2: it still affects some aspects of the optimizations and should thus be avoided. When a NUMA-optimized Oracle 11gR2 instance is started up, a message indicating the use of NUMA optimizations (NUMA system found and support enabled) is displayed on the console and in the instance's alert log. (No such indication is given for older Oracle versions.) A complete description of the particular NUMA behaviors associated with each version of Oracle Database is included as Table 3. For Oracle Database version 10.2.0.4, the optimizations are not implemented correctly and should never be used: _enable_NUMA_optimization should always be set to false (and the server should thus be configured ILM-heavy) for Oracle 10.2.0.4.
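One way to catch a leftover reference during an upgrade is a simple grep of the parameter file. This sketch is ours, not an Oracle tool; the pfile path and its contents are hypothetical placeholders.

```shell
# Create a throwaway example pfile (hypothetical contents), then scan it for
# the obsolete parameter that must be removed before starting 11gR2.
PFILE=${PFILE:-/tmp/initORCL.ora.example}
cat > "$PFILE" <<'EOF'
# example pfile (hypothetical)
_enable_NUMA_optimization=TRUE
sga_target=8G
EOF

if grep -qi '_enable_NUMA_optimization' "$PFILE"; then
  echo "WARNING: obsolete parameter found in $PFILE; remove it before starting 11gR2"
fi
```

In practice you would point PFILE at the real parameter file (and check the spfile contents as well) rather than generating an example.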
Table 3. Oracle NUMA behavior. Binding of none indicates that Oracle does not restrict the location of the process(es). Parallel Query Slaves are created and bound one per LDOM in a round-robin (RR) fashion when in NUMA mode. With NUMA optimizations off, the location of the SGA (including the DB cache) is not specified by Oracle, so it defaults to the location implied by the HP-UX parameter numa_policy. In addition, the optimizations will not be engaged if Oracle detects only one locality domain (cell or socket) when the database is started. See HP-UX 11i v3 and LORA mode, above, for the definition and default behavior of numa_policy.
[Table 3 body not reproduced: the extracted layout is unrecoverable. Recoverable column headings include the Oracle version, the NUMA parameter (_eNo/_eNs), SGA and db_cache location, log buffer location, dbwr and lgwr binding, and parallel query slave binding; recoverable cell values include none, default, engaged, unengaged, disabled, and CLM (even).]

Key: _eNo = _enable_NUMA_optimization (10gR2 and 11gR1); _eNs = _enable_NUMA_support (11gR2). "Depends on numa_policy" means either ILM, or CLM of a single LDOM. lcpu = logical CPU count (= number of cores, times 2 if hyper-threading is enabled). *10.2.0.4 only: the optimizations are broken; _eNo should ALWAYS be set to false.
Thus, while all versions of Oracle do check to make sure that the server has more than one locality domain available before engaging optimized behavior, only version 11gR2 checks whether the server is configured with adequate CLM (LORA_MODE is 1) to render the optimizations effective. If the server has no CLM and the optimizations are enabled anyway, Oracle will function normally, but the optimizations will be useless (see below). Recall that the default configuration for many HP servers is zero CLM, so it is important to explicitly configure your server with enough CLM if you plan to use Oracle's NUMA optimizations.
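For 10gR2 and 11gR1, which do not perform this check themselves, the same test can be made by hand before enabling the optimizations. A minimal sketch (on a non-HP-UX system getconf has no LORA_MODE variable, so the sketch falls back to 0):

```shell
# Replicate, by hand, the LORA_MODE check that 11gR2 performs automatically:
# only enable the pre-11gR2 NUMA optimizations on a CLM-heavy (LORA mode 1) server.
lora_mode=$(getconf LORA_MODE 2>/dev/null || echo 0)
if [ "$lora_mode" = 1 ]; then
  echo "CLM-heavy server: enabling _enable_NUMA_optimization can be effective"
else
  echo "ILM-heavy server: leave _enable_NUMA_optimization set to FALSE"
fi
```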
Since the NUMA configuration is determined only at instance startup, Oracle cannot respond to dynamic system reconfiguration. See Dynamic reconfiguration considerations with ccNUMA optimizations, below, for further discussion.
Recall that the optimizations are enabled by default on some Oracle versions but disabled by default for the latest Oracle versions. We strongly recommend that the defaults never be taken (or, at least, that the configurations be explicitly examined to make sure they are well aligned).

For NUMA-optimized behavior

When Oracle's NUMA optimizations are desired, it is critical to make sure that sufficient CLM is allocated on the system. The default configuration for HP cell-based servers (0% cell-local memory) should be changed on any multi-cell server that will host a NUMA-optimized Oracle database. To take advantage of Oracle's NUMA optimizations, configure the system with both cell-local memory and interleaved memory (ILM): Oracle uses a tiny amount of ILM even when the NUMA optimizations are engaged (and HP-UX requires some ILM as well).

When insufficient cell-local memory is available in a locality domain, the Oracle memory structures are created in interleaved memory (across all domains) and/or CLM in other localities, where they will still be accessible by Oracle's processes, but where most memory accesses will be non-local. So while the Oracle instance will function completely normally, it will be expending the overhead required to eliminate inter-domain accesses, but it cannot be successful: the overhead will be wasted, and a performance penalty will result.

How much cell-local memory should be configured on a system? The answer depends largely on two factors:

HP-UX requirements. Starting with HP-UX version 11.31 (11i v3), the operating system itself is optimized to use cell-local rather than interleaved memory, so it is important to follow HP-UX standard recommendations for configuring cell-local memory. If you are upgrading from an earlier version of HP-UX, be sure to take into account the higher cell-local memory recommendations for 11.31.

Combined memory requirements of all Oracle instances. As noted above, Oracle instances with NUMA optimizations enabled will request CLM in the amount of the SGA (System Global Area) for the instance.
A given server should be configured with CLM at least as large as the sum of all NUMA-enabled Oracle instances that will be run on that server. Additional information about CLM and ILM memory recommendations can be found in the white paper Locality-Optimized Resource Alignment (see For more information section, below). Typically, HP recommends 87.5% (7/8) of all memory be configured as CLM, to account for both OS and application needs; this is the default value for the new Itanium 9300-based Integrity servers. Moreover, the 87.5% setting is the level at which HP-UX 11.31 will set the LORA_MODE configuration variable to a value of 1, which will allow Oracle 11gR2 to engage its NUMA optimizations. In general, when setting up partitions (nPars, vPars, PSETs) which will host an Oracle database, include the minimum number of domains required to satisfy Oracles resource requirements, and keep the domains relatively well balanced (equalize memory and processors). Be sure to allocate CLM and ILM per the instructions above. When configuring nPars/vPars, consider the physical location of the I/O cards used to access that partitions devices: it would make sense to configure your partition with dedicated cores from the locality domain(s) associated with the I/O bay(s) containing those cards, in order to avoid cross-locality accesses when processing I/O interrupts. For non-NUMA-optimized behavior If your Oracle instance will be run with its NUMA optimizations disabled, Oracle will not make use of CLM at all, and will need sufficient interleaved memory (ILM) for all its data structures. In this case, be sure to configure your system to be ILM-heavy, and, if you are running an Oracle version prior to 11gR2, set _enable_NUMA_optimization to false to ensure that the optimizations are disabled. Oracle 11gR2 will properly detect ILM-heavy configurations (by detecting LORA_MODE equal to 0)
and automatically disable NUMA optimizations regardless of the value of the _enable_NUMA_support parameter.

When an unoptimized Oracle instance requests space for its SGA, the space will be allocated according to the setting of numa_policy. If numa_policy is set to favor ILM (the default for LORA_MODE 0) but the system doesn't have enough ILM, the space will be allocated from the cell-local memory of the first LDOM that has it available. Likewise, if numa_policy is set to favor CLM (the default for LORA_MODE 1), the SGA will be placed in the CLM of a single LDOM. In either case, this will result in an imbalance in the available memory distribution across the LDOMs, and HP-UX will favor the remaining LDOMs whenever new processes are created. Thus, the SGA will be in one LDOM, while most of Oracle's processes will be in the other LDOMs, practically guaranteeing that most memory accesses will be non-local. Clearly, when running Oracle with its NUMA optimizations disabled, it's important to make sure that there's plenty of ILM (and that numa_policy is set to favor ILM).

Configuring your server and your Oracle instance: summary

While a CLM/ILM configuration that does not match the NUMA optimization state of the Oracle instance will not impede Oracle's proper operation, it will result in sub-optimal performance. This implies a clear best practice when deploying Oracle instances on HP ccNUMA servers: make sure that the system's configuration and Oracle's configuration match one another. Configure BOTH the server and Oracle for ccNUMA (lots of CLM; enable Oracle's optimizations), or configure them both to operate without the optimizations (lots of ILM; disable Oracle's optimizations). The best practice for Oracle Database 11gR2 is somewhat simpler: set _enable_NUMA_support to TRUE, and then the instance will properly enable or disable the optimizations based on HP-UX's LORA_MODE setting. (As previously mentioned, do not assume that the default settings are optimal!)
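The CLM sizing rule described earlier (total CLM at least as large as the sum of the SGAs of all NUMA-enabled instances, with 87.5% as the typical CLM fraction) comes down to simple arithmetic. All figures below are illustrative, not recommendations for any particular server:

```shell
# Hypothetical sizing check: does the CLM on this server cover the combined
# SGAs of the NUMA-enabled Oracle instances it will host?
total_mem_gb=512          # total server memory (illustrative)
clm_permille=875          # 87.5% CLM, expressed per-mille to keep integer math
sga_gb_list="64 32 16"    # SGA sizes of the NUMA-enabled instances (illustrative)

clm_gb=$(( total_mem_gb * clm_permille / 1000 ))
sga_sum=0
for s in $sga_gb_list; do sga_sum=$(( sga_sum + s )); done

echo "CLM available: ${clm_gb} GB; combined SGA: ${sga_sum} GB"
if [ "$sga_sum" -le "$clm_gb" ]; then
  echo "OK: CLM covers the combined SGAs"
else
  echo "Increase CLM (or reduce SGA sizes)"
fi
```

Remember that the per-domain CLM also matters: each NUMA-enabled instance places structures in the CLM of individual localities, so the server-wide total is a necessary but not sufficient check.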
Figure 5. An eight-cell nPar (or server) with a vPar, consisting of cells 0 and 1, running a non-NUMA-optimized Oracle instance. The SGA will be located in ILM, interleaved across all eight cells (including six cells that are otherwise not part of the vPar).
Performance of both our database and of the other applications in other cells would clearly benefit from turning our Oracle Database's NUMA optimizations on, because Oracle's SGA would then be placed in the CLM of the two cells in our vPar.
Summary
Oracle's NUMA enhancements can improve performance when running database instances on HP ccNUMA servers. Since the memory configuration of the underlying server is critical to the performance of an Oracle instance, care must be exercised to ensure that the memory configuration matches the NUMA mode of the instance (sufficient CLM must be configured for a NUMA-optimized instance, whereas a non-NUMA instance requires sufficient ILM). Make it a point to override the defaults (if necessary) and match the server configuration with the Oracle configuration: decide whether you wish your Oracle instance to run with its NUMA optimizations on or off, then configure BOTH Oracle AND your server accordingly!
11gR1 (11.1.0.6 and 11.1.0.7): optimizations (_enable_NUMA_optimization) enabled by default, but Oracle recommends patch 8199533 to switch optimizations off by default

11gR2: optimizations controlled by the (supported) parameter _enable_NUMA_support (but if the underlying host's LORA_MODE parameter is set to 0, the optimizations will not be used)

11.2.0.1: optimizations disabled by default

When upgrading to 11gR2 from an earlier version, make sure to DELETE any references to the obsolete parameter _enable_NUMA_optimization to avoid potential issues.
Copyright 2009-2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. Oracle is a U.S. registered trademark of Oracle Corporation. Intel and Itanium are trademarks of Intel Corporation in the U.S. and other countries. 4AA2-4194ENW, Created January 2009; Updated September 2010, Rev. #1