Introduction
While working on one of my previous presentations, RAC Connection Management, I
realized that the topic is too broad for a single presentation and probably requires a
full-day class to cover all the details of connection management in Oracle. In a 45-60
minute presentation I was only able to cover the basics and give a brief overview of
the more advanced topics.
One of the areas that attracted most of the interest was workload balancing using new
Oracle 10g features such as Oracle Cluster Services, Fast Application Notifications
(FAN) and Load Balancing Advisory (LBA). This paper focuses on internal
implementation details and possible pitfalls rather than how-to instructions found in the
manuals. This should assist in troubleshooting and let you understand the technology
better.
It's assumed that the reader is familiar with general RAC architecture and has a basic
understanding of connection management in RAC - how connections are established
and failed over, and what the roles of the client process, listener and Oracle instance are.
Average throughput
Different applications have different efficiency criteria, and a lot depends on whether
the workload is OLTP, data warehouse, batch or reporting.
Programming languages
You might think that regardless of application development languages, database
workload management techniques would stay the same. Yes and no.
-1
Copyright The Pythian Group 2008
Oracle RAC Workload Management Alex Gorbachev
OCI - Oracle Call Interface. It's used in C and C++ applications as well as in other
languages that are built on top of the OCI layer, such as Perl and PHP.
Thick JDBC - Java libraries that are built around the OCI libraries to implement the
JDBC API standard for Java.
Native JDBC or Thin JDBC - a pure Java implementation that doesn't need the
underlying OCI layer.
ODP.NET - Oracle Data Provider for .NET. These are Oracle-provided drivers for the
Microsoft .NET environment, typically used from C# and VB.NET. It's actually based on
OCI with some additional features and .NET integration.
When we get to run-time workload balancing, the examples of run-time load
balancing and the ONS API use the thin JDBC driver. OCI and ODP.NET specifics can be
found in [BLUNDHILD].
For the purpose of this presentation you will need to distinguish the steps of the
connection process and a few basic principles.
A client connection descriptor for a RAC database takes the following form regardless
of the client driver:
(DESCRIPTION=
  (FAILOVER=ON)
  (ADDRESS_LIST=
    (LOAD_BALANCE=ON)
    (ADDRESS=(PROTOCOL=TCP)(HOST=lh1-vip.oracloid.com)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=lh2-vip.oracloid.com)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=lh3-vip.oracloid.com)(PORT=1521))
    (ADDRESS=(PROTOCOL=TCP)(HOST=lh4-vip.oracloid.com)(PORT=1521))
  )
  (CONNECT_DATA=(SERVICE_NAME=service10g.oracloid.com))
)
There are several key points you should know in order to make your application
connection descriptors RAC-aware:
Several connection points should be referenced - include all listeners of your RAC
cluster, or enough of them that they are never all down at the same time.
Use VIP addresses. See [JMORLE] for the reasons why.
Here is what happens when a client issues a connection request using the descriptor
above:
If a connection attempt fails (no listener, network timeout, or the listener doesn't know
the requested service_name), another address is chosen randomly.
James Morle in [JMORLE] concluded that client-side connection load balancing
provides a fairly uniform distribution, and my own observations confirm it.
To reiterate, client-side connection load balancing is done by the Oracle client, which
randomly chooses an address from the address list of a connection descriptor. This
causes connection requests to be distributed across all referenced listeners.
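The client-side behaviour described above can be modelled in a few lines of Python. This is only a sketch of the idea, not Oracle's implementation: the function names, the is_reachable callback and the assumption that each address is tried at most once are illustrative; the host names are the VIP aliases used in this paper.

```python
import random
from collections import Counter

# VIP aliases taken from the example descriptor; everything else is illustrative.
ADDRESSES = ["lh1-vip", "lh2-vip", "lh3-vip", "lh4-vip"]


def connect(addresses, is_reachable, rng=random):
    """Return the first address that accepts the connection.

    LOAD_BALANCE=ON is modelled by trying the addresses in random order,
    FAILOVER=ON by moving on to the next address after a failed attempt.
    """
    candidates = list(addresses)
    rng.shuffle(candidates)
    for address in candidates:
        if is_reachable(address):
            return address
    raise ConnectionError("no listener accepted the connection request")


def distribution(trials, rng):
    """Count which listener receives each of `trials` connection requests."""
    return Counter(
        connect(ADDRESSES, lambda address: True, rng) for _ in range(trials)
    )
```

Calling distribution(4000, random.Random(0)) should give each listener roughly a quarter of the requests, which matches the near-uniform spread reported above.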
Now, depending on its configuration, the listener itself can forward a connection
request to any of the instances providing the requested service if remote listener
configuration is in place. More on that later, when we discuss Oracle Cluster
Services.
1. Each listener is only aware of the instances and, consequently, the services available
on the node that the listener is running on. This is called local-only listener
registration. An Oracle instance reports only to the locally running listener. When the
listener receives a connection request, the connection can only be established with the
local database instance. If the local instances don't provide the requested service, the
listener returns an error saying it is not aware of any instance providing that service.
The connection refusal causes the client to try another address thanks to FAILOVER=ON.
With *only* local listener registration, client-side connection load balancing is the
only balancing in place.
The best way to understand Oracle Cluster Services concept and capabilities is to use
an example.
Let's take a simple example of an online order processing system. The most important
business function is taking customers' orders. It has direct end-user impact, and if the
system is not available to take new orders, it's a direct hit to the revenue stream.
Another function is web content display, which doesn't hit the database directly but
is served from an application cache refreshed asynchronously from the database. Visitors
can also leave feedback on orders and items, which is more critical than content refresh
as an outage has direct end-user impact. However, it's not as important for the business
as taking orders. There is a separate order processing application - a back-end application that is
used by internal users for order handling and shipment. The last piece is a data
extraction process for a corporate financial application and a few other batches.
What we can do is create several services in the Oracle database that are mapped to the
application functions we identified. Based on the figures from capacity planning, we
know how much CPU capacity we need for each component, so it's possible to arrange the
workload with affinity to certain nodes to minimize RAC overhead due to Cache Fusion.
When services are created in the Oracle database, we can specify which instances provide
a particular service by default. Those are called preferred instances. We can have several
instances providing the same service as well as several services provided by the same
instance.
We can also define where the service can run in case one of the preferred instances is
not available. Those are potential (or backup) instances. In Oracle Cluster Services
terminology, these instances are available to run the service.
For our example, we create NEW_ORD service on all instances and make DB1 and
DB2 instances preferred, while DB3 and DB4 are available to take over. CONTENT
service requires most of the resources and runs on 3 nodes and has the fourth node as
available. Order processing and batch data extraction services are concentrated on one
node to avoid impact on more sensitive business functions.
[Figure: services NEW_ORD, PROC_ORD and BATCH placed across instances DB1, DB2, DB3 and DB4]
Oracle Clusterware will automatically bring service up on one or more of the available
instances in case one or more preferred instances become unavailable. If instance DB1
becomes unavailable, Oracle will shift NEW_ORD to one of DB3 or DB4 and bring
CONTENT service up on DB4.
Which component of the Oracle RAC software stack controls service placement? Prior to
Oracle 10g, a DBA could manually control the services provided by each instance using the
SERVICE_NAMES init.ora parameter. However, there was no automation and no notion of
preferred and available instances. 10g still provides this possibility but, with the
introduction of Clusterware, a DBA can now create CRS resources for services. Preferred and
available instances are defined in Clusterware, and the CRS component is responsible for
monitoring service availability and failing services over as required.
The rules of service failover are very simplistic. For example, Oracle Clusterware doesn't
automatically shift services back when a preferred instance comes back to the cluster.
The server-side FAN callout mechanism allows us to implement more complex algorithms.
For example, we can implement a rule that stops the BATCH service in case the
CONTENT or NEW_ORD services are running on the DB4 instance.
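To make the preferred/available rules concrete, here is a toy Python model of the behaviour described above. It is a sketch under stated assumptions, not Clusterware logic: the class and method names are made up, and picking the first usable available instance is an arbitrary choice (the text only says that one of them is used).

```python
class ServicePlacement:
    """Toy model of a service with preferred and available instances."""

    def __init__(self, preferred, available):
        self.preferred = list(preferred)
        self.available = list(available)
        self.running = list(preferred)  # the service starts on its preferred instances
        self.down = set()

    def instance_down(self, instance):
        """An instance fails: move the service to a spare, if it was running there."""
        self.down.add(instance)
        if instance not in self.running:
            return
        self.running.remove(instance)
        for spare in self.available:
            if spare not in self.down and spare not in self.running:
                self.running.append(spare)  # assumption: first usable spare wins
                break

    def instance_up(self, instance):
        """An instance returns: there is no automatic fail-back of the service."""
        self.down.discard(instance)


new_ord = ServicePlacement(preferred=["DB1", "DB2"], available=["DB3", "DB4"])
new_ord.instance_down("DB1")  # NEW_ORD moves to one of the available instances
new_ord.instance_up("DB1")    # ...and stays there when DB1 comes back
```

The last two calls show the limitation mentioned above: after DB1 returns, the service keeps running where it failed over to until something (a DBA or a callout) relocates it.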
Node affinity
As I mentioned already, we can arrange services in a way that minimizes the Cache
Fusion impact in RAC. Let's say that the online order processing application wasn't
designed very well, just like in real life. Scaling an ill-designed application by moving it to
RAC is one of the most disastrous projects I can imagine.
Luckily, three functional areas (NEW_ORD, CONTENT, FEEDBACK) work mostly with
their own subsets of database tables with very light overlap. If each of the three services
can be satisfied with the capacity of a single node, we can distribute the services in the
following way:
[Figure: services NEW_ORD, CONTENT, FEEDBACK, PROC_ORD and BATCH placed across instances DB1, DB2, DB3 and DB4]
If instance DB1 becomes unavailable, Oracle Clusterware will move the NEW_ORD service
to the DB4 instance, which is available for the NEW_ORD service. However, the Order
Processing back-end application and batches are running on that node, and that would
negatively impact the response time of new order placement. One workaround is manual
intervention: a DBA can stop the BATCH and PROC_ORD services on the DB4 instance since
they are not critical, but it might take a while for the DBA to react.
What if we had a mechanism to automate this manual DBA response? This is where
Server-Side FAN Callouts come into play.
We are going to cover FAN events in more detail soon, but for now it suffices to know
that Oracle Clusterware keeps track of certain events on the database server, including
instances failing, starting or stopping and services coming up and going down on each
node.
Here is an example. I will just use 3 nodes and skip the FEEDBACK service. First, let's
create 4 services - new_ord, content, proc_ord and batch:
My database name is g4 and I have 3 instances - g41, g42 and g43. Let's start them
and check the status:
Service new_ord down for instance g41 on node lh1; reason - failure (instance is gone)
Service new_ord up for instance g43 on node lh3; reason - failure (relocated from lh1)
The result is that instance g43 hosts the critical new_ord service as well as 2 other
heavyweights - proc_ord and batch. What we want for our automated response is to stop
the proc_ord and batch services in case either the new_ord or the content service is
started on instance g43 as the result of a failure and not a direct user request. Indeed,
we still want to be able to manipulate services manually without automation kicking in.
How do I know what events are triggered? Besides the reference ([ORACADG] Chapter
4, section Fast Application Notification High Availability Events), we can create a simple
FAN callout script to log all events:
#!/bin/sh
FAN_LOGFILE=/nfs1/oracle/product/10.2.0/crs/log/`hostname`/racg/fan_ha_events.log
echo "$* 'reported='`date`" >> $FAN_LOGFILE &
and on lh3.oracloid.com:
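#!/bin/sh
ORA_CRS_HOME=/nfs1/oracle/oracle/product/10.2.0/crs
SRVCTL=$ORA_CRS_HOME/bin/srvctl
LOG=$ORA_CRS_HOME/log/service_rebalance.log
EVENTTYPE=$1
# Split each name=value argument of the FAN event into shell variables
for ARGS in $* ; do
  PROPERTY=`echo $ARGS | awk -F"=" '{print $1}'`
  VALUE=`echo $ARGS | awk -F"=" '{print $2}'`
  case $PROPERTY in
    VERSION|version) VERSION=$VALUE ;;
    SERVICE|service) SERVICE=$VALUE ;;
    DATABASE|database) DATABASE=$VALUE ;;
    INSTANCE|instance) INSTANCE=$VALUE ;;
    HOST|host) HOST=$VALUE ;;
    STATUS|status) STATUS=$VALUE ;;
    REASON|reason) REASON=$VALUE ;;
    CARD|card) CARDINALITY=$VALUE ;;
    TIMESTAMP|timestamp) LOGDATE=$VALUE ;;
    # The time part of the timestamp arrives as a separate argument
    ??:??:??) LOGTIME=$PROPERTY ;;
  esac
done
# NOTE: the condition and commands of this block are missing from this copy of the
# paper; they are reconstructed from the description and the log output that follow.
if [ "$STATUS" = "up" ] && [ "$REASON" = "failure" ] && [ "$INSTANCE" = "g43" ] ; then
  case "$SERVICE" in
    new_ord|content)
      echo "`hostname` `date`: Service $SERVICE is up on instance $INSTANCE due to a failure. Stopping proc_ord and batch services..." >> $LOG
      $SRVCTL stop service -d g4 -i g43 -s proc_ord >> $LOG 2>&1
      $SRVCTL stop service -d g4 -i g43 -s batch >> $LOG 2>&1
      $SRVCTL status service -d g4 >> $LOG 2>&1
      ;;
  esac
fi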
The script parses the arguments and checks for matching events. If the event reports that
one of the services new_ord or content is up on instance g43 as the result of a failure, it
stops the proc_ord and batch services. Here is the result from the log (formatted for
readability):
lh1.oracloid.com Fri Feb 8 13:22:46 EST 2008: Service new_ord is up
on instance g43 due to a failure. Stopping proc_ord and batch services...
/nfs1/oracle/oracle/product/10.2.0/crs/bin/srvctl stop service -d g4 -i g43 -s proc_ord
/nfs1/oracle/oracle/product/10.2.0/crs/bin/srvctl stop service -d g4 -i g43 -s batch
/nfs1/oracle/oracle/product/10.2.0/crs/bin/srvctl status service -d g4
Service service10g is not running.
Service new_ord is running on instance(s) g43
Service content is running on instance(s) g42
Service proc_ord is not running.
Service batch is not running.
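The argument handling and the decision rule of the callout above can also be sketched in Python, which makes them easy to test off-line. The property names mirror those handled by the shell script; the sample argument list is an assumed example of what a callout receives, not a captured event, and the function names are illustrative.

```python
def parse_fan_event(argv):
    """Turn FAN callout arguments into a dict keyed by lower-case property name."""
    event = {"event_type": argv[0]}
    for token in argv[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            event[key.lower()] = value
        elif len(token) == 8 and token[2] == ":" and token[5] == ":":
            # Same idea as the ??:??:?? pattern in the shell script: the time
            # part of the timestamp arrives as a separate argument.
            event["time"] = token
    return event


def should_stop_low_priority(event, instance="g43"):
    """True when a critical service came up on `instance` because of a failure."""
    return (
        event.get("status") == "up"
        and event.get("reason") == "failure"
        and event.get("instance") == instance
        and event.get("service") in ("new_ord", "content")
    )


# Assumed example payload; real events may carry more or different properties.
SAMPLE_ARGV = [
    "SERVICEMEMBER", "VERSION=1.0", "service=new_ord", "database=g4",
    "instance=g43", "host=lh3", "status=up", "reason=failure", "card=1",
    "timestamp=2008-02-08", "13:22:46",
]
```

Keeping the rule in a separate predicate such as should_stop_low_priority makes it clear that a manual start (reason other than failure) does not trigger the automation.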
You can also use server-side FAN callouts to relocate services back to the preferred
instances as soon as they become available. You can find this and some other examples
in [OTNSAMPLE].
Is there a way to avoid such relatively complex manipulations? This example has only
three nodes and four services, but imagine a ten-node cluster with 42 services. That
gets more complex than a chess game, as one of my customers said. This is when
Resource Manager comes in handy.
Resource Manager is an Oracle feature that allows a DBA to prioritize workload within
an Oracle database instance. The purpose of Resource Manager is to favor more critical
business functions when resources are scarce.
BEGIN
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
    (DBMS_RESOURCE_MANAGER.SERVICE_NAME, 'NEW_ORD', 'NEW_ORD_GROUP');
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
    (DBMS_RESOURCE_MANAGER.SERVICE_NAME, 'CONTENT', 'CONTENT_GROUP');
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
    (DBMS_RESOURCE_MANAGER.SERVICE_NAME, 'FEEDBACK', 'FEEDBACK_GROUP');
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
    (DBMS_RESOURCE_MANAGER.SERVICE_NAME, 'PROC_ORD', 'PROC_ORD_GROUP');
  DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING
    (DBMS_RESOURCE_MANAGER.SERVICE_NAME, 'BATCH', 'BATCH_GROUP');
END;
Using the Resource Manager API, we should configure resource allocation for those groups
with priorities that reflect the business impact. That is, NEW_ORD_GROUP and
FEEDBACK_GROUP would have the highest priority, followed by CONTENT_GROUP
and then by PROC_ORD_GROUP and BATCH_GROUP. For more details on consumer
group mapping see [ODAG], section Specifying Session-to-Consumer Group Mapping
Rules.
Please note that Resource Manager is not RAC-aware, meaning that it can only control
resource allocation between consumer groups on each instance in isolation and not
across the whole cluster. Another limitation is that Resource Manager works inside one
instance only, so if you have two or more instances on the node, it won't be able to take
that into account.
As you probably know, Oracle 10g has a new set of features known under the
common name Automatic Workload Repository (AWR). In a nutshell, I think I can safely
say that AWR is a hybrid of Statspack and 10046 trace. AWR captures a lot of
statistics at a very fine granularity. AWR also provides several aggregated views of
this data, and one of them is the service-aggregated perspective.
Here are some of the Oracle views with service-aggregated performance data:
V$SERVICE_EVENT
V$SERVICE_STATS
V$SERVICE_WAIT_CLASS
V$SERVICEMETRIC
V$SERVICEMETRIC_HISTORY
DBA_HIST_SERVICE_%
How does the application know where the service is running now? Well, the answer is - it
doesn't. Recall that client-side connection load balancing picks a listener randomly.
What happens next is that the listener directs the connection request to one of the
instances providing the requested service, so each listener should know which services are
available on each node of the cluster.
PMON will notify the listeners when services become available on its instance. Jumping
a little forward, I should say that listeners also subscribe to HA events and can quickly
clean up services from failed instances so that new connection requests are not routed
there.
We can use lsnrctl status (or lsnrctl service for more details) to display
information about services that listener is aware of:
We can see from the output above that the listener on the lh3 node is aware of all
services, both local (proc_ord and batch) and remote (new_ord and content).
There is also a default service with the same name as DB_NAME, g4, that is up on all
nodes by default. We normally shouldn't touch or modify this service, so if
some of your workload must be distributed across all instances of the cluster, don't use
the default service but rather create a new one with all instances as preferred.
At this point it's probably appropriate to ask a very important question - how does the
listener decide which instance to assign to a new connection if more than one
is available for the requested service?
LISTENER_LH1 =
(ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
LISTENER_LH2 =
(ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
LISTENER_LH3 =
(ADDRESS = (PROTOCOL = TCP)(HOST = lh3-vip)(PORT = 1521))
LISTENERS_LH =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = lh1-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = lh2-vip)(PORT = 1521))
(ADDRESS = (PROTOCOL = TCP)(HOST = lh3-vip)(PORT = 1521))
)
When everything is configured correctly, you should see the following as part of the
lsnrctl service command output:
Below is a schematic two-node example of remote listener registration.
[Figure: listeners LISTENER_LH41 and LISTENER_LH42 with instances G41 and G42 on nodes lh1 and lh2]
Client nodes must be able to resolve the IP aliases lh1-vip, lh2-vip and lh3-vip exactly as
they are stated in the tnsnames.ora descriptors used for local_listener and remote_listener,
even if client connection strings use IP addresses or other aliases, perhaps including the
domain name. This is important because server-side connection load balancing causes
connection requests to be redirected using those aliases.
Note that DBCA in Oracle 10g Release 2 doesn't set local_listener, and this causes local
host names to be used for connections that are redirected to remote instances.
Consequently, your virtual IPs are not used, which impacts connection failover capabilities.
Prior to Oracle Database 10g Release 2, the listener was only able to make routing
decisions based on host load or instance load. In fact, this is still the case with 10g
Release 2 and 11g when, for whatever reason, the listener doesn't have the load
information for a service. This is why it still makes sense to discuss this mechanism.
[Figure: listener LISTENER_G44 deciding which instance should receive a new connection]
By enabling listener trace at USER level, we can take a peek at the information that
the listener gets. Here is an example:
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
Opened trace file:
/nfs1/oracle/oracle/product/10.2.0/db_1/network/trace/listener_lh1.trc
The command completed successfully
Note that if there are no connects to and disconnects from the instances then you might
need to wait for a while. Alternatively, you can simply connect/disconnect in another
session and this will cause PMON to send updates to the listener.
What we see in the trace is two pairs of load data (ld) and max load data (mld) -
ld1/mld1 + ld2/mld2. One pair represents node load and another pair - instance load.
Which one is which? That depends on the listener configuration.
DBAs have often been advised to set the parameter
PREFER_LEAST_LOADED_NODE_<LISTENER>=OFF in the listener.ora file. This is suggested as a
solution to uneven connection distribution amongst RAC instances.
Standard listener behavior is to send connections to the least loaded node. Thus, spikes
in CPU consumption can really screw up connection distribution, which has a very
negative impact on applications with persistent (or relatively long-living) connections.
It turned out that the internal implementation of this parameter is very simple. When
PREFER_LEAST_LOADED_NODE=ON or is not defined (ON is the default), the listener treats the
ld1/mld1 pair as node load and ld2/mld2 as instance load. When the parameter is set to OFF,
the listener swaps the load data: ld1/mld1 represents instance load and ld2/mld2 node load.
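The swap described above can be written down as a tiny Python helper. It only restates the observation from the trace; the function name and the returned dictionary layout are illustrative.

```python
def interpret_load(ld1, mld1, ld2, mld2, prefer_least_loaded_node=True):
    """Label the two (load, max load) pairs seen in the listener trace.

    With PREFER_LEAST_LOADED_NODE=ON (the default) the first pair is node
    load and the second is instance load; with OFF the two are swapped.
    """
    first, second = (ld1, mld1), (ld2, mld2)
    if prefer_least_loaded_node:
        return {"node": first, "instance": second}
    return {"node": second, "instance": first}
```

For example, with the parameter set to OFF, a pair such as (5, 171) in the ld1/mld1 position would be read as 5 sessions out of a maximum of 171.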
So what are those numbers exactly? Instance load, as you probably already figured out,
is number of user connections. You can validate it querying GV$SESSION:
INST_ID COUNT(*)
---------- ----------
1 5
2 7
3 5
This is what ld1 or ld2 is set to. Note that querying GV$ views spawns parallel slaves
on each instance, so the query always returns a session count that is higher by one.
What about maximum instance load - mld? It's simply the maximum number of sessions
per instance - the sessions init.ora parameter:
What about node load? Maximum node load is based on the number of CPUs. On Linux
and Solaris SPARC I observed that it's calculated as number of CPUs * 5120. PMON
calculates it based on the cpu_count init.ora parameter. I would expect it to be the same
on all platforms.
Current node load seems to be calculated based on the run queue state. It correlates
very well with the 1-minute load average. I also traced system calls of the PMON process
on Linux, and it reads the /proc/loadavg file, which represents load averages and the
current run-queue state. It seems that the formula is roughly something like the following,
but there is a slight discrepancy:
Back to the listener now. As far as I could see, the algorithm by which the listener picks
the instance is unchanged regardless of the PREFER_LEAST_LOADED_NODE parameter. It first
evaluates instances based on the ld1/mld1 pairs and, if there are two instances with the
same attractiveness, the listener then compares the ld2/mld2 pairs.
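The two-step comparison described above can be sketched as follows. Note the assumption: the paper does not spell out how "attractiveness" is computed from a pair, so this sketch uses the ratio of load to maximum load; the instance names and numbers are illustrative.

```python
def pick_instance(candidates):
    """Pick the most attractive instance.

    `candidates` maps an instance name to (ld1, mld1, ld2, mld2).
    Assumption: attractiveness is the ratio ld/mld, lower is better.
    The ld1/mld1 ratio is compared first; ld2/mld2 only breaks ties.
    """
    def attractiveness(item):
        _name, (ld1, mld1, ld2, mld2) = item
        return (ld1 / mld1, ld2 / mld2)

    return min(candidates.items(), key=attractiveness)[0]


# Same first pair on both instances, so the second pair decides.
tie_on_first_pair = {
    "g41": (100, 10240, 5, 171),
    "g42": (100, 10240, 7, 172),
}
```

Because the first pair always dominates, whichever load is placed in ld1/mld1 by PREFER_LEAST_LOADED_NODE effectively drives the routing decision.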
Unfortunately, it's not possible to tell from the trace which instance reports a
particular set of load data. There is one useful trick that we can employ. We can set the
sessions parameter on each node so that it differs by one session - it wouldn't make any
noticeable difference, yet it lets us distinguish the load updates of each instance. I set
the sessions parameter to 171, 172 and 173 for g41, g42 and g43 respectively, and I can
clearly attribute the load updates now:
PMON is notified on every connect and disconnect because it tracks all the sessions
and needs to be aware of them to perform its duties. Since opening and closing
connections changes the instance load, PMON, on waking up, will notify the listeners as
well.
There is a way to trace the PMON process as well - event 10257 trace name context
forever, level 16.
The trace above is from 11g, and for now we need to make a note of the two lines that are
marked bold. This is what is reported as ld1 and ld2 in the listener trace (depending on
the PREFER_LEAST_LOADED_NODE parameter). This is another way to observe how each
instance reports node and instance load. We will see later that the PMON trace contains
other useful stuff.
To summarize, PMON reports node load (based on load average/run queue) and
instance load (count of USER sessions) to the listener, as well as maximum instance
load (sessions init.ora parameter) and maximum node load (cpu_count*5120). Depending
on PREFER_LEAST_LOADED_NODE, the listener uses this load information to determine
where to route the connection - to the least loaded node or to the instance with the
fewest USER sessions.
classification in the next section which discusses the latest connection load balancing
approach, which is valid when using services configuration properly.
With the introduction of services, the listener gained the ability to balance connections
based on service statistics. This is where AWR comes into play with service-aggregated
metrics. Note that this only appeared in Oracle Database 10g Release 2.
Oracle provides two distinct options for service-based connection balancing - for long
sessions and for short sessions.
In order to manage workload efficiently, it's very important to know your application's
connection life cycle. This will greatly influence the approach to connection
management. Usually, connection types fall into one of the following categories:
Short-living sessions, when the application connects for each transaction, does some
work and then disconnects. This is probably the most inefficient way of connecting to
the database. Unfortunately, some application stacks dictate this approach. PHP or
Perl web applications with Apache are the infamous example.
Oracle provides standard connection pooling features in all of its drivers and
they are typically very flexible and powerful.
Long connections include ones from a connection pool and, especially, persistent
connections. The required characteristic is that the connection stays idle for a significant
part of its lifetime. This fact is often overlooked, but it's very important to take into
account, as you will see later when we discuss real-time load balancing.
Short-living connections are the ones that connect to the database, do some work and
disconnect. These are the connections that do some work for most of the time they are
connected.
Note that long sessions that are active almost 100% of their connection time, such as
batches, are much closer to short-living connections than to long connections (in terms
of workload management). A data warehouse environment that connects for each
request/report should typically be considered as short connections as well, even though
a connection might span several minutes.
Sometimes the choice between long and short connections is obvious, whereas in other
cases it might be tricky, so it's very useful to understand the key internals of connection
load balancing and to be able to track down what's going on under the hood.
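The classification above boils down to how much of its lifetime a connection spends idle. A rough Python sketch of that rule of thumb follows; the function name and the 50% idle threshold are purely illustrative assumptions, not an Oracle recommendation.

```python
def suggest_clb_goal(connected_seconds, active_seconds, idle_threshold=0.5):
    """Suggest a connection load balancing goal from a connection's idle share.

    Connections that are idle for a significant part of their lifetime are
    treated as long; connections that are busy most of the time they are
    connected (including batches and per-report sessions) as short.
    """
    if connected_seconds <= 0:
        raise ValueError("connected_seconds must be positive")
    idle_fraction = 1 - active_seconds / connected_seconds
    if idle_fraction >= idle_threshold:
        return "CLB_GOAL_LONG"
    return "CLB_GOAL_SHORT"
```

Under these assumptions a pooled connection that is open for an hour but busy for two minutes comes out as long, while a batch that is busy for nearly the whole hour comes out as short, in line with the note above.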
Each service has an attribute CLB_GOAL (Connection Load Balancing goal), which can
take the value CLB_GOAL_LONG or CLB_GOAL_SHORT (constants in the DBMS_SERVICE
package). It can be specified when the service is created or modified using the
DBMS_SERVICE package. To see the current setting, query the DBA_SERVICES view, which
has the column CLB_GOAL.
Again, unfortunately, there is no notion of instance in the trace. Note that there are two
characteristics reported per service. We can get a more meaningful result from PMON by
setting the already familiar event 10257 at level 16:
In the PMON trace we see service goodness - it's the same value that is reported in the
listener trace line what:2. What is service goodness then?
OK. What about the second value - the one reported on the line what:4 in the listener
trace? Apparently it's the goodness delta. Delta is how much goodness is expected to
change when we add one more connection. Typically, it would be a positive value since
adding one more connection to the instance should only decrease its attractiveness, which
should result in a higher value of goodness.
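One way to picture how goodness and delta could work together is the following Python sketch. It rests on an assumption that is not confirmed by the traces: that between two PMON updates the listener adds delta to its own copy of goodness for every connection it routes. The function name and the sample values are illustrative.

```python
def route_connections(goodness, delta, count):
    """Route `count` connections, each to the instance with the lowest goodness.

    Assumption: after each routed connection the listener bumps its local
    copy of that instance's goodness by the instance's delta, so a burst of
    connections is spread out even before PMON sends the next update.
    """
    current = dict(goodness)  # work on a copy, keep the reported values intact
    routed = []
    for _ in range(count):
        target = min(current, key=current.get)  # lower goodness is more attractive
        routed.append(target)
        current[target] += delta[target]
    return routed


# Illustrative values: goodness as a session count, delta of one per connection.
reported_goodness = {"g41": 50, "g42": 51}
reported_delta = {"g41": 1, "g42": 1}
```

With these values, three new connections end up split two to one in favour of g41, which leaves both instances with the same goodness afterwards.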
Now that we know how to spy on the listener and PMON, it's a good time to answer the
question - how are the goodness and delta calculated? And the answer is - it depends!
INST_ID COUNT(*)
---------- ----------
1 50
1 51
What about the delta value? Well, if goodness is represented as the number of sessions,
then adding one more session will increase the goodness value by one, so delta is constant
and equal to 1 for CLB_GOAL_LONG. Delta never changes, so that's probably why PMON
doesn't post it in the trace. We can see it in the listener trace when the service is first
registered on service up:
Since service goodness is reported together with node/instance load, we can use the
same trick to distinguish which lines correspond to which instance - goodness and delta
seem to follow instance and node load:
The trick with different numbers of sessions seems to work for simple cases, but I noticed
it can get out of sync when there are many services and service updates.
Service metrics are tracked over one-minute and five-second intervals, so for each
service we get 2 rows per instance in the V$SERVICEMETRIC view - one for the minute
interval and one for the five-second interval. Interval boundaries are in the columns
BEGIN_TIME and END_TIME, and the interval length in centiseconds is in INTSIZE_CSEC.
There are the self-explanatory columns GOODNESS and DELTA, which we know already. What's
more interesting is another 4 columns that represent the average number of calls per
second, the amount of CPU per call, and DB Time per second and per call:
CPUPERCALL
DBTIMEPERCALL
CALLSPERSEC
DBTIMEPERSEC
These are average values for respective interval. Note that cumulative counters for
consumed CPU and DB Time as well as other statistics are available in
V$SERVICE_STATS.
It's now time to introduce the Load Balancing Advisory (LBA). Usually, the LBA goal is
mentioned in the context of real-time load balancing, and it's not widely known that LBA
is also used by server-side connection load balancing for short connections (with
CLB_GOAL_SHORT).
There is a nasty bug that I couldn't identify for a while - see [MLN5593693.8]. It kept
screwing up my experiments and produced inconsistent results on some systems. The
bug is only noticeable when the db_domain init.ora parameter is not empty. Here is what
happens.
What happens now is that somewhere between PMON and a listener (I suspect it's the
listener process), a mismatch occurs and the listener doesn't account for updates sent for
that service without a domain. There is a bit of speculation here so, perhaps, the root
problem is somewhere else, but this assumption explains the erroneous behavior very
well so far. Metalink Note 5593693.8 says that this happens when the service
network_name is not fully qualified and db_domain is non-null.
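The suspected mismatch can be illustrated with a few lines of Python. This only models the hypothesis stated above (a service registered under its fully qualified name, with metric updates arriving under the plain name); it is not a description of the actual listener code, and the function names are made up.

```python
def registered_name(service_name, db_domain):
    """Name under which the service is assumed to be registered with the listener."""
    if "." in service_name or not db_domain:
        return service_name
    return f"{service_name}.{db_domain}"


def update_matches(update_name, service_name, db_domain):
    """True if a metric update sent under `update_name` matches the registration."""
    return update_name == registered_name(service_name, db_domain)
```

Under this model, update_matches("new_ord", "new_ord", "oracloid.com") is false because the registration carries the domain, whereas the same call with an empty db_domain matches, which is consistent with the bug only showing up when db_domain is set.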
In this situation, the listener falls back to the old connection load balancing principle
based on instance and node load data. I think this is why I often see people still messing
with PREFER_LEAST_LOADED_NODE even in Oracle Database 10g Release 2. On the other
hand, I saw many complaints that PREFER_LEAST_LOADED_NODE doesn't work
anymore. Both are wrong... or not.
It seems that there is a simple workaround - creating fully qualified services with a
domain. Indeed, it works very well when the service is created by the DBMS_SERVICE
package with the domain explicitly specified. However, the DBMS_SERVICE package doesn't
create CRS resources in a RAC environment. That's why srvctl must be used to create
cluster services for a RAC database, and later they can be modified using
DBMS_SERVICE.MODIFY_SERVICE.
Unfortunately, srvctl refuses to create a new service with domain matching default
database domain - instead, srvctl requires creating a service with empty domain so
that default domain is added during listener registration. The problem is that CRS now
uses plain service name (without domain) in the database and that puts us back to
square one - service names mismatch.
There is one exception I saw in this situation - if the database is created with db_domain
set then the default service based on db_name and db_domain is created with a fully
qualified name. In my situation, the default service g4.oracloid.com was stored in the
database as a fully qualified name including domain. However, this service is not
controlled by CRS and shouldn't be touched.
Both the old and the new Connection Load Balancing algorithms have inherent capabilities
to balance connections across nodes of different capacity.
The node load maximum is based on the number of CPUs. Instance load can be accounted for
by setting the SESSIONS init.ora parameter according to the node capacity, be it CPU or
memory.
Service Metrics based load balancing is a bit more complex and, as usual, it's not bug
free in the first few releases - see [MLN6613950.8]. Apparently, service goodness
calculations do not always account for different node capacity. Metalink Note 6613950.8
identifies the bug but doesn't provide much detailed information. It says that all current
versions are believed to be affected (<11.2). On the other hand, the fix is included in
the 10.2.0.4 patchset. One-off patches are available for the 10.2.0.2 and 10.2.0.3 patchsets.
However, it might be a dirty fix instead of a conceptual resolution.
From here we will move to details of Load Balancing Advisory and how it works with
run-time workload balancing. Before we do that, we need to review Oracle 10g Fast
Application Notifications mechanism.
This program has its roots in Oracle Application Server as far as my familiarity with it
goes. It's a simple daemon providing functionality of exchanging events based on a
publish and subscribe mechanism.
HA Events
Oracle Database 10g Release 1 uses the ONS daemon as a messenger for High
Availability (HA) events. These include events triggered on a node coming up and down,
instance failures, a service going up and down and so on. The purpose of HA events is to
support Fast Connection Failover (FCF) which we do not cover here in detail. We already had
a chance to deal with FAN HA events earlier during the example of server-side FAN
callouts. The listener, for example, also subscribes to some events to speed up service
availability updates on instance failure.
FAN HA events are passed to ONS (published) and the duty of ONS is to forward each event to
all subscribers that are interested in that particular event. In addition, the event is
passed to other ONS daemons registered as remote ONS servers. In a RAC environment, ONS
daemons on each node are aware of each other. In addition, ONS servers running on
application tiers (particularly important for Oracle Application Servers) should be
registered as remote ONS as well.
See [BLUNDHILD] for more details about Fast Application Notifications, Fast
Connection Failover and ONS configuration.
Oracle Database 10g Release 2 extended the FAN framework by adding Load Balancing
Advisory events. Every 30 seconds, a CRS daemon racgimon gets LBA info from
database instances and passes it to the ONS daemon that pushes it further to
subscribers that are interested in LBA events (sometimes also referred to as Service
Metrics events).
LBA events are only generated for services with LBA goal set to GOAL_SERVICE_TIME
or GOAL_THROUGHPUT. Default value, GOAL_NONE, disables service metrics events.
It's very handy to tail this log file to see the LBA trend in the terminal. The percent value
represents a recommendation of what part of the work for that service should flow to a
particular instance. As you noticed already, the percentages add up to 100% as one would
expect (give or take a point of rounding).
The flag denotes service metrics status for a service on a particular node. The GOOD flag
means the advisory is available and is up to date, NO_DATA means missing data for that
instance/service and the UNKNOWN flag is usually posted when no connections have been
made and no workload has been processed on a given instance.
USER_DATA(SRV, PAYLOAD)
--------------------------------------------------------------------------------
SYS$RLBTYP('service10g', 'VERSION=1.0 database=g4 service=service10g { {instance
=g42 percent=34 flag=GOOD}{instance=g43 percent=33 flag=GOOD}{instance=g41 perce
nt=34 flag=GOOD} } timestamp=2008-02-11 03:43:52')
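The payload above is plain text, so a custom consumer can pull the per-instance advice out of it with a regular expression. The following is only a minimal sketch: the class and method names (LbaPayloadParser, Advice, parse) are hypothetical, and it assumes the payload always follows the {instance=... percent=... flag=...} layout shown in the output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts per-instance advice from an LBA payload string. */
final class LbaPayloadParser {

    /** One {instance=... percent=... flag=...} entry from the payload. */
    static final class Advice {
        final int percent;
        final String flag;

        Advice(int percent, String flag) {
            this.percent = percent;
            this.flag = flag;
        }
    }

    // Matches entries such as {instance=g42 percent=34 flag=GOOD}
    private static final Pattern ENTRY = Pattern.compile(
            "\\{\\s*instance=(\\S+)\\s+percent=(\\d+)\\s+flag=(\\w+)\\s*\\}");

    /** Returns instance name -> advice, in the order the entries appear. */
    static Map<String, Advice> parse(String payload) {
        Map<String, Advice> result = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(payload);
        while (m.find()) {
            result.put(m.group(1),
                    new Advice(Integer.parseInt(m.group(2)), m.group(3)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Same text as the payload above, with the terminal line wrapping removed.
        String payload = "VERSION=1.0 database=g4 service=service10g { "
                + "{instance=g42 percent=34 flag=GOOD}"
                + "{instance=g43 percent=33 flag=GOOD}"
                + "{instance=g41 percent=34 flag=GOOD} } "
                + "timestamp=2008-02-11 03:43:52";
        for (Map.Entry<String, Advice> e : parse(payload).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().percent
                    + "% " + e.getValue().flag);
        }
    }
}
```

A LinkedHashMap is used so that the instances come back in the order they were listed in the event.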
Detailed case studies and benchmarks of different advisory settings can be found in the
paper presented via web-cast on IOUG RAC SIG by Naoko Kanemoto Kurinami - see
[NAOKO].
The bulk of the work has already been done by Oracle and, in many cases, it's a matter of
a few configuration changes and a few additional lines of code. Unfortunately, out of the
box, run-time load balancing is only available when using standard Oracle connection
pooling mechanisms. Otherwise, much more effort and coding is required.
The idea is similar for all implementations. Application needs to subscribe for Service
Metrics Events (LBA events) and route the workload between connections based on
these recommendations.
During the presentation we cover run-time load balancing using thin JDBC and below
you can find a detailed explanation of the relevant Java code. My original idea was to
include OCI and ODP.NET examples as well. However, due to the lack of time, I won't
be able to cover them during the presentation and adding examples here, just for the sake
of it, doesn't make much sense. Instead, great examples and explanations are available in
[BLUNDHILD].
In this example, we create an application that uses the Oracle JDBC implicit connection
cache with the Fast Connection Failover (FCF) option enabled. FCF in JDBC includes automatic
support for LBA events.
The application will subscribe remotely with ONS daemons running on the RAC nodes
so ONS servers should be configured for remote subscription (remoteport parameter
in ons.config for 10g; 11g is able to pick up remote port from OCR stored
configuration for remote ONS). Connection cache manager subscribes to HA for FCF
functionality and LBA events for run-time load balancing. Connection cache will
gravitate workload to the instance according to the percentage weights from LBA
events. There is some factor of randomness added as well.
Here is an excerpt from the full example. First of all, we create a data source as usual.
Note the format of the connection descriptor.
ods.setConnectionCachingEnabled(true);
Properties prop = new Properties();
prop.setProperty("MinLimit", "5");
prop.setProperty("MaxLimit", "20");
prop.setProperty("InitialLimit", "10");
prop.put (oracle.net.ns.SQLnetDef.TCP_CONNTIMEOUT_STR,"" + (1000)); // 1 sec
ods.setConnectionCacheProperties(prop);
ods.setConnectionCacheName("MyCache");
Above, we set the TCP Connect Timeout property that helps to shorten the delay during VIP
failover. It's needed due to the JDBC specific retry implementation. See [MLN453347.1] for
more details.
Next, we need to enable FCF and subscribe with remote ONS servers to receive FAN
events. JDBC connection cache manager will handle subscription transparently - we just
need to tell it which hosts and ports ONS servers are listening on:
ods.setFastConnectionFailoverEnabled(true);
ods.setONSConfiguration("nodes=lh1:6201,lh2:6201,lh3:6201");
Note that if not all ONS servers are available then it will cause significant delay due to
the TCP timeout for each unavailable ONS. I thought to use VIPs for the remote ONS port but
I'm not sure if there are harmful side effects - it's not documented anywhere that ONS
can be configured on VIPs.
After setup is completed, the first call to ods.getConnection() will allocate initial
connections.
The full program used in the demo during the presentation is available with the online
version of the paper. It's a threaded Java program that executes dummy transactions in parallel.
Oracle provides the ability to run JDBC drivers in debug mode. There is a special version of
the drivers compiled with additional debug options and it has the suffix _g appended. If you
use JDK 1.5 and the normal thin JDBC driver is ojdbc5.jar then the debug version is
ojdbc5_g.jar. You should use the debug version in your CLASSPATH.
You will also need to create a logging properties file (it can be managed at run-time but
we need a simple example here). Here is the minimal setting you would need to trace
operations related to implicit connection cache including run-time load balancing.
oracle.jdbc.pool.level = FINEST
Provided that the properties file is named OracleLog.properties and the debug JDBC .jar
file is in the CLASSPATH, we just need to add a couple of options on the command line
starting the program:
-Doracle.jdbc.Trace=true
-Djava.util.logging.config.file=OracleLog.properties
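The logging level can also be set from code instead of a properties file, since the settings above are ordinary java.util.logging configuration. The sketch below uses only standard java.util.logging calls; the class name PoolTraceSetup is hypothetical, and it is an assumption that the debug driver still needs the -Doracle.jdbc.Trace=true switch before it writes anything to this logger.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: programmatic equivalent of "oracle.jdbc.pool.level = FINEST". */
final class PoolTraceSetup {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage collected.
    private static final Logger POOL_LOGGER = Logger.getLogger("oracle.jdbc.pool");

    /** Raises the pool logger to FINEST and attaches a console handler. */
    static Logger enablePoolTrace() {
        POOL_LOGGER.setLevel(Level.FINEST);
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINEST); // ConsoleHandler defaults to INFO
        POOL_LOGGER.addHandler(handler);
        return POOL_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = enablePoolTrace();
        System.out.println(logger.getName() + " level = " + logger.getLevel());
    }
}
```

Calling enablePoolTrace() more than once would attach several handlers and duplicate every line, so it is meant to run once at start-up.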
The trace produces huge amount of lines but for the purpose of RLB, you would want to
search for parseRuntimeLoadBalancingEvent:
...
Mar 2, 2008 12:43:19 AM oracle.jdbc.pool.OracleConnectionCacheManager
parseRuntimeLoadBalancingEvent
TRACE_16: Enter: "s10gr2", [B@82eca8
...
Mar 2, 2008 12:43:19 AM oracle.jdbc.pool.OracleImplicitConnectionCache
updateDatabaseInstance
TRACE_16: Enter: "g5", "g52", 58, 1
...
Mar 2, 2008 12:43:19 AM oracle.jdbc.pool.OracleImplicitConnectionCache
updateDatabaseInstance
TRACE_16: Enter: "g5", "g51", 42, 1
...
TRACE_20: Debug: (RLB) OracleImplicitConnectionCache.processDatabaseInstances: <<<
ServiceName=s10gr2, Connections to g52 = 8, Attempted Connection Requests to this instance=0,
Total Connections=30 >>>
Mar 2, 2008 12:43:19 AM oracle.jdbc.pool.OracleImplicitConnectionCache
processDatabaseInstances
TRACE_20: Debug: (RLB) OracleImplicitConnectionCache.processDatabaseInstances: <<<
ServiceName=s10gr2, Connections to g51 = 22, Attempted Connection Requests to this
instance=0, Total Connections=30 >>>
...
If you browse through the trace a bit more you can see confirmation that there is some
factor of randomness introduced:
TRACE_20: Debug:
OracleImplicitConnectionCache.retrieveFromConnectionList()RandomPercent=21:
percentSum=58
This trace would also be useful to monitor how FCF is working in JDBC but you might
need to enable tracing for oracle.jdbc and oracle.jdbc.driver in addition to
oracle.jdbc.pool (in OracleLog.properties file).
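The RandomPercent and percentSum values in the trace are consistent with a weighted random pick: draw a random number and walk the instances, accumulating their advisory percentages until the running sum exceeds the draw. The sketch below illustrates that general technique only. It is not Oracle's implementation, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical illustration of a weighted random pick over advisory percentages. */
final class WeightedInstancePicker {

    /**
     * Returns the first instance whose running percentage sum exceeds roll.
     * roll must be in the range [0, total weight); it is a parameter so that
     * the selection logic can be checked with fixed values.
     */
    static String pick(Map<String, Integer> percentByInstance, int roll) {
        int runningSum = 0;
        for (Map.Entry<String, Integer> e : percentByInstance.entrySet()) {
            runningSum += e.getValue();
            if (roll < runningSum) {
                return e.getKey();
            }
        }
        throw new IllegalArgumentException(
                "roll " + roll + " is outside total weight " + runningSum);
    }

    /** Draws a roll uniformly from [0, total weight) and delegates to pick. */
    static String pickRandom(Map<String, Integer> percentByInstance, Random rnd) {
        int total = 0;
        for (int p : percentByInstance.values()) {
            total += p;
        }
        if (total <= 0) {
            throw new IllegalArgumentException("no positive weights");
        }
        return pick(percentByInstance, rnd.nextInt(total));
    }

    public static void main(String[] args) {
        // Percentages taken from the updateDatabaseInstance trace lines above.
        Map<String, Integer> advice = new LinkedHashMap<>();
        advice.put("g52", 58);
        advice.put("g51", 42);

        Random rnd = new Random();
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (int i = 0; i < 10000; i++) {
            hits.merge(pickRandom(advice, rnd), 1, Integer::sum);
        }
        System.out.println(hits); // counts should land near a 58/42 split
    }
}
```

With the values from the trace, a draw of 21 falls below the first running sum of 58, so this sketch would route that request to g52.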
Sometimes, it might not be possible to use the implicit connection cache. Other times,
the standard workload management of the implicit connection cache does not work well for a
particular application. In this case, it's possible to use the Oracle Notification Client
(ONC) API, i.e. it's possible to write your own notification subscriber and implement custom
processing of LBA events.
Unfortunately, in the current implementation of the ONC API (both 10g and 11g), a local ONS
daemon must be running. Remote subscription has not yet been made available for
ONC. This defeats the purpose of thin JDBC as you won't be able to deploy the application
without installing a full blown Oracle client (I think that it's possible to identify only
ONS related parts and distribute them but it won't be supported by Oracle officially). The
ONS daemon is also included with Oracle Application Server.
Full demo of ONC Subscriber is available online (demo 8) and here is a short
walkthrough using Java ONC API.
First of all, we need to include ons.jar in the CLASSPATH and the imports should include
import oracle.ons.*;
We have to define path to the Oracle home with running local ONS server:
System.setProperty("oracle.ons.oraclehome",
"/nfs1/oracle/oracle/product/10.2.0/client");
or selected events (like LBA events) of particular database (g4 database on lh cluster)
Then we can issue a blocking call waiting for the next event
Notification n = s.receive(true);
The call will return when event is posted and we can process it as needed. Below we
print event header, type, body length and the content of the body:
n.print(); // headers
System.out.println(n.type()); // event type
System.out.println(n.body().length); // body length
String event = new String(n.body()); // transform to String
System.out.println(event); // print content
We can even use ONC API to generate events for other subscribers. But... hold on!
Does it mean we can come up with our own Load Balancing Advisories using our own
mechanisms?
In one of the examples, I saw a comment about a special event type javaAPI and an
example onc_publisher.java. Unfortunately, I wasn't able to locate this example on
either Metalink or Google so I had to do some research and found the Publisher class
and functionality related to publishing custom events.
Soon I was able to simulate my own Load Balancing Advisory. The demo example is
provided online as usual (demo 9) with some comments following.
Just like with subscribing using ONC API, a local ONS is required. You can use ONS
running from Clusterware home as well. As with subscriber example,
oracle.ons.oraclehome must be set either on command-line or at run-time.
LBA and HA events are simple text strings in the proper format. Let's say we want to send
an advisory suggesting 80% of the load to instance g51 and 20% to instance g52:
Then we need to create our notification (it accepts only a byte array so we need to convert
the string):
Two empty arguments are affected components and affected nodes. Those are useful
for HA events but can be ignored for LBA events so I left them empty.
p.publish(n);
You can check ONS log file or use ONC subscriber example to verify it. If you trace your
JDBC connection cache, you would see that event was processed appropriately.
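As an illustration of the 80/20 advisory described above, the event text and its byte array form could be assembled along the following lines. This is only a sketch and not the demo code: the class name LbaEventBuilder is hypothetical, and it assumes that the event body uses the same layout as the SYS$RLBTYP payload shown earlier, with the database and service names g5 and s10gr2 taken from the JDBC trace. It stops at producing the bytes and does not call the ONC API.

```java
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds LBA-style event text in the layout shown earlier. */
final class LbaEventBuilder {

    /** Builds the event text; every instance is reported with flag=GOOD. */
    static String build(String database, String service,
                        Map<String, Integer> percentByInstance, Date when) {
        StringBuilder sb = new StringBuilder();
        sb.append("VERSION=1.0 database=").append(database)
          .append(" service=").append(service)
          .append(" { ");
        for (Map.Entry<String, Integer> e : percentByInstance.entrySet()) {
            sb.append("{instance=").append(e.getKey())
              .append(" percent=").append(e.getValue())
              .append(" flag=GOOD}");
        }
        sb.append(" } timestamp=")
          .append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(when));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> advice = new LinkedHashMap<>();
        advice.put("g51", 80);
        advice.put("g52", 20);

        String event = build("g5", "s10gr2", advice, new Date());
        // The notification body is a byte array, so the string is converted.
        byte[] body = event.getBytes(StandardCharsets.US_ASCII);

        System.out.println(event);
        System.out.println(body.length + " bytes");
    }
}
```

Comparing the generated text against a real event captured with the subscriber example is a sensible check before publishing anything to a live system.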
Final thoughts
Oracle has done a nice job providing DBAs and application developers with a number
of tools that allow automation of workload management in RAC to a great extent.
However, it still does require careful configuration, planning, understanding of
application workload patterns and, most importantly, tight collaboration between
developers and database administrators to achieve full integration of applications and
database.
The more standard Oracle components are used, the easier it is to make them work
together and integrate database workload management into a full end-to-end enterprise
load balancing stack. On the other hand, Oracle left the door open for third-party
vendors and custom solutions which require more effort but, at the same time, provide
more flexibility.
Even for applications that are not based on open standards, it's possible to implement
custom workload management scenarios with the ONC API or server-based FAN callouts
that can trigger external manipulation of application configuration in response to certain
events.
References
BLUNDHILD - Barb Lundhild - Automatic Workload Management with Oracle Real
Application Clusters (Oracle 11g)
NAOKO - Naoko Kanemoto Kurinami - NS Solutions: Real World Test Results for
Maximizing Load Balancing with Real Application Clusters
MLN5593693.8 - Metalink Note 5593693.8: Connection load balancing does not work
if db_domain is set
MLN6613950.8 - Metalink Note 6613950.8: RAC load balancing does not work when
one server is heavily loaded
+++
Few DBAs are as well equipped as Alex to handle any kind of database scenario. He
brings to The Pythian Group strong analytical skills and over 10 years of experience
with Oracle technologies. He has special expertise in high-availability, RAC, and
performance monitoring.
In his work before joining The Pythian Group, Alex developed and administered Oracle
databases and applications for LUKOIL, the Russian petrochemicals giant. He later
oversaw high-availability mission-critical applications for Amadeus, the leading Global
Distribution System and biggest processor of travel bookings in the world.
Alex is also a respected figure in the Oracle world, frequently presenting at international
conferences, and regularly publishing articles on The Pythian Group's weblog. Alex is a
member of OakTable Network and has been awarded Oracle Ace title by Oracle.
The Pythian Group is a leading provider of remote database administration services for
Oracle, SQL Server and MySQL. Providing top-to-bottom expertise -- from forensics and
architectural design to monitoring and tuning production environments -- is a hallmark of
our offering. With our expertise, flexible contracting, fractional resourcing, 24/7 service
and global reach we can achieve improved quality of service at a reduced cost.
www.pythian.com