Contents
1 About this test report
1.1 Trademark
1.2 Legal
1.3 Document history
1.4 Contributors
2 Introduction
3 Executive summary
3.1 HA tests with online traffic
3.2 HA tests with COB
3.3 DR tests with online traffic
3.4 DR tests with COB
3.5 Scalability testing
4 Solution deployment
4.1 Solution description
4.1.1 WebSphere terminology
4.1.2 Architecture naming conventions
4.2 Architecture diagram
4.2.1 Processes in test series
4.3 HA design considerations
4.4 DR design considerations
4.5 Software used
5 Testing approach
5.1 Test data
5.2 Tools
5.3 HA tests with online traffic
5.3.1 Test traffic generation
5.3.2 Test validation
JMeter error count
Test execution
5.4 HA tests with COB
5.5 DR tests with online traffic
5.6 DR tests with COB
6 Baseline tests
6.1 Baseline test: COB against fresh database
6.2 Baseline test: COB on DB with added transactions
15 Glossary
1.1 Trademark
IBM is a registered trademark of IBM Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
1.2 Legal
© Copyright 2018 Temenos Headquarters SA. All rights reserved.
The information in this guide relates to TEMENOS information, products and services. It also includes
information, data and keys developed by other parties.
While all reasonable attempts have been made to ensure accuracy, currency and reliability of the content in
this guide, all information is provided "as is".
There is no guarantee as to the completeness, accuracy, timeliness or the results obtained from the use of this
information. No warranty of any kind is given, expressed or implied, including, but not limited to warranties
of performance, merchantability and fitness for a particular purpose.
In no event will TEMENOS be liable to you or anyone else for any decision made or action taken in reliance
on the information in this document or for any consequential, special or similar damages, even if advised of
the possibility of such damages.
TEMENOS does not accept any responsibility for any errors or omissions, or for the results obtained from the
use of this information. Information obtained from this guide should not be used as a substitute for
consultation with TEMENOS.
References and links to external sites and documentation are provided as a service. TEMENOS is not
endorsing any provider of products or services by facilitating access to these sites or documentation from this
guide.
The content of this guide is protected by copyright and trademark law. Apart from fair dealing for the
purposes of private study, research, criticism or review, as permitted under copyright law, no part may be
reproduced or reused for any commercial purposes whatsoever without the prior written permission of the
copyright owner. All trademarks, logos and other marks shown in this guide are the property of their
respective owners.
1.3 Document history
Version Date Change Author
1.4 Contributors
Temenos
Name Role
IBM
Name Role
2 Introduction
This IBM Stack 4 Reference Architecture HADR Test Report presents the results of the High Availability (HA), Disaster Recovery (DR) and scalability testing that Temenos carried out on its Core Banking System.
We tested the architecture using MQ connectivity between the web and application (App) layers. The
software was deployed in line with the R17 Stack 4 for AIX / WebSphere. The stack is supported by Temenos
for all post R16 AMR releases, up to and including R17 AMR.
3 Executive summary
We carried out high availability, disaster recovery and scalability testing on the Temenos Core Banking
system. The tested architecture is n-tier clustered with manual failover to DR. It comprises four layers:
- Web
- Message
- Application
- Data
Cluster load balancing was found to function well during all failure scenarios.
The numbers in the table are totals. For example, in the Host reboot entry, the 11 JMeter errors are the total
errors caused by all three Virtual Machine (VM) reboots.
The infrastructure behaved as expected under all online failure scenarios. These tests involve the most
demanding and non-graceful events, and almost always result in the loss of en route transactions.
HA test description | Test details | JMeter errors | DB missing errors | Average downtime (seconds) | % Availability (assumes 100 events per year)

Web layer
Deployment Manager and Node Agents failures | Kill 1 Deployment Manager (DM) process | 0 | 0 | 0 | 100%
Deployment Manager and Node Agents failures | Kill and restart, one by one, all Node Agents (NA) | 0 | 0 | 0 | 100%

App layer
Application server failure | Kill all Application Server (AS) processes one by one | 1 | 1 | 7 | 99.9999%
Application server failure | Graceful shutdown of AS | 0 | 0 | 220 | 100%
Application server failure | Kill Deployment Manager (DM) | 0 | 0 | 0 | 100%
Application server failure | Kill NA and restart it, and repeat on all NAs | 0 | 0 | 0 | 100%

Data layer
Graceful DB node shutdown | Node shutdown | 45 | 45 | 210 | 99.9990%

Messaging layer
The availability percentage is calculated by assuming one hundred such events over a year and 15
transactions per second (tps) average throughput, as we measured it during the test.
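As a quick check of this calculation (our reading of the formula, stated here as an assumption: effective outage per event = JMeter errors / 15 tps, scaled to 100 events over a 365-day year), the graceful DB node shutdown row works out as follows:

echo "scale=7; (1 - (100 * 45 / 15) / (365 * 24 * 3600)) * 100" | bc
# prints 99.99905, consistent with the 99.9990% quoted for that test in the table above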
COB was started in servlet mode within WebSphere, with two COB agents (tSA) per server, resulting in 6
worker threads in total. The service management in the Temenos Core Banking is transactional and extremely
flexible by design. As expected, all the tests finished without any errors.
In all cases, the uncommitted transaction blocks that resulted from the failure scenarios were rolled back in the DB, and execution resumed using whatever resources were still available.
HA test description | Test details | COB duration (minutes)
COB under failure condition | Kill all AS, DM, NA and IHS one by one, but only when the previous kill has been restored. | 31
COB under failure condition | Shut down one of the RAC nodes and bring it back up. | 31
The COB times are to be compared to the baseline tests with 6 tSAs, which took 30 minutes. The duration differences were not significant.
Downtime in the table is the period when a service is unavailable. Downtime includes an additional voluntary delay (about 2 minutes) between killing the DB and triggering the failover. The test results were as expected, with failed transactions occurring only for the duration of the system downtime.
In the following two tables, we list the availability percentage, which is calculated by assuming 100 such
events over the period of a year.
The Recovery Time Objective (RTO) is the duration of time and a service level within which a business
process must be restored after a disaster in order to avoid unacceptable consequences associated with a break
in continuity. The RTO is listed in the following two tables and it is essentially the time between the first and
last errors in JMeter.
DR test description | Test details | Downtime: DB failover | Downtime: Load balancer switchover | RTO | % Availability

DR test description | Test details | Downtime: DB failover | Downtime: Load balancer failover | RTO | % Availability
Online traffic on live site | Shutdown live DB and execute a manual failover. | 2 mins 22 secs | ~0 secs | ~242 secs | 99.92%
DR test description | Test details | Duration: COB | Duration: DB failover
COB on Live Site | Start COB, shutdown Live DB and execute a manual failover. | 31 mins | 1 min 47 secs

DR test description | Test details | Duration: COB | Duration: DB switchover
3.5 Scalability testing
The scalability tests checked the elasticity of the infrastructure by adding a new Web node and a new Application (App) node. Both tests were successful: as expected, each new node joined the cluster and received traffic according to the load-balancing rules.
SC test description | Test details | Results
Online traffic on live site | Add a new Web layer node to the existing cluster. | Traffic distributed to the new Web Application Server (AS) as part of the round-robin load balancing process.
Online traffic on live site | Add a new App layer node to the existing cluster. | New App node received traffic generated from the Web layer nodes.
4 Solution deployment
4.1 Solution description
Our High Availability solution is a 4-tier architecture, comprising:
- A Web layer.
- An App layer.
- A Messaging layer.
- A Data layer.
On the Live site, both the App layer and the Web layer are in two separate clusters, each containing three application servers (AS). Two of the Web layer nodes also have IBM HTTP Server (IHS) configured to forward the HTTP requests across all three Web layer nodes in a round-robin fashion. Session replication is enabled so that messages are not lost in case one of the nodes fails.
An external Apache Load Balancer (LB) is configured on a separate physical machine to forward incoming
http requests to both IHS instances. Traffic received:
The switch between sites was simulated on the level of the JMeter host, by changing the mapping of the IP in
the “hosts” file.
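As an illustration only (the actual host names and IP addresses are not given in this report), the relevant entry in the JMeter host's "hosts" file would look something like the following, with the live entry replaced by the DR entry at the moment of the switch:

# Placeholders only - names and addresses are assumptions, not taken from the test environment
10.0.1.10    t24-lb.example.local     # Apache LB, live site (active entry)
# 10.0.2.10  t24-lb.example.local     # Apache LB, DR site (swapped in to divert traffic)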
The messaging layer contains two physical hosts, each with a Queue Manager configured in active/standby mode, so that when the active one fails the standby automatically becomes active. The failed QM then needs to be restarted so that it becomes the standby.
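For reference, the following is a minimal sketch of how such a multi-instance (active/standby) Queue Manager pair is brought up and checked. The commands match those used in the transcripts later in this report; the per-host ordering shown is an assumption.

strmqm -x QM1        # on the first messaging host: starts the active instance
strmqm -x QM1        # on the second messaging host: starts the standby instance
dspmq -m QM1         # on either host: shows Running, Running as standby, or Running elsewhere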
The same set up also applies to the DR site, except that the Web and App layers have two instead of three
nodes in their respective clusters.
The Data layer of the DR site is serviced by an Oracle RAC database (DB) with two nodes. This DB is connected to the Live site through Oracle's Data Guard technology, and an infrastructure database has been added to host the schema required by Oracle technology. The RAC database contains the Temenos Core Banking schema only. The database on the DR site is kept in sync with the live site database using Oracle's Data Guard technology.
Deployment Manager (DM) This is the administrative process used to provide a centralized
management view and control for all elements in a WebSphere
Application Server distributed cell, including the management of
clusters.
In our infrastructure, we had two DMs setup for live and two for DR. In
both sites we had one DM for the Web Layer cluster and one for the
Application Layer (App) cluster. Both of these DMs were setup in the
primary server of each cluster (i.e. Web1 and App1).
Application Server (AS) The AS is the primary component of WebSphere. The server runs a Java
Virtual Machine (JVM), providing the runtime environment for the
Temenos code. In essence, the AS provides containers that specialize in
enabling the execution of Temenos libraries and components.
In the tested deployment, we had one AS per node and one node per
host server.
Web layer hosts Referred to as Web followed by an increment: Web1, Web2, and Web3.
Application layer hosts Referred to as App followed by an increment: App1, App2 and App3.
Data layer hosts Referred to as DB followed by an increment: DB1, DB2 and DB3.
For ease of reference, we may refer to a WebSphere process followed by the host increment. For example,
AS1 would mean AS on host #1. The DR site follows the same conventions as above.
4.2 Architecture diagram
The following processes were running in the live site as part of our test series.
MQ1 QM active
MQ2 QM passive
The following processes were running in the DR site as part of our test series.
MQ1 QM active
MQ2 QM passive
4.3 HA design considerations
Architecture Design consideration
External load balancer As the recommended highly available hardware load balancer was not available, two Apache servers, for the Live and DR sites, were configured on two separate physical hosts.
For the online traffic to reach either the live site or the DR site, a Windows VM hosts file needed to be updated with the desired URL/IP of the machine where the Apache server was installed.
Web layer Temenos BrowserWeb was deployed on all three Web Layer cluster members. Online requests coming from the load balancer were distributed across all the Web Layer nodes using the IHS HTTP load balancer. Session replication was enabled in IHS and no additional modifications needed to be made to the BrowserWeb parameters file.
App layer The Temenos Core Banking (T24) and Application Framework (TAFJ) libraries were installed on all 3 App layer cluster members.
We placed both T24 and TAFJ runtime libraries on each cluster member, as opposed to on separate shared storage. Two IP addresses are provided for the 2 active/standby Queue Managers.
4.4 DR design considerations
The live and DR sites are configured in Active/Standby mode.
An infrastructure database has been added to host the schema required by Oracle technology. The RAC database contains the Temenos Core Banking schema only. The database on the DR site is kept in sync with the live site database using Oracle's Data Guard technology.
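The report does not record the exact commands used to move the database role between the sites. Purely as a hedged sketch, a Data Guard broker (DGMGRL) session of roughly the following shape would achieve it; the connect string and database names are placeholders, not taken from this environment:

su - oracle
dgmgrl sys@t24db
DGMGRL> SHOW CONFIGURATION
DGMGRL> SWITCHOVER TO 't24db_dr'
DGMGRL> FAILOVER TO 't24db_dr'

SWITCHOVER is the planned variant relevant to the switchover tests; FAILOVER applies after the live database has been aborted, as in the failover tests later in this report.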
4.5 Software used
Temenos
Software Version
T24 R17
TAFJ R17_SP2
IBM
Software Version
WebSphere 8.5.5.13
IBM MQ 8.0.0.7
Apache
Software Version
Oracle
Software Version
5 Testing approach
The reference architecture exercise focused on the following three areas, each with a set of specific test cases to be carried out:
- Availability testing.
- Disaster recovery testing.
- Scalability testing.
5.1 Test data
We used custom JMeter scripts to commit transactions through the Web User Interface (UI) of Temenos Core Banking. The total number of script cycles was 1000.
5.2 Tools
JMeter and Nmon
The requests are sent to the Apache load balancer, which forwards them to the IHS instances on the two Web Layer nodes. Fifty users are configured: five JMeter instances each execute 10 concurrent threads, for 10 users per instance. A sample file feeds 1000 JMeter script cycles for processing.
Each user iterates 20 times, which means that the 50 users execute the 1000 loops, drawing data from the sample file.
Each thread executes sessions that carry out the following tasks:
2. Login.
3. Create customer
5. Open till.
10. Logoff.
A constant throughput timer is used to limit throughput to 5 Transactions per Second (TPS). Each JMeter testing cycle was executed in its own session. JMeter has been configured to stop the session if a failure occurs during execution and to start a new one.
The JMeter scripts have robust response assertions. In addition, at the end of every test run, the following
JQL scripts are executed against the database to count the total number of records that have been inserted:
The COUNT queries are designed to return all the records that were inserted by the
JMeter scripts (and only these records).
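The scripts themselves are not reproduced here. Purely as an illustration of the shape of these checks (shown here as SQL rather than jQL, with placeholder table names and selection criteria), they amount to counts of the form:

sqlplus -s t24/******@t24db <<'EOF'
-- Placeholders only: the actual jQL/SQL checks, table names and selection criteria differ
SELECT COUNT(*) FROM customer_records WHERE inputter LIKE '%JMETER%';
SELECT COUNT(*) FROM account_records  WHERE inputter LIKE '%JMETER%';
SELECT COUNT(*) FROM teller_records   WHERE inputter LIKE '%JMETER%';
EOF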
Every JMeter thread executes its requests sequentially. If the login page fails during a failure test, then all subsequent transactions in that session will also fail. To avoid registering these additional errors, which are a consequence of the previous step, JMeter has been configured to stop the session and start a new one. Errors logged by JMeter therefore represent errors caused by the failure test.
Errors reported by JMeter reflect what a real end user would see.
Test execution
IHS kill
date;ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}'
date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i;
do kill -9 "$x";done
apachectl -k graceful
date; ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print $2,$(NF-1), $(NF)}'
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
DB Node shutdown
While traffic is running, the following command is used to shut down the DB service.
su - oracle
sqlplus / as sysdba
shutdown transactional;
Graceful shutdown of AS
While traffic is running, use WebSphere console to gracefully shut down and then start the application
servers one at a time.
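The graceful shutdown in this test was driven from the WebSphere admin console. For reference, an equivalent command-line sketch, using the profile path and server names that appear in the transcripts elsewhere in this report, would be:

# Sketch only - graceful stop and restart of one application server from the command line
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stopServer.sh AppSrv01
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/startServer.sh AppSrv01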
Restart of VMs
While traffic is running, use the reboot command to restart the relevant box.
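In transcript form, and assuming no special flags were used (the report only states that the reboot command was used), this amounts to:

date; reboot    # capture the time, then reboot the VM under test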
b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.
3. Update the Windows host file to DR load balancer and then disable routing to live site load balancer.
1. Start COB on the LIVE site and wait until it gets to the Application stage.
a. If switchover test: stop COB on live site first and then execute switchover.
b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.
6 Baseline tests
Running baseline tests involved:
3. Creating a DB restore point which could be used for other COB tests.
As the table shows, after 5000 JMeter script cycles, COB took nearly two minutes longer to complete. There
were no errors recorded.
- Traffic will be routed to the remaining nodes until failed AS has fully restarted.
- Traffic will be balanced between all nodes again once failed AS has fully recovered.
KillAppAS.processes
APP PR1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:22:13 GMT 2018
8388872 eg15ph09Node01 AppSrv01
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:24:38 GMT 2018
Process back up at: 10:24:41
=================== APP PR2
[eg15ph10:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:27:48 GMT 2018
5374440 eg15ph10Node01 AppSrv02
[eg15ph10:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:27:56 GMT 2018
Process back up at: 10:28:01
================================== APP PR3
[eg15ph11:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:29:42 GMT 2018
8913256 eg15ph11Node01 AppSrv03
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:29:47 GMT 2018
Process back up at: 10:29:48
Kill AS process Time Recovery time JMeter errors Adjusted errors Downtime (Approx. secs)
The image below shows a series of JMeter errors. Killing the AS in App1 server caused the Customer error,
which in turn triggered the Account and Cash deposit errors. The killing of AS in App2 and App3 servers
produced no errors.
The errors after 9:27 are due to the restart of the AS in the Web Servers, which is discussed separately in 8.1
Kill of AS process Web layer.
Because one customer failed, the missing records comprise two accounts (local and foreign) and two cash deposits (local and foreign). That means the adjusted total is 1.
At 10:24 the AS process was killed. This led to low CPU activity. The process was automatically restarted,
hence the increased CPU activity, due to initialisation.
After the killing of the AS process, network activity slows down to almost zero. Immediately after the restart
of the process, there is increased network activity due to message inflow and the initialisation of queues.
- Traffic will be routed to the remaining nodes until stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once stopped AS has fully recovered.
GracefulShutDownAppLayerAS.txt
Because this was a graceful kill, the app server had to wait for any transactions to be processed before it shut
down, which is why no errors were recorded.
The graceful kill occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being properly ramped up.
A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.
The system resources usage is similar to that shown in 7.1.2 System resources usage. The only difference is
that due to the graceful shutdown, there is no automatic restarting of AS, which means there's time for the
activity to settle down to near-zero levels.
- Cluster should recover to a normal state once the rebooted host has recovered.
App layer hosts Process affected JMeter errors Adjusted errors Time
The server reboot of App Layer nodes while injecting transactions through JMeter resulted in some loss of data.
The image below shows the three events of the application servers’ reboot. The disturbance at the beginning
of the test is due to the sudden activity from multiple threads.
We corrected the stress method in subsequent tests by having a two-minute ramp-up period where
WebSphere was allowed some time to initialise its connection pools.
Because 3 customers failed, we would expect 6 accounts and 6 cash deposits records to be missing. Instead,
there were an additional 2 account and 2 cash deposits records missing. The 2 missing cash deposits records
are explained by the 2 missing accounts. Therefore, the adjusted total is 3+2+0=5.
We have captured uninterrupted resource utilisation diagrams that encompass the VM restart event.
Although impact was obvious in other diagrams as well, we only list the 2 most interesting ones: Disk and
network activity.
Application Server 1
The actual restart was executed at 15:48:51 and finished at 15:48:52. The server was again operationally ready
at 15:51:30. The initialisation of the various resources of the VM can be clearly seen by this big spike in the
disk activity.
The opposite behaviour is observed as expected in network activity. While the VM is restarting or initialising
resources, communications will decrease; after that we expect an increase higher than before the restart, as
the listeners try to consume all remaining messages that did not timeout.
Application Server 2
The restart was executed at 15:52:54 and finished at 15:52:55. The server was again operationally ready at
15:55:51. The behaviour is similar to Application server 1.
Application Server 3
The restart was executed at 15:56:26 and finished at 15:56:27. The Server was again operationally ready at
15:59:25. The behaviour is similar to Application Server 1 and 2.
- Traffic will be routed to the remaining nodes until failed AS has fully restarted.
- Traffic will be balanced between all nodes again once failed AS has fully recovered.
KillWebAS.processes
There were some session timeout errors recorded in JMeter as the Web Layer AS process went down.
The highlighted failed transaction spikes represent the two failed customer creations, one of which is in the
screenshot above. These were due to the killed AS in Web1. The last 2 spikes are due to the killed AS in Web2
and Web3 servers.
From the above table, since 2 customers failed, we expected that to be followed by 4 accounts and 4 cash
deposits missing records. Instead, there was an additional 1 cash deposit missing. That means the adjusted
total is 2+0+1=3.
Restarting the Web Server did not have an impact on most resource utilisation diagrams. The CPU disturbance was of the same order of magnitude as other random events seen during normal operation. The reason is that the servers are sized to handle much bigger traffic loads, so the CPU can easily handle all tasks. Network utilisation is not helpful either, because the calls keep coming to the VM whether the Web Server is up or not. There are peaks in the diagram, but not significant enough to indicate an extraordinary event.
The disk diagrams, on the other hand, are a different case: there we can see the restart of the AS process quite clearly. Following are the disk diagrams, along with CPU and network, as an interesting comparison between Web and Application Server behaviour. See also 7 Application layer HA Tests.
Web Server 1
Web Server 2
Web Server 3
- Traffic will be routed to the remaining nodes until stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once stopped AS has fully recovered.
GracefulShutDownWebLayerAS.txt
Kill AS process Time Recovery time JMeter errors Adjusted errors Downtime (Approx. mins)
A graceful shutdown of the Web Layer AS process did not cause any loss of transactions. The screenshot below shows a healthy JMeter results tree for the duration of the test.
The graceful kill occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being properly ramped up.
A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.
The system resources usage here is similar to that shown in 8.1.3 System resources usage. The only
difference is that due to the graceful shutdown, there is no automatic restarting of AS, and thus time for the
activity to settle down to near-zero levels.
The network is not affected, unlike the Application Server shutdown, due to the continuous requests coming
from the injector. In the case of Application Server, the messages are pulled from MQ, not pushed by an
injector.
Note the low CPU utilisation (5-10%) because of the over-sized capacity of the Web Server VMs.
Web Server 1
Web Server 2
Web Server 3
- No traffic disturbance should occur and there would be no impact on existing application servers.
- The admin console should become available again once the DM has fully recovered.
dmkill.txt
was-dm=================================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:46:49 GMT 2018
9699660 dmgr -dmgr
10158454 eg15ph06CellManager01 dmgr
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Feb 20 14:46:53 GMT 2018
As expected, the WebSphere console was not accessible while the Deployment Manager process was down, and there was no traffic disturbance.
The Deployment Manager is not responsible for any traffic, so the network is not affected. CPU and disk activity, however, are affected, since killing and restarting the process entails initialisation of resources and an update of state from the servers in the cluster.
Web Server 1
Application Server 1
- No traffic disturbance should occur and no impact on existing ASs or the Admin server.
Kill_NA_on_App_n_Web_layer.txt
was-dm=================================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:46:49 GMT 2018
9699660 dmgr -dmgr
10158454 eg15ph06CellManager01 dmgr
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Feb 20 14:46:53 GMT 2018
=========app pr1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:50:16 GMT 2018
9765202 eg15ph09Node01 nodeagent
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:50:27 GMT 2018
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:51:24 GMT 2018
[eg15ph09:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:51:51 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
============app pr 3==========================
date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:15 GMT 2018
9240874 eg15ph11Node01 nodeagent
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:56:32 GMT 2018
[eg15ph11:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:38 GMT 2018
[eg15ph11:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:57:50 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 6226370
=was pr 1====================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:34 GMT 2018
5898516 eg15ph06Node01 nodeagent
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:00:36 GMT 2018
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:44 GMT 2018
[eg15ph06:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:01:55 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 11272492
exit code: 0
was pr 2===================
[eg15ph07:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
=======waspr 3============
[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:05:23 GMT 2018
10027350 eg15ph08Node01 nodeagent
[eg15ph08:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:05:52 GMT 2018
[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:06:00 GMT 2018
[eg15ph08:root] / #
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
[eg15ph08:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:07:29 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 8978700
exit code: 0
[eg15ph08:root] / #
Node Agent | Kill time | JMeter errors | Adjusted errors
AppNA1 | 14:50:27 | 0 | 0
AppNA2 | 14:54:16 | 0 | 0
AppNA3 | 14:56:32 | 0 | 0
WebNA1 | 15:00:36 | 0 | 0
WebNA2 | 15:02:42 | 0 | 0
WebNA3 | 15:05:52 | 0 | 0
Application Server 1
The disk utilisation diagram for Application Server 1 is significantly different from Servers 2 and 3. This is because Server 1 is also running the Deployment Manager.
Application Server 2
Notice the difference in the disk utilisation profile in comparison to Application Server 1.
Application Server 3
Web Server 1
The disk utilisation diagram for Web Server 1 is significantly different from Servers 2 and 3. This is because
Server 1 is running the Deployment Manager as well.
Web Server 2
Notice the difference in the disk utilisation profile, in comparison to Web Server 1.
Web Server 3
With IHS configured to restart itself upon failure and session replication enabled in IHS, JMeter was started
and the following behaviour was expected:
- Traffic will be routed to the remaining IHS node until failed IHS has fully recovered.
- Traffic will be balanced between the two IHS nodes once failed IHS has fully recovered.
kill-shutdown-webserver1.txt
14:23:51 7 1
kill-shutdown-webserver2.txt
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
14:30:53 0 0
The expected result was no loss of transactions, but we saw a proxy error during a customer creation, which resulted in the loss of one customer and its corresponding accounts and cash deposits downstream.
As a result of one customer not being created, there was a domino effect of errors on the remaining transactions that depended on that missing customer.
The sizing is such that the Web Servers are never fully stressed. That does not allow us to draw conclusions on CPU and disk utilisation when affecting such lightweight components as IHS. The network utilisation diagram is the one that gives the most accurate picture.
IHS1
The failure event takes place in the circled portion of the above diagram. It quite clearly shows how traffic drops to zero and then resumes to normal. Please note that the difference in network activity between IHS 1 and IHS 2 is due to the fact that they are housed in the VMs of Web Server 1 and 2 respectively. This means that along with IHS activity we also have NA activity, and additionally DM activity for Server 1.
IHS2
Similar to IHS 1 above, the circled period shows the drop in network activity. After the restart, normal
operation resumes.
- Traffic would be routed to the remaining IHS until the failed host was back.
- Traffic would be balanced between the two IHS instances once the failed IHS had fully recovered.
graceful-shutdown-webserver1.txt
IHS1 14:34:09
graceful-shutdown-webserver2.txt
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 6947098 9634156 0 14:36:23 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 7602520 1 0 14:07:58 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
t24user 8323366 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 8716738 9568694 0 14:36:37 pts/0 0:00 grep http
t24user 9634156 1 0 14:32:29 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph07:t24user] /bin $
IHS2 14:36:23
When IHS was gracefully shut down, no data loss was experienced. The initial disturbance at the beginning of the test is, as in a few other tests, the result of not ramping up the injected load from JMeter.
This test does not show any difference in workload, as the IHS is shut down gracefully and is thus able to rebalance the traffic seamlessly to the remaining IHS. The fact that both IHS instances are housed on the same VMs as the Web Servers means that any slight difference in CPU and disk activity is masked by the other processes. The network activity is not impacted because the messages are rerouted to IHS 2 and are then balanced between the Web Servers normally.
The following diagrams demonstrate this. In the same way as 8.5 Shutdown and start of IHS processes on
Web layer, the DM in Web Server 1 masks the activity based purely on message exchange. Network activity
between Web Server 2 and 3 is much more similar.
Web Server 1
Web Server 2
Web Server 3
- Some disturbance.
- Cluster should recover to a normal state after each rebooted host has recovered.
Web layer hosts Process affected JMeter errors JMeter adjusted Time
The server reboot of Web Layer nodes while injecting transactions through JMeter resulted in proxy errors being reported and some loss of data.
In the following image we can see two purple spikes that represent two errors, followed by a third one after a while. These were caused by the restart of the Web2 node. Notice that the third error was not caused by the actual restart but by one of the two previously failed transactions.
Given that 3 customers failed, we expect that to be followed by 6 accounts and 6 cash deposits missing
records. Instead, there were an additional 2 account and 2 cash deposits records missing. The 2 missing cash
deposits records are explained by the 2 missing accounts. Therefore, the adjusted total is 3+2+0=5.
This test was carried out in conjunction with the restart of the App layer VM nodes, as mentioned in 8.7 Restart of Web layer VM Nodes. During the test, the behaviour was as expected for all servers that were restarted.
Just before the restart of the last Web Server, all the transactions had been pumped through and the test was thus de facto over. We decided not to repeat the test just to see Web Server 3 restarting, as we had already proven the resilience of the system. Following are the most relevant resource utilisation diagrams for Web Servers 1 and 2.
Web Server 1
Web Server 2
The CPU utilisation of Web Servers 2 and 3 was typically half that of Web Server 1. This seemed to be the case for all our tests. The reason is that Web Server 1 is the server running the DM on behalf of the cluster.
The network activity observed after the restart is not typical; we would expect it to bounce right back up to the previous levels. The reason is that the test was drawing to an end: the transactions were already ramping down and the parallel sessions from JMeter were slowly reaching zero, according to the test rules.
Downtime is the time taken by the standby to become active. For ease of reference, QM on host 1 will be QM
1 and QM on host 2, QM 2.
killing-QM1.txt
6 2 9
killing-QM2.txt
===========MQ1 restart=====================================
[eg15ph04:root] / #strmqm -x QM1
[eg15ph04:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running as standby)
[eg15ph04:root] /
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #
========Kill MQ 2=====================
[eg15ph05:mqm] /mqm # ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print
$2,$(NF-1), $(NF)}'
7930338 -u mqm
[eg15ph05:mqm] /mqm #
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running elsewhere)
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #
44 28 26
rebooting-VM-(QM1).txt
2 1 31
We experienced data loss after killing the Queue Manager. This would have happened during the switchover to the standby Queue Manager. Below is a sample of the errors as seen in JMeter.
The following tables show the data loss for the different transactions done through JMeter.
We observed that failover from QM on host 2 (QM 2) to QM on host 1 (QM 1) took longer and produced
more errors than failover from QM 1 to QM 2.
Failover from QM 1 to QM 2
Failover from QM 2 to QM 1
Restart VM 1 (host 1)
This diagram is indicative of the disturbance in the transactions. At the time of the QM 1 failure we have two
transactions failing, which happened to be en route from QM 1. Afterwards we see ripple effects because it so
happened that one of the failing transactions was a Customer creation.
As there is no logic in the JMeter script to run transactions conditionally, all subsequent Account and Cash Deposit transactions for this missing customer fail because of the Temenos Core Banking application logic.
Time of QM 1 kill is 11:04:07. The switch from the primary QM to the Stand-by QM can be seen in the
following two network utilisation diagrams from QM 1 and QM 2.
QM 1
The red line indicates the time of failure of QM 1. As expected all network activity drops to near zero.
QM 2
Immediately after the QM 1 failure, QM 2 picked up the load. The Web Servers were not disrupted for long: Web1 shows a hiccup at the time of the event, and Web2 and Web3 show similar behaviour.
Web1
Similar to Web Servers, App servers have a small disruption at the time of the event.
App2
The effects are delayed a bit as we move lower in the architecture. Any impact on the DB is mostly lost in the runtime turbulence.
DB1
Reverting to the first QM has a similar effect. Now QM 2 is killed so that QM 1 will take over.
QM 2
QM 1
Immediately after QM 2 goes down, QM 1 takes over again. The Web Servers show the usual disruption at the time of the event.
Web1
We see similar network utilisation diagrams for Web2 and 3. Application Servers have a similar disruption as
with Web Servers. For example, observe App1:
This time the DB servers are impacted by the QM failovers. In the previous test, when QM 1 failed over to QM 2, there was little or no disruption. In addition, we do not see any time shifting of the disruption as we move lower in the architecture.
DB1
- The load was not high enough, as the DB servers were sized such that each one alone was more than capable of handling one JMeter instance.
- The messages from a JMeter thread tended to be routed to the same DB node, as long as there was capacity to serve them.
To overcome this effect and to prove the rebalancing of the DB nodes, we used 5 instances of JMeter with 10
threads each. Three of them started from the beginning and an additional instance would be started after the
restart of each of the two DB nodes. Thus, the test script was the following:
Shutdown transactional
4. Start DB 1.
Shutdown transactional
7. Start DB 2.
- Inflight transactions that have been started on the restarted node are expected to fail.
- Once the node has fully started, traffic will be balanced between the three nodes.
DB node Stop time Command JMeter errors Adjusted errors Start time Command
The DB nodes were removed from service with a small disturbance and, when started again, successfully contributed to the cluster once more. The total number of transactions missing from the DB was 45 across both shutdowns.
The following table summarises the effect this test had in the application data.
Given that 8 customers failed, we expect that to be followed by 16 accounts. Since we had 21 missing
accounts, that means that we had an additional 21-16=5 failed transactions. Because 21 accounts are missing,
we expect this to trigger 21 missing deposits as well. Therefore, 53-21=32 missing deposits cannot be
explained. So the actual failed transactions due to the test were 8+5+32=45 and all the rest were due to the
application logic and the JMeter test script design.
In this view we can see all 3 DB nodes actively serving the application.
Here we can see that DB1 is contributing to the cluster equally with DB 3. DB 2 is missing, but this does not stop the service.
During the test, the DB behaved well and balanced the load correctly. The DB capacity was quite large, so the stress never reached high levels.
DB 1
This is the full screenshot of the Enterprise Manager (EM) Console for DB 1 load. This is taken after the
restart of DB 1 (9:01 - 9:04). Observe that the node starts picking up load.
At 9:06 DB 2 goes down with no adverse effects on the DB 1 node. After DB 2 comes back up again at 9:10, operation is normal. The increase that takes place at around 9:14 is purely operational and not a consequence of the DB 2 restart.
DB 2
This screenshot is from DB 2, after its restart (9:06-9:10). The utilisation diagram is similar to that of DB 1, leaving aside the missing load data from before the restart.
11 COB HA tests
Using the baseline database, failure tests were performed during the COB. The test involved:
- Killing Application layer AS nodes, the Admin server, Node Agents and IHS one by one, but only when the previous kill has been restored.
Expected result:
- COB is expected to take longer when app servers in the app layer or DB nodes are killed.
- These failures should not stop COB from progressing and finishing successfully.
- When COB is finished, there should be no difference in database performance compared to the baseline test.
cob_under_failure_condition.txt
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:19 GMT 2018
9503156 eg15ph09Node01 nodeagent
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:49:27 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:31 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:39 GMT 2018
[eg15ph09:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9765370
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:50:16 GMT 2018
9765370 eg15ph09Node01 nodeagent
------------
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:07 GMT 2018
6750702 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:51:15 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:18 GMT 2018
10879470 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $
=================
APPPR2:
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:13 GMT 2018
8716644 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:52:19 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:23 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:24 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:26 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:50 GMT 2018
[eg15ph10:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 10354998
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:27 GMT 2018
10354998 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $
----------
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:54 GMT 2018
9634140 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:54:00 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:04 GMT 2018
6291928 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $
=====================
APPPR3:
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:55 GMT 2018
5439912 eg15ph11Node01 nodeagent
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:55:01 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:12 GMT 2018
[eg15ph11:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9503016
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:51 GMT 2018
9503016 eg15ph11Node01 nodeagent
----------------------------------
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:00 GMT 2018
7864616 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:56:08 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:11 GMT 2018
7864618 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $
=================================
DBPR1 : ~ couple of minutes after APPPR3.
[LIVE-DB3 root@eg15ph18 /]# su - oracle
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
=======================
DBPR2 : ~8:35 GMT
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
===================================
DBPR3: ~8:35 GMT
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
In the above table we show the time at which each process was killed and the time at which it was back up and running. COB is driven by threads running inside WebSphere at the application layer.
When the Application Server process is killed and automatically restarted, the Temenos Service Manager
(TSM) needs to be manually restarted as well, or else COB will have no workers to make progress. That is
captured in the table above by signifying the time TSM was started (i.e. by “TSM @ HH:MM:SS”).
11:26:24 11:57:52 31 mins 28 secs 31 mins 2 secs Times extracted from the como file.
All Core Banking services, and indeed COB, are very resilient and flexible. Agents can be started automatically or on demand with no disturbance to the process.
As expected, COB finished successfully with no errors and no differences in the record count, although it took slightly longer because of the time lost restarting the agents.
During this test the load was not coming from the Web Server / MQ; it was generated by tSA agents running inside the WebSphere JVM process. As a consequence, the DM and NA do not have any impact on the operation of the COB.
Failure of AppDM
The COB is not affected at all, which means that CPU, memory and network are not affected either. For example, note in the following diagram the CPU activity for the duration of the DM failure:
There is a spike, but this is hardly related to an extraordinary event. Rather, it is part of normal ups and
downs in COB activity, due to changing from CPU intensive to DB intensive tasks.
The only diagram that signifies an exceptional event is the disk activity. COB is not very intensive on the application server's disk, so re-initialising the DM produces disk activity that stands out:
Similarly to the DM failure, an NA failure does not affect COB in the slightest. Moreover, disk activity is not significant either, because the NA is by nature not a heavy process. The only impact on the system is the somewhat increased process switching during the failure. The following diagram shows the process switches diagram for App 1.
The blue circle shows the NA failure. The left spike (orange circle) is due to the DM failure. The right spike (orange circle) is due to the AS failure that will be discussed in Failure of AppAS 1, 2 and 3.
The blue circle shows the NA failure. The orange one is due to the AS failure that will be discussed in Failure of AppAS 1, 2 and 3. Finally, we can now easily recognise the failure of the NA in App 3:
The blue circle shows the NA failure. The orange one, as usual, marks the AS failure.
Killing the AS kills the JVM, and with it all of its threads, which means that all COB services die unexpectedly. This directly affects the execution of COB, which is interrupted on the affected server, and all uncommitted transaction blocks resulting from the unexpected termination of those COB services are rolled back in the DB. The following graphs show the resource impact on App 1:
The circled area is the part of the diagram that shows the AS failure. We can see that the Temenos Core Banking services die and, after some brief CPU activity due to the automatic restart and initialisation of the AS, the server stops contributing to the COB service. Note that COB carries on normally because of the agents running on the rest of the servers in the cluster. After the services manager (TSM) is restarted, we do not observe immediate activity; this is because TSM takes some time to bring the required COB agents back up.
The effects on the DB are not severe, since in each failure scenario only 1/3 of the agents are affected. CPU is not affected much because the remaining COB agents continue to compete for faster execution. For example, the following is the CPU diagram of DB 1; the picture is similar on DB 2 and DB 3. Note the insignificant CPU fluctuations during the three AS failure events:
Network is much more affected, since all transaction inflow from 1 of the 3 servers is interrupted. All three
DB servers have similar network utilisation diagrams:
In all three DB servers there is an impact on network traffic, but nothing significant regarding CPU and
memory.
1. DB 1 goes down and stays down for the duration of the COB.
We did not expect to see COB finish earlier than in the baseline test. When COB is executed, the ordering of the COB jobs is determined at the start of COB. This means that when running COB several times, even with an identically restored DB under similar conditions, the duration is expected to vary by a few minutes.
The difference of 2 mins for test #2 can be easily explained by this variation in the COB duration. The
difference of 4 mins for test #1 is more significant. Since during this test one DB node was down for most of
the duration of the test, some of the time difference could be attributed to the decreased cluster
communication overhead between the DB nodes. This is an assumption which could not be verified.
Both tests (shutdown and restart) were successful. The resource utilisation for the active DB nodes does not
change noticeably when the targeted node goes offline.
Both the CPU and network diagrams show the failure event clearly. The CPU utilisation of DB 1 is shown in the following graph:
The red vertical line marks the time DB 2 goes down. DB 2, on the other hand, loses most of its CPU activity after the event:
We don't need to show the CPU utilisation diagram for all the Application servers - all of them are similar to
the following (App 1):
In all diagrams, both on the DB and the App side, we notice a dip in activity about two minutes after DB 2 goes down. This is probably due to the rolling back of the transactions submitted to DB 2, which temporarily blocks the normal progress of COB.
Both the CPU and network diagrams show the failure event clearly. The CPU utilisation of DB 1 is shown in the following graph:
The red line marks the time DB 2 goes offline and the green line the time it comes back online. DB 2, on the other hand, loses most of its CPU activity after the event, until the restart:
There is no reason to show the CPU utilisation diagram of all the application servers; all of them are similar to the following (App 1):
The peak in activity at around 13:38 occurs too long after the shutdown of DB 2 to have been caused by it. Most probably it is purely operational (for example, a multithreaded, CPU-intensive job). This fluctuation is not reflected on the DB nodes.
• Execute the switchover.
• Online traffic is diverted by updating the "hosts" file on the JMeter server to point to the DR load balancer (a minimal sketch of this redirection follows below).
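The hosts-file change is the only step here that lends itself to simple automation. The following is a minimal Python sketch of how the redirection could be scripted on the JMeter server; the application hostname, the DR load balancer IP and the hosts file path are placeholders, not values from the test environment.

```python
"""Minimal sketch, assuming a Unix-like JMeter host: repoint the application
hostname at the DR load balancer by rewriting the hosts file."""

HOSTS_FILE = "/etc/hosts"                  # assumed hosts file location
APP_HOSTNAME = "core-banking.example.com"  # hypothetical application FQDN
DR_LB_IP = "10.1.2.10"                     # hypothetical DR load balancer IP

def point_app_to_dr():
    with open(HOSTS_FILE) as f:
        lines = f.readlines()
    # Drop any existing mapping for the application hostname (ignoring comments),
    # then append an entry that resolves it to the DR load balancer.
    kept = [line for line in lines if APP_HOSTNAME not in line.split("#")[0]]
    kept.append(f"{DR_LB_IP}\t{APP_HOSTNAME}\n")
    with open(HOSTS_FILE, "w") as f:
        f.writelines(kept)

if __name__ == "__main__":
    point_app_to_dr()  # requires privileges to write the hosts file
```

Reverting to the live site is the same operation with the live load balancer IP in place of the DR one.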
The JMeter results are not important in this particular test, as the errors depended only on how long we waited during the manual switchover. All transactions before and after the switchover were successful.
All failures took place during the switchover, as shown by the JMeter TPS diagram.
The exact number of missing records is not important in this test. The longer we chose to wait before the manual switchover, the longer the online traffic would keep failing.
Note that the adjusted total in this case should be about the same as the total: although there is a business reason why some transactions would fail because of a missing customer, in this case we know that communications were down, so all the failures were caused by that.
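As an illustration of how the raw JMeter output supports these observations, the sketch below counts the failed samples and how many of them fall inside the switchover window. It assumes the results were saved in JMeter's default CSV (.jtl) format with a header row; the file name and window timestamps are placeholders.

```python
import csv
from datetime import datetime

RESULTS_FILE = "switchover_results.jtl"      # hypothetical CSV-format results file
WINDOW_START = datetime(2018, 5, 14, 13, 20) # placeholder: switchover start
WINDOW_END   = datetime(2018, 5, 14, 13, 26) # placeholder: switchover end

total = failures = failures_in_window = 0
with open(RESULTS_FILE, newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        if row["success"].lower() != "true":  # default JMeter success column
            failures += 1
            # timeStamp is epoch milliseconds in the default JMeter CSV output
            ts = datetime.fromtimestamp(int(row["timeStamp"]) / 1000)
            if WINDOW_START <= ts <= WINDOW_END:
                failures_in_window += 1

print(f"samples: {total}, failed: {failures}, "
      f"failed during switchover window: {failures_in_window}")
```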
Since the traffic is diverted from the live site to the DR site, the resource utilisation diagrams are complementary: on the left side we have the live servers and on the right the DR ones.
For brevity, and because of the live vs DR topology differences, only the primary cluster server from each layer is compared. The red line on the left diagram marks the live site switchover and the green line on the right marks the DR switchover event.
Load balancers
Web servers
Application servers
DB servers
12.2 Site failover
Site failover test summary:
• Simulate the failure: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute the failover (a scripted version of this step is sketched after this list).
• Online traffic is diverted by updating the "hosts" file on the JMeter server to point to the DR load balancer, as in the switchover test.
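Where the failure simulation is driven from a script rather than typed at the console, the abort could look like the following sketch. It wraps the srvctl abort quoted above and adds a status check to confirm the database is reported as down before the failover is initiated; the status check is an assumption, not a documented step of the test, and the script is assumed to run where srvctl is on the path with the Oracle environment set.

```python
import subprocess
import time

DB_UNIQUE_NAME = "t24db"  # database name as used in the test summary above

def run(cmd):
    # Echo the command and print its output, so the step is visible in the test log.
    print("+", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout, result.stderr)
    return result

# Simulate the failure: abort the database.
run(["srvctl", "stop", "database", "-d", DB_UNIQUE_NAME, "-o", "abort"])

# Wait a couple of minutes, as in the test procedure, before failing over.
time.sleep(120)

# Confirm the database is reported as stopped before initiating the failover.
run(["srvctl", "status", "database", "-d", DB_UNIQUE_NAME])
```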
The JMeter results are not important in this particular test, as the errors depended only on how long we waited during the failover. The important element of this test is that the failover took place and normal operation continued with the help of the DR site.
The exact number of missing records is not important in this test. The longer we chose to wait before the failover, the longer the online traffic would keep failing.
Note that the adjusted total should be about the same as the total. Although there is a business reason why some transactions would fail if a customer were missing, in this case we know that almost all the failures were caused by communications being down.
This test is quite similar to the switchover with regard to resource utilisation. Again, as the traffic is diverted from the live site to the DR site, the resource utilisation diagrams are complementary: on the left side we have the live servers and on the right the DR servers.
For brevity, and because of the live vs DR topology differences, only the primary cluster members from each layer are compared. The red line on the left diagram marks the failover event on the live site and the green line on the right marks the failover event on the DR site.
Load balancers
Web Servers
Application servers
COB was started in servlet mode with two agents running on each of the three live site nodes. At 15.93% progress through the System Wide Stage, the switchover was initiated without stopping COB.
After the switchover finished, we raised, in servlet mode, two COB agents on each of the two application servers of the DR site. COB continued normally and completed with no errors.
The downtime for the switchover was about 6 minutes. Even though the actual switchover took approximately 3 minutes, the database needed at least 2 minutes to get into a functioning state.
When running COB, only the Application and Data layers are utilised, so that is what we concentrate on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from these two layers are compared.
Application Servers
Database Servers
13.2 Site failover
The objective of this test was to execute a failover during the COB. The expected result was that the COB
would be able to finish normally on the DR site and with no errors.
COB was started in servlet mode with two agents running on each of the three live site nodes. At 72% progress through the System Wide Stage, the failure was simulated without stopping COB. After the failover finished, we raised, in servlet mode, two COB agents on each of the two application servers of the DR site. COB continued normally and completed with no errors.
The image below shows COB resuming on the DR site after a successful failover process:
When running COB, only the Application and Data layers are utilised, so that is what we concentrate on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from these two layers are compared.
Application Servers
Both scalability tests (i.e. adding a node to the Web layer and to the App layer) were executed as part of the same test run. The test was set up with one instance of JMeter pumping transactions with 10 concurrent users, each one executing 200 full test cycles. One test cycle is a user login, creation of a customer, a foreign and a local account, and so on, up to the logout.
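The load described above (a single JMeter instance, 10 concurrent users, 200 full test cycles each) would typically be started in non-GUI mode. The following sketch shows one way to drive such an invocation from Python; the test plan file name and the two property names are assumptions about how the plan is parameterised, not artefacts from this test.

```python
import subprocess

# Assumed test plan and property names; the actual plan used in the test
# may hard-code the thread and loop counts instead of reading properties.
cmd = [
    "jmeter",
    "-n",                        # non-GUI mode
    "-t", "t24_full_cycle.jmx",  # hypothetical test plan file
    "-l", "scalability.jtl",     # results file for later analysis
    "-Jusers=10",                # 10 concurrent users
    "-Jloops=200",               # 200 full test cycles per user
]
subprocess.run(cmd, check=True)
```

The nodes were added to the running clusters at the following times: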
Node added    Time
Web 3         17:01:26
App 3         17:06:08
As we observed in previous tests, a graceful shutdown of a server does not cause any missing DB records. In this instance we do not have a shutdown but the addition of a server to the cluster; however, it involves no sudden failure, forced restart or anything else unexpected, so it falls under the same principle.
IHS was able to pick up the new node, online traffic was balanced among all three nodes and there was no impact on the existing traffic.
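One simple way to confirm that traffic really is being spread across all three web nodes is to sample a number of requests and tally which backend served each one. The sketch below assumes a hypothetical X-Backend-Server response header that identifies the serving node and a placeholder application URL; in the test itself the distribution was judged from the network utilisation diagrams instead.

```python
from collections import Counter
from urllib.request import urlopen

URL = "http://core-banking.example.com/"  # hypothetical application URL
SAMPLES = 300

counts = Counter()
for _ in range(SAMPLES):
    with urlopen(URL) as resp:
        # X-Backend-Server is a hypothetical header exposing the serving node.
        counts[resp.headers.get("X-Backend-Server", "unknown")] += 1

for node, hits in counts.most_common():
    print(f"{node}: {hits} requests")
```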
The Network activity demonstrates the load distribution and the behaviour of the cluster at the time of the
new server addition. The high peaks and low valleys in the following diagrams are part of the normal
operation during the test and they take place even before the node addition, so the reader should not
interpret them as consequences of the node addition.
This diagram shows the activity of the Web 2 node. For convenience there are two marks: the left one is the Web 3 node addition and the right one is the App 3 node addition, which is discussed in 14.2 Adding an App layer node to the existing cluster. We get a similar picture from the Web 1 network activity diagram, but for clarity it is not presented here.
The following diagram is the network activity from the Web 3 node that was actually added at 17:01:26:
The load from the web layer was uniformly distributed on all app servers in the app layer and there was no
impact on the existing traffic.
As with the test above, we present the network traffic diagram as the most revealing one. The following is the diagram for App 2, which is part of the cluster for the entire duration of the test:
As before, the high peaks and low valleys are part of the normal operation during the test and occur even before the node addition, so they should not be interpreted as consequences of the node addition.
A short while after being added to the Application layer cluster, the new node starts picking up load and contributes as an equal member.
15 Glossary
Term        Definition
Appx        App1, App2 and App3 are abbreviations for Application Server 1, 2 and 3 respectively.
ASX         AS1, AS2 and AS3 are abbreviations for the WebSphere Application Server on Server 1, 2 and 3 respectively, on either the Web layer or the App layer.
BrowserWeb  Temenos Web UI, used for accessing Temenos Core Banking.
DB          Database
DBX         DB1, DB2 and DB3 are abbreviations for Database Server 1, 2 and 3 respectively.
DR          Disaster Recovery
HA          High Availability
IHSX        IHS1 and IHS2 are abbreviations for IBM HTTP Server 1 and 2 respectively.
NAX         NA1, NA2 and NA3 are abbreviations for the WebSphere Node Agent on Server 1, 2 and 3 respectively, on either the Web layer or the App layer.
QM          Queue Manager
T24         T24 was the initial name for Temenos's Core Banking solution.
TSM         Core Banking Service Manager. This is the manager process of all tSA (see above) for a particular Application Server instance.
UI          User Interface
VM          Virtual Machine
WebX        Web1, Web2 and Web3 are abbreviations for Web Server 1, 2 and 3 respectively.