Contents
1 About this test report
1.1 Trademark
1.2 Legal
1.3 Document history
1.4 Contributors
2 Introduction
3 Executive summary
3.1 HA tests with online traffic
3.2 HA tests with COB
3.3 DR tests with online traffic
3.4 DR tests with COB
3.5 Scalability testing
4 Solution deployment
4.1 Solution description
4.1.1 WebSphere terminology
4.1.2 Architecture naming conventions
4.2 Architecture diagram
4.2.1 Processes in test series
4.3 HA design considerations
4.4 DR design considerations
4.5 Software used
5 Testing approach
5.1 Test data
5.2 Tools
5.3 HA tests with online traffic
5.3.1 Test traffic generation
5.3.2 Test validation
JMeter error count
Test execution
5.4 HA tests with COB
5.5 DR tests with online traffic
5.6 DR tests with COB
6 Baseline tests
6.1 Baseline test: COB against fresh database
6.2 Baseline test: COB on DB with added transactions
15 Glossary
1.1 Trademark
IBM is a registered trademark of IBM Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
1.2 Legal
© Copyright 2018 Temenos Headquarters SA. All rights reserved.
The information in this guide relates to TEMENOS information, products and services. It also includes
information, data and keys developed by other parties.
While all reasonable attempts have been made to ensure accuracy, currency and reliability of the content in
this guide, all information is provided "as is".
There is no guarantee as to the completeness, accuracy, timeliness or the results obtained from the use of this
information. No warranty of any kind is given, expressed or implied, including, but not limited to warranties
of performance, merchantability and fitness for a particular purpose.
In no event will TEMENOS be liable to you or anyone else for any decision made or action taken in reliance
on the information in this document or for any consequential, special or similar damages, even if advised of
the possibility of such damages.
TEMENOS does not accept any responsibility for any errors or omissions, or for the results obtained from the
use of this information. Information obtained from this guide should not be used as a substitute for
consultation with TEMENOS.
References and links to external sites and documentation are provided as a service. TEMENOS is not
endorsing any provider of products or services by facilitating access to these sites or documentation from this
guide.
The content of this guide is protected by copyright and trademark law. Apart from fair dealing for the
purposes of private study, research, criticism or review, as permitted under copyright law, no part may be
reproduced or reused for any commercial purposes whatsoever without the prior written permission of the
copyright owner. All trademarks, logos and other marks shown in this guide are the property of their
respective owners.
1.3 Document history
Version Date Change Author
1.4 Contributors
Temenos
Name Role
IBM
Name Role
2 Introduction
This IBM Stack 4 Reference Architecture HADR Test Report presents the results of the High Availability (HA), Disaster Recovery (DR) and scalability testing that Temenos carried out on its Core Banking System.
We tested the architecture using MQ connectivity between the web and application (App) layers. The
software was deployed in line with the R17 Stack 4 for AIX / WebSphere. The stack is supported by Temenos
for all post R16 AMR releases, up to and including R17 AMR.
3 Executive summary
We carried out high availability, disaster recovery and scalability testing on the Temenos Core Banking
system. The tested architecture is n-tier clustered with manual failover to DR. It comprises four layers:
- Web
- Message
- Application
- Data
Cluster load balancing was found to function well during all failure scenarios.
The numbers in the table are totals. For example, in the Host reboot entry, the 11 JMeter errors are the total
errors caused by all three Virtual Machine (VM) reboots.
The infrastructure behaved as expected under all online failure scenarios. These tests involve the most
demanding and non-graceful events, and almost always result in the loss of en route transactions.
HA test description | Test details | JMeter errors | DB missing errors | Average downtime (seconds) | % Availability (assumes 100 events per year)

Web layer
Deployment Manager and Node Agents failures | Kill 1 Deployment Manager (DM) process | 0 | 0 | 0 | 100%
Deployment Manager and Node Agents failures | Kill and restart, one by one, all Node Agents (NA) | 0 | 0 | 0 | 100%

App layer
Application server failure | Kill all Application Server (AS) processes one by one | 1 | 1 | 7 | 99.9999%
Application server failure | Graceful shutdown of AS | 0 | 0 | 220 | 100%
Application server failure | Kill Deployment Manager (DM) | 0 | 0 | 0 | 100%
Application server failure | Kill NA and restart it, and repeat on all NAs | 0 | 0 | 0 | 100%

Data layer
Graceful DB node shutdown | Node shutdown | 45 | 45 | 210 | 99.9990%

Messaging layer
The availability percentage is calculated by assuming one hundred such events over a year and 15
transactions per second (tps) average throughput, as we measured it during the test.
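As a quick check of this calculation (our reading of the formula, stated here as an assumption: effective outage per event = JMeter errors / 15 tps, scaled to 100 events over a 365-day year), the graceful DB node shutdown row works out as follows:

echo "scale=7; (1 - (100 * 45 / 15) / (365 * 24 * 3600)) * 100" | bc
# prints 99.99905, consistent with the 99.9990% quoted for that test in the table above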
COB was started in servlet mode within WebSphere, with two COB agents (tSA) per server, resulting in 6
worker threads in total. The service management in the Temenos Core Banking is transactional and extremely
flexible by design. As expected, all the tests finished without any errors.
In all cases, the uncommitted transaction blocks that resulted from the failure scenarios were rolled back in the DB, and execution resumed using whatever resources were still available.
HA test description | Test details | COB duration (minutes)
COB under failure condition | Kill all AS, DM, NA and IHS one by one, but only when the previous kill has been restored. | 31
COB under failure condition | Shut down one of the RAC nodes and bring it back up. | 31
The COB times are to be compared to the baseline tests with 6 tSAs, which took 30 minutes. The duration differences were not significant.
Downtime in the table is the period when a service is unavailable. Downtime includes an additional voluntary delay (about 2 minutes) between killing the DB and triggering the failover. The test results were as expected, with failed transactions occurring only for the duration of the system downtime.
In the following two tables, we list the availability percentage, which is calculated by assuming 100 such
events over the period of a year.
The Recovery Time Objective (RTO) is the duration of time and a service level within which a business
process must be restored after a disaster in order to avoid unacceptable consequences associated with a break
in continuity. The RTO is listed in the following two tables and it is essentially the time between the first and
last errors in JMeter.
DR test description | Test details | Downtime: DB failover | Downtime: Load balancer switchover | RTO | % Availability

DR test description | Test details | Downtime: DB failover | Downtime: Load balancer failover | RTO | % Availability
Online traffic on live site | Shutdown live DB and execute a manual failover. | 2 mins 22 secs | ~0 secs | ~242 secs | 99.92%
DR test description | Test details | Duration: COB | Duration: DB failover
COB on Live Site | Start COB, shutdown Live DB and execute a manual failover. | 31 mins | 1 min 47 secs

DR test description | Test details | Duration: COB | Duration: DB switchover
3.5 Scalability testing
The scalability tests checked the elasticity of the infrastructure by adding a new Web node and a new Application (App) node. Both tests were successful: as expected, each new node joined the cluster and received traffic according to the load-balancing rules.
SC test description | Test details | Results
Online traffic on live site | Add a new Web layer node to the existing cluster. | Traffic distributed to the new Web Application Server (AS) as part of the round-robin load balancing process.
Online traffic on live site | Add a new App layer node to the existing cluster. | New App node received traffic generated from the Web layer nodes.
4 Solution deployment
4.1 Solution description
Our High Availability solution is a 4-tier architecture, comprising:
- A Web layer.
- An App layer.
- A Messaging layer.
- A Data layer.
On the Live site, both the App layer and the Web layer are in two separate clusters, each containing three application servers (AS). Two of the Web layer nodes also have IBM HTTP Server (IHS) configured to forward the HTTP requests across all three Web layer nodes in a round-robin fashion. Session replication is enabled so that messages are not lost in case one of the nodes fails.
An external Apache Load Balancer (LB) is configured on a separate physical machine to forward incoming
http requests to both IHS instances. Traffic received:
The switch between sites was simulated on the level of the JMeter host, by changing the mapping of the IP in
the “hosts” file.
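As an illustration only (the actual host names and IP addresses are not given in this report), the relevant entry in the JMeter host's "hosts" file would look something like the following, with the live entry replaced by the DR entry at the moment of the switch:

# Placeholders only - names and addresses are assumptions, not taken from the test environment
10.0.1.10    t24-lb.example.local     # Apache LB, live site (active entry)
# 10.0.2.10  t24-lb.example.local     # Apache LB, DR site (swapped in to divert traffic)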
The messaging layer contains two physical hosts, each with a Queue Manager configured in active/standby mode, so that when the active one fails the standby automatically becomes active. The failed QM then needs to be restarted so that it becomes the standby.
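For reference, the following is a minimal sketch of how such a multi-instance (active/standby) Queue Manager pair is brought up and checked. The commands match those used in the transcripts later in this report; the per-host ordering shown is an assumption.

strmqm -x QM1        # on the first messaging host: starts the active instance
strmqm -x QM1        # on the second messaging host: starts the standby instance
dspmq -m QM1         # on either host: shows Running, Running as standby, or Running elsewhere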
The same set up also applies to the DR site, except that the Web and App layers have two instead of three
nodes in their respective clusters.
The Data layer of the DR site is serviced by an Oracle RAC database (DB) with two nodes. This DB is connected to the Live site through Oracle's Data Guard technology, and an infrastructure database has been added to host the schema required by Oracle technology. The RAC database contains the Temenos Core Banking schema only. The database on the DR site is kept in sync with the live site database using Oracle's Data Guard technology.
Deployment Manager (DM) This is the administrative process used to provide a centralized
management view and control for all elements in a WebSphere
Application Server distributed cell, including the management of
clusters.
In our infrastructure, we had two DMs setup for live and two for DR. In
both sites we had one DM for the Web Layer cluster and one for the
Application Layer (App) cluster. Both of these DMs were setup in the
primary server of each cluster (i.e. Web1 and App1).
Application Server (AS) The AS is the primary component of WebSphere. The server runs a Java
Virtual Machine (JVM), providing the runtime environment for the
Temenos code. In essence, the AS provides containers that specialize in
enabling the execution of Temenos libraries and components.
In the tested deployment, we had one AS per node and one node per
host server.
Web layer hosts Referred to as Web followed by an increment: Web1, Web2, and Web3.
Application layer hosts Referred to as App followed by an increment: App1, App2 and App3.
Data layer hosts Referred to as DB followed by an increment: DB1, DB2 and DB3.
For ease of reference, we may refer to a WebSphere process followed by the host increment. For example,
AS1 would mean AS on host #1. The DR site follows the same conventions as above.
4.2 Architecture diagram
The following processes were running in the live site as part of our test series.
MQ1 QM active
MQ2 QM passive
The following processes were running in the DR site as part of our test series.
MQ1 QM active
MQ2 QM passive
4.3 HA design considerations
Architecture Design consideration
External load balancer As the recommended highly available hardware load balancer was not available, two Apache servers, for the Live and DR sites, were configured on two separate physical hosts.
For the online traffic to reach either the live site or the DR site, a Windows VM hosts file needed to be updated with the desired URL/IP of the machine where the Apache server was installed.
Web layer Temenos BrowserWeb was deployed on all three Web Layer cluster members. Online requests coming from the load balancer were distributed across all the Web Layer nodes using the IHS HTTP load balancer. Session replication was enabled in IHS and no additional modifications needed to be made to the BrowserWeb parameters file.
App layer The Temenos Core Banking (T24) and Application Framework (TAFJ) libraries were installed on all 3 App layer cluster members.
We placed both T24 and TAFJ runtime libraries on each cluster member, as opposed to on separate shared storage. Two IP addresses are provided for the 2 active/standby Queue Managers.
4.4 DR design considerations
The live and DR sites are configured in Active/Standby mode.
An infrastructure database has been added to host the schema required by Oracle technology. The RAC database contains the Temenos Core Banking schema only. The database on the DR site is kept in sync with the live site database using Oracle's Data Guard technology.
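The report does not record the exact commands used to move the database role between the sites. Purely as a hedged sketch, a Data Guard broker (DGMGRL) session of roughly the following shape would achieve it; the connect string and database names are placeholders, not taken from this environment:

su - oracle
dgmgrl sys@t24db
DGMGRL> SHOW CONFIGURATION
DGMGRL> SWITCHOVER TO 't24db_dr'
DGMGRL> FAILOVER TO 't24db_dr'

SWITCHOVER is the planned variant relevant to the switchover tests; FAILOVER applies after the live database has been aborted, as in the failover tests later in this report.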
4.5 Software used
Temenos
Software Version
T24 R17
TAFJ R17_SP2
IBM
Software Version
WebSphere 8.5.5.13
IBM MQ 8.0.0.7
Apache
Software Version
Oracle
Software Version
5 Testing approach
The reference architecture exercise focused on the following three areas, each with a set of specific test cases to be carried out:
- Availability testing.
- Disaster recovery testing.
- Scalability testing.
5.1 Test data
We used custom JMeter scripts to commit transactions through the Web User Interface (UI) of Temenos Core Banking. The total number of script cycles was 1000.
5.2 Tools
JMeter and Nmon
The requests are sent to the Apache load balancer, which forwards them to the IHS instances on the two Web Layer nodes. Fifty users are configured: five JMeter instances each execute 10 concurrent threads, for 10 users per instance. A sample file feeds 1000 JMeter script cycles for processing.
Each user iterates 20 times, which means that the 50 users execute the 1000 loops, drawing data from the sample file.
Each thread executes sessions that carry out the following tasks:
2. Login.
3. Create customer
5. Open till.
10. Logoff.
A constant throughput timer is used to limit throughput to 5 Transactions per Second (TPS). Each JMeter testing cycle was executed in its own session. JMeter has been configured to stop the session if a failure occurs during execution and to start a new one.
The JMeter scripts have robust response assertions. In addition, at the end of every test run, the following
JQL scripts are executed against the database to count the total number of records that have been inserted:
The COUNT queries are designed to return all the records that were inserted by the
JMeter scripts (and only these records).
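The scripts themselves are not reproduced here. Purely as an illustration of the shape of these checks (shown here as SQL rather than jQL, with placeholder table names and selection criteria), they amount to counts of the form:

sqlplus -s t24/******@t24db <<'EOF'
-- Placeholders only: the actual jQL/SQL checks, table names and selection criteria differ
SELECT COUNT(*) FROM customer_records WHERE inputter LIKE '%JMETER%';
SELECT COUNT(*) FROM account_records  WHERE inputter LIKE '%JMETER%';
SELECT COUNT(*) FROM teller_records   WHERE inputter LIKE '%JMETER%';
EOF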
Every JMeter thread executes its requests sequentially. If the login page fails during a failure test, then all subsequent transactions in that session will also fail. To avoid registering these additional errors, which are a consequence of the previous step, JMeter has been configured to stop the session and start a new one. Errors logged by JMeter therefore represent errors caused by the failure test.
Errors reported by JMeter reflect what a real end user would see.
Test execution
IHS kill
date;ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}'
date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i;
do kill -9 "$x";done
apachectl -k graceful
date; ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print $2,$(NF-1), $(NF)}'
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
DB Node shutdown
While traffic is running, the following command is used to shut down the DB service.
su - oracle
sqlplus / as sysdba
shutdown transactional;
Graceful shutdown of AS
While traffic is running, use WebSphere console to gracefully shut down and then start the application
servers one at a time.
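The graceful shutdown in this test was driven from the WebSphere admin console. For reference, an equivalent command-line sketch, using the profile path and server names that appear in the transcripts elsewhere in this report, would be:

# Sketch only - graceful stop and restart of one application server from the command line
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stopServer.sh AppSrv01
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/startServer.sh AppSrv01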
Restart of VMs
While traffic is running, use the reboot command to restart the relevant box.
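In transcript form, and assuming no special flags were used (the report only states that the reboot command was used), this amounts to:

date; reboot    # capture the time, then reboot the VM under test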
b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.
3. Update the Windows host file to DR load balancer and then disable routing to live site load balancer.
1. Start COB on the LIVE site and wait until it gets to the Application stage.
a. If switchover test: stop COB on live site first and then execute switchover.
b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.
6 Baseline tests
Running baseline tests involved:
3. Creating a DB restore point which could be used for other COB tests.
As the table shows, after 5000 JMeter script cycles, COB took nearly two minutes longer to complete. There
were no errors recorded.
- Traffic will be routed to the remaining nodes until failed AS has fully restarted.
- Traffic will be balanced between all nodes again once failed AS has fully recovered.
KillAppAS.processes
APP PR1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:22:13 GMT 2018
8388872 eg15ph09Node01 AppSrv01
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:24:38 GMT 2018
Process back up at: 10:24:41
=================== APP PR2
[eg15ph10:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:27:48 GMT 2018
5374440 eg15ph10Node01 AppSrv02
[eg15ph10:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:27:56 GMT 2018
Process back up at: 10:28:01
================================== APP PR3
[eg15ph11:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:29:42 GMT 2018
8913256 eg15ph11Node01 AppSrv03
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:29:47 GMT 2018
Process back up at: 10:29:48
Kill AS process Time Recovery time JMeter errors Adjusted errors Downtime (Approx. secs)
The image below shows a series of JMeter errors. Killing the AS in App1 server caused the Customer error,
which in turn triggered the Account and Cash deposit errors. The killing of AS in App2 and App3 servers
produced no errors.
The errors after 9:27 are due to the restart of the AS in the Web Servers, which is discussed separately in 8.1
Kill of AS process Web layer.
Because one customer failed, the missing records comprise two accounts (local and foreign) and two cash deposits (local and foreign). That means the adjusted total is 1.
At 10:24 the AS process was killed. This led to low CPU activity. The process was automatically restarted,
hence the increased CPU activity, due to initialisation.
After the killing of the AS process, network activity slows down to almost zero. Immediately after the restart
of the process, there is increased network activity due to message inflow and the initialisation of queues.
- Traffic will be routed to the remaining nodes until stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once stopped AS has fully recovered.
GracefulShutDownAppLayerAS.txt
Because this was a graceful kill, the app server had to wait for any transactions to be processed before it shut
down, which is why no errors were recorded.
The graceful kill occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being properly ramped up.
A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.
The system resources usage is similar to that shown in 7.1.2 System resources usage. The only difference is
that due to the graceful shutdown, there is no automatic restarting of AS, which means there's time for the
activity to settle down to near-zero levels.
- Cluster should recover to a normal state once the rebooted host has recovered.
App layer hosts Process affected JMeter errors Adjusted errors Time
The server reboot of App Layer nodes while injecting transactions through JMeter resulted in some loss of data.
The image below shows the three events of the application servers’ reboot. The disturbance at the beginning
of the test is due to the sudden activity from multiple threads.
We corrected the stress method in subsequent tests by having a two-minute ramp-up period where
WebSphere was allowed some time to initialise its connection pools.
Because 3 customers failed, we would expect 6 accounts and 6 cash deposits records to be missing. Instead,
there were an additional 2 account and 2 cash deposits records missing. The 2 missing cash deposits records
are explained by the 2 missing accounts. Therefore, the adjusted total is 3+2+0=5.
We have captured uninterrupted resource utilisation diagrams that encompass the VM restart event.
Although impact was obvious in other diagrams as well, we only list the 2 most interesting ones: Disk and
network activity.
Application Server 1
The actual restart was executed at 15:48:51 and finished at 15:48:52. The server was again operationally ready
at 15:51:30. The initialisation of the various resources of the VM can be clearly seen by this big spike in the
disk activity.
The opposite behaviour is observed as expected in network activity. While the VM is restarting or initialising
resources, communications will decrease; after that we expect an increase higher than before the restart, as
the listeners try to consume all remaining messages that did not timeout.
Application Server 2
The restart was executed at 15:52:54 and finished at 15:52:55. The server was again operationally ready at
15:55:51. The behaviour is similar to Application server 1.
Application Server 3
The restart was executed at 15:56:26 and finished at 15:56:27. The Server was again operationally ready at
15:59:25. The behaviour is similar to Application Server 1 and 2.
- Traffic will be routed to the remaining nodes until failed AS has fully restarted.
- Traffic will be balanced between all nodes again once failed AS has fully recovered.
KillWebAS.processes
There were some session timeout errors recorded in JMeter as the Web Layer AS process went down.
The highlighted failed transaction spikes represent the two failed customer creations, one of which is in the
screenshot above. These were due to the killed AS in Web1. The last 2 spikes are due to the killed AS in Web2
and Web3 servers.
From the above table, since 2 customers failed, we expected that to be followed by 4 accounts and 4 cash
deposits missing records. Instead, there was an additional 1 cash deposit missing. That means the adjusted
total is 2+0+1=3.
Restarting the Web Server did not have an impact on most resource utilisation diagrams. The CPU disturbance was of the same order of magnitude as other random events seen during normal operation. The reason is that the servers are sized to handle much bigger traffic loads, so the CPU can easily handle all tasks. Network utilisation is not helpful either, because the calls keep coming to the VM whether the Web Server is up or not. There are peaks in the diagram, but not significant enough to indicate an extraordinary event.
The disk diagrams, on the other hand, are a different case: there we can see the restart of the AS process quite clearly. Following are the disk diagrams, along with CPU and network, as an interesting comparison between Web and Application Server behaviour. See also 7 Application layer HA Tests.
Web Server 1
Web Server 2
Web Server 3
- Traffic will be routed to the remaining nodes until stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once stopped AS has fully recovered.
GracefulShutDownWebLayerAS.txt
Kill AS process Time Recovery time JMeter errors Adjusted errors Downtime (Approx. mins)
A graceful shutdown of the Web Layer AS process did not cause any loss of transactions. The screenshot below shows a healthy JMeter results tree for the duration of the test.
The graceful kill occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being properly ramped up.
A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.
The system resources usage here is similar to that shown in 8.1.3 System resources usage. The only
difference is that due to the graceful shutdown, there is no automatic restarting of AS, and thus time for the
activity to settle down to near-zero levels.
The network is not affected, unlike the Application Server shutdown, due to the continuous requests coming
from the injector. In the case of Application Server, the messages are pulled from MQ, not pushed by an
injector.
Note the low CPU utilisation (5-10%) because of the over-sized capacity of the Web Server VMs.
Web Server 1
Web Server 2
Web Server 3
- No traffic disturbance should occur and there would be no impact on existing application servers.
- The admin console should become available again once the DM has fully recovered.
dmkill.txt
was-dm=================================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:46:49 GMT 2018
9699660 dmgr -dmgr
10158454 eg15ph06CellManager01 dmgr
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Feb 20 14:46:53 GMT 2018
As expected, the WebSphere console was not accessible while the Deployment Manager process was down, and there was no traffic disturbance.
The Deployment Manager is not responsible for any traffic, so the network is not affected. CPU and disk activity, however, are affected, since killing and restarting the process entails initialisation of resources and an update of state from the servers in the cluster.
Web Server 1
Application Server 1
- No traffic disturbance should occur and no impact on existing ASs or the Admin server.
Kill_NA_on_App_n_Web_layer.txt
was-dm=================================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:46:49 GMT 2018
9699660 dmgr -dmgr
10158454 eg15ph06CellManager01 dmgr
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Feb 20 14:46:53 GMT 2018
=========app pr1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:50:16 GMT 2018
9765202 eg15ph09Node01 nodeagent
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:50:27 GMT 2018
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:51:24 GMT 2018
[eg15ph09:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:51:51 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
============app pr 3==========================
date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:15 GMT 2018
9240874 eg15ph11Node01 nodeagent
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:56:32 GMT 2018
[eg15ph11:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:38 GMT 2018
[eg15ph11:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:57:50 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 6226370
=was pr 1====================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:34 GMT 2018
5898516 eg15ph06Node01 nodeagent
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:00:36 GMT 2018
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:44 GMT 2018
[eg15ph06:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:01:55 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 11272492
exit code: 0
was pr 2===================
[eg15ph07:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
=======waspr 3============
[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:05:23 GMT 2018
10027350 eg15ph08Node01 nodeagent
[eg15ph08:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:05:52 GMT 2018
[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:06:00 GMT 2018
[eg15ph08:root] / #
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
[eg15ph08:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:07:29 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 8978700
exit code: 0
[eg15ph08:root] / #
Node Agent | Kill time | JMeter errors | Adjusted errors
AppNA1 | 14:50:27 | 0 | 0
AppNA2 | 14:54:16 | 0 | 0
AppNA3 | 14:56:32 | 0 | 0
WebNA1 | 15:00:36 | 0 | 0
WebNA2 | 15:02:42 | 0 | 0
WebNA3 | 15:05:52 | 0 | 0
Application Server 1
The disk utilisation diagram for Application Server 1 is significantly different from Servers 2 and 3. This is because Server 1 is also running the Deployment Manager.
Application Server 2
Notice the difference in the disk utilisation profile in comparison to Application Server 1.
Application Server 3
Web Server 1
The disk utilisation diagram for Web Server 1 is significantly different from Servers 2 and 3. This is because
Server 1 is running the Deployment Manager as well.
Web Server 2
Notice the difference in the disk utilisation profile, in comparison to Web Server 1.
Web Server 3
With IHS configured to restart itself upon failure and session replication enabled in IHS, JMeter was started
and the following behaviour was expected:
- Traffic will be routed to the remaining IHS node until failed IHS has fully recovered.
- Traffic will be balanced between the two IHS nodes once failed IHS has fully recovered.
kill-shutdown-webserver1.txt
14:23:51 7 1
kill-shutdown-webserver2.txt
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
14:30:53 0 0
The expected result was no loss of transactions, but we saw a proxy error during a customer creation, which resulted in the loss of one customer and its corresponding accounts and cash deposits downstream.
As a result of one customer not being created, there was a domino effect of errors on the remaining transactions that depended on that missing customer.
The sizing is such that the Web Servers are never fully stressed. That does not allow us to draw conclusions on CPU and disk utilisation when affecting such lightweight components as IHS. The network utilisation diagram is the one that gives the most accurate picture.
IHS1
The failure event takes place in the circled portion of the above diagram. It quite clearly shows how traffic drops to zero and then resumes to normal. Please note that the difference in network activity between IHS 1 and IHS 2 is due to the fact that they are housed in the VMs of Web Server 1 and 2 respectively. This means that along with IHS activity we also have NA activity, and additionally DM activity for Server 1.
IHS2
Similar to IHS 1 above, the circled period shows the drop in network activity. After the restart, normal
operation resumes.
- Traffic would be routed to the remaining IHS until the failed host was back.
- Traffic would be balanced between the two IHS instances once the failed IHS had fully recovered.
graceful-shutdown-webserver1.txt
IHS1 14:34:09
graceful-shutdown-webserver2.txt
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 6947098 9634156 0 14:36:23 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 7602520 1 0 14:07:58 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
t24user 8323366 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 8716738 9568694 0 14:36:37 pts/0 0:00 grep http
t24user 9634156 1 0 14:32:29 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph07:t24user] /bin $
IHS2 14:36:23
When IHS was gracefully shut down, no data loss was experienced. The initial disturbance at the beginning of the test is, as in a few other tests, the result of not ramping up the injected load from JMeter.
This test does not show any difference in workload, as the IHS is shut down gracefully and is thus able to rebalance the traffic seamlessly to the remaining IHS. The fact that both IHS instances are housed on the same VMs as the Web Servers means that any slight difference in CPU and disk activity is masked by the other processes. The network activity is not impacted because the messages are rerouted to IHS 2 and are then balanced between the Web Servers normally.
The following diagrams demonstrate this. In the same way as 8.5 Shutdown and start of IHS processes on
Web layer, the DM in Web Server 1 masks the activity based purely on message exchange. Network activity
between Web Server 2 and 3 is much more similar.
Web Server 1
Web Server 2
Web Server 3
- Some disturbance.
- Cluster should recover to a normal state after each rebooted host has recovered.
Web layer hosts Process affected JMeter errors JMeter adjusted Time
The server reboot of Web Layer nodes while injecting transactions through JMeter resulted in proxy errors being reported and some loss of data.
In the following image we can see two purple spikes that represent two errors, followed by a third one after a while. These were caused by the restart of the Web2 node. Notice that the third error was not caused by the actual restart but by one of the two previously failed transactions.
Given that 3 customers failed, we expect that to be followed by 6 accounts and 6 cash deposits missing
records. Instead, there were an additional 2 account and 2 cash deposits records missing. The 2 missing cash
deposits records are explained by the 2 missing accounts. Therefore, the adjusted total is 3+2+0=5.
This test was carried out in conjunction with the restart of the App layer VM nodes, as mentioned in 8.7 Restart of Web layer VM Nodes. During the test, the behaviour was as expected for all servers that were restarted.
Just before the restart of the last Web Server, all the transactions had been pumped through and the test was thus de facto over. We decided not to repeat the test just to see Web Server 3 restarting, as we had already proven the resilience of the system. Following are the most relevant resource utilisation diagrams for Web Servers 1 and 2.
Web Server 1
Web Server 2
The CPU utilisation of Web Servers 2 and 3 was typically half that of Web Server 1. This seemed to be the case for all our tests. The reason is that Web Server 1 is the server running the DM on behalf of the cluster.
The network activity observed after the restart is not typical; we would expect it to bounce right back up to the previous levels. The reason is that the test was drawing to an end: the transactions were already ramping down and the parallel sessions from JMeter were slowly reaching zero, according to the test rules.
Downtime is the time taken by the standby to become active. For ease of reference, QM on host 1 will be QM
1 and QM on host 2, QM 2.
killing-QM1.txt
6 2 9
killing-QM2.txt
===========MQ1 restart=====================================
[eg15ph04:root] / #strmqm -x QM1
[eg15ph04:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running as standby)
[eg15ph04:root] /
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #
========Kill MQ 2=====================
[eg15ph05:mqm] /mqm # ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print
$2,$(NF-1), $(NF)}'
7930338 -u mqm
[eg15ph05:mqm] /mqm #
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running elsewhere)
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #
44 28 26
rebooting-VM-(QM1).txt
2 1 31
We experienced data loss after killing the Queue Manager. This would have happened during the switchover to the standby Queue Manager. Below is a sample of the errors as seen in JMeter.
The following tables show the data loss for the different transactions done through JMeter.
We observed that failover from QM on host 2 (QM 2) to QM on host 1 (QM 1) took longer and produced
more errors than failover from QM 1 to QM 2.
Failover from QM 1 to QM 2
Failover from QM 2 to QM 1
Restart VM 1 (host 1)
This diagram is indicative of the disturbance in the transactions. At the time of the QM 1 failure we have two
transactions failing, which happened to be en route from QM 1. Afterwards we see ripple effects because it so
happened that one of the failing transactions was a Customer creation.
As there is no logic in the JMeter script to run transactions conditionally, all subsequent Account and Cash Deposit transactions for this missing customer fail because of the Temenos Core Banking application logic.
Time of QM 1 kill is 11:04:07. The switch from the primary QM to the Stand-by QM can be seen in the
following two network utilisation diagrams from QM 1 and QM 2.
QM 1
The red line indicates the time of failure of QM 1. As expected all network activity drops to near zero.
QM 2
Immediately after the QM 1 failure, QM 2 picked up the load. The Web Servers were not disrupted for long: Web1 shows a hiccup at the time of the event, and Web2 and Web3 show similar behaviour.
Web1
Similar to Web Servers, App servers have a small disruption at the time of the event.
App2
The effects are delayed a bit as we move lower in the architecture. Any impact on the DB is mostly lost in the runtime turbulence.
DB1
Reverting to the first QM has a similar effect. Now QM 2 is killed so that QM 1 will take over.
QM 2
QM 1
Immediately after QM 2 goes down, QM 1 takes over again. The Web Servers show the usual disruption at the time of the event.
Web1
We see similar network utilisation diagrams for Web2 and 3. Application Servers have a similar disruption as
with Web Servers. For example, observe App1:
This time the DB servers are impacted by the QM failovers. In the previous test, when QM 1 failed over to QM 2, there was little or no disruption. In addition, we do not see any time shifting of the disruption as we move lower in the architecture.
DB1
- The load was not high enough, as the DB servers were sized such that each one alone was more than capable of handling one JMeter instance.
- The messages from a JMeter thread tended to be routed to the same DB node, as long as there was capacity to serve them.
To overcome this effect and to prove the rebalancing of the DB nodes, we used 5 instances of JMeter with 10
threads each. Three of them started from the beginning and an additional instance would be started after the
restart of each of the two DB nodes. Thus, the test script was the following:
Shutdown transactional
4. Start DB 1.
Shutdown transactional
7. Start DB 2.
- Inflight transactions that have been started on the restarted node are expected to fail.
- Once the node has fully started, traffic will be balanced between the three nodes.
DB node Stop time Command JMeter errors Adjusted errors Start time Command
The DB nodes were removed from service with a small disturbance and, when started again, successfully contributed to the cluster once more. The total number of transactions missing from the DB was 45 across both shutdowns.
The following table summarises the effect this test had in the application data.
Given that 8 customers failed, we expect that to be followed by 16 accounts. Since we had 21 missing
accounts, that means that we had an additional 21-16=5 failed transactions. Because 21 accounts are missing,
we expect this to trigger 21 missing deposits as well. Therefore, 53-21=32 missing deposits cannot be
explained. So the actual failed transactions due to the test were 8+5+32=45 and all the rest were due to the
application logic and the JMeter test script design.
In this view we can see all 3 DB nodes actively serving the application.
Here we can see that DB1 is contributing to the cluster equally with DB 3. DB 2 is missing, but this does not stop the service.
During the test, the DB behaved well and balanced the load correctly. The DB capacity was quite large, so the stress never reached high levels.
DB 1
This is the full screenshot of the Enterprise Manager (EM) Console for DB 1 load. This is taken after the
restart of DB 1 (9:01 - 9:04). Observe that the node starts picking up load.
At 9:06 DB 2 goes down with no adverse effects on the DB 1 node. After DB 2 comes back up again at 9:10, operation is normal. The increase that takes place at around 9:14 is purely operational and not a consequence of the DB 2 restart.
DB 2
This screenshot is from DB 2, after its restart (9:06-9:10). The utilisation diagram is similar to that of DB 1, leaving aside the missing load data from before the restart.
11 COB HA tests
Using the baseline database, failure tests were performed during the COB. The test involved:
- Killing Application layer AS nodes, the Admin server, Node Agents and IHS one by one, but only when the previous kill has been restored.
Expected result:
- COB is expected to take longer when app servers in the app layer or DB nodes are killed.
- These failures should not stop COB from progressing and finishing successfully.
- When COB is finished, there should be no difference in database performance compared to the baseline test.
cob_under_failure_condition.txt
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:19 GMT 2018
9503156 eg15ph09Node01 nodeagent
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:49:27 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:31 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:39 GMT 2018
[eg15ph09:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9765370
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:50:16 GMT 2018
9765370 eg15ph09Node01 nodeagent
------------
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:07 GMT 2018
6750702 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:51:15 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:18 GMT 2018
10879470 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $
=================
APPPR2:
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:13 GMT 2018
8716644 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:52:19 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:23 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:24 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:26 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:50 GMT 2018
[eg15ph10:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 10354998
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:27 GMT 2018
10354998 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $
----------
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:54 GMT 2018
9634140 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:54:00 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:04 GMT 2018
6291928 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $
=====================
APPPR3:
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:55 GMT 2018
5439912 eg15ph11Node01 nodeagent
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:55:01 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:12 GMT 2018
[eg15ph11:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9503016
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:51 GMT 2018
9503016 eg15ph11Node01 nodeagent
----------------------------------
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:00 GMT 2018
7864616 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:56:08 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:11 GMT 2018
7864618 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $
=================================
DBPR1 : ~ couple of minutes after APPPR3.
[LIVE-DB3 root@eg15ph18 /]# su - oracle
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
=======================
DBPR2 : ~8:35 GMT
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
===================================
DBPR3: ~8:35 GMT
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options
1
----------
1
SQL> startup;
ORACLE instance started.
1
----------
1
SQL>
In the above table we show the time at which each process was killed and the time at which it was back up and running. COB is driven by threads running inside WebSphere at the application layer.
When the Application Server process is killed and automatically restarted, the Temenos Service Manager
(TSM) needs to be manually restarted as well, or else COB will have no workers to make progress. That is
captured in the table above by signifying the time TSM was started (i.e. by “TSM @ HH:MM:SS”).
11:26:24 11:57:52 31 mins 28 secs 31 mins 2 secs Times extracted from the como file.
All Core Banking services, and indeed COB, are very resilient and flexible. Agents can be started automatically or on demand with no disturbance to the process.
As expected, COB finished successfully with no errors and no differences in the record count, although it took slightly longer because of the time lost restarting the agents.
During this test the load was not coming from the Web Server / MQ; it was generated by tSA agents running inside the WebSphere JVM process. As a consequence, the DM and NA do not have any impact on the operation of the COB.
Failure of AppDM
The COB is not affected at all, which means that CPU, memory and network are not affected either. For example, note in the following diagram the CPU activity for the duration of the DM failure:
There is a spike, but this is hardly related to an extraordinary event. Rather, it is part of normal ups and
downs in COB activity, due to changing from CPU intensive to DB intensive tasks.
The only diagram that signifies an exceptional event is the disk activity. COB is not very intensive on the application server's disk, so re-initialising the DM produces disk activity that stands out:
Similarly to the DM failure, an NA failure does not affect COB in the slightest. Moreover, disk activity is not significant either, because the NA is by nature not a heavy process. The only impact on the system is the somewhat increased process switching during the failure. The following diagram shows the process switches diagram for App 1.
The blue circle shows the NA failure. The left spike (orange circle) is due to the DM failure. The right spike (orange circle) is due to the AS failure that will be discussed in Failure of AppAS 1, 2 and 3.
The blue circle shows the NA failure. The orange one is due to the AS failure that will be discussed in Failure of AppAS 1, 2 and 3. Finally, we can now easily recognise the failure of the NA in App 3:
The blue circle shows the NA failure. The orange one, as usual, marks the AS failure.
Killing the AS kills the JVM, and with it all of its threads, which means that all COB services die unexpectedly. This directly affects the execution of COB, which is interrupted on the affected server, and all uncommitted transaction blocks resulting from the unexpected termination of those COB services are rolled back in the DB. The following graphs show the resource impact on App 1:
The circled area is the part of the diagram that shows the AS failure. We can see that the Temenos Core Banking services die and, after some brief CPU activity due to the automatic restart and initialisation of the AS, the server stops contributing to the COB service. Note that COB carries on normally because of the agents running on the rest of the servers in the cluster. After the services manager (TSM) is restarted, we do not observe immediate activity; this is because TSM takes some time to bring the required COB agents back up.
The effects on the DB are not severe, since in each failure scenario only 1/3 of the agents are affected. CPU is not affected much because the remaining COB agents continue to compete for faster execution. For example, the following is the CPU diagram of DB 1; the picture is similar on DB 2 and DB 3. Note the insignificant CPU fluctuations during the three AS failure events:
Network is much more affected, since all transaction inflow from 1 of the 3 servers is interrupted. All three
DB servers have similar network utilisation diagrams:
In all three DB servers there is an impact on network traffic, but nothing significant regarding CPU and
memory.
1. DB 1 goes down and stays down for the duration of the COB.
We did not expect to see COB finish earlier than in the baseline test. When COB is executed, the ordering of the COB jobs is determined at the start of COB. This means that when running COB several times, even with an identically restored DB under similar conditions, the duration is expected to vary by a few minutes.
The difference of 2 mins for test #2 can be easily explained by this variation in the COB duration. The
difference of 4 mins for test #1 is more significant. Since during this test one DB node was down for most of
the duration of the test, some of the time difference could be attributed to the decreased cluster
communication overhead between the DB nodes. This is an assumption which could not be verified.
Both tests (shutdown and restart) were successful. The resource utilisation for the active DB nodes does not
change noticeably when the targeted node goes offline.
Both the CPU and network diagrams show the failure event clearly. The CPU utilisation of DB 1 is shown in the following graph:
The red vertical line marks the time DB 2 goes down. DB 2, on the other hand, loses most of its CPU activity after the event:
We don't need to show the CPU utilisation diagram for all the Application servers - all of them are similar to
the following (App 1):
In all diagrams, both on the DB and the App side, we notice a dip in activity about two minutes after DB 2 goes down. This is probably due to the rolling back of the transactions submitted to DB 2, which temporarily blocks the normal progress of COB.
Both the CPU and network diagrams show the failure event clearly. The CPU utilisation of DB 1 is shown in the following graph:
The red line marks the time DB 2 goes offline and the green line the time it comes back online. DB 2, on the other hand, loses most of its CPU activity after the event, until the restart:
There is no reason to show the CPU utilisation diagram of all the application servers; all of them are similar to the following (App 1):
The peak in activity at around 13:38 occurs too long after the shutdown of DB 2 to have been caused by it. Most probably it is purely operational (for example, a multithreaded, CPU-intensive job). This fluctuation is not reflected on the DB nodes.
• Execute the switchover.
• Online traffic is diverted by updating the "hosts" file on the JMeter server to point to the DR load balancer (a minimal sketch of this redirection follows below).
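The hosts-file change is the only step here that lends itself to simple automation. The following is a minimal Python sketch of how the redirection could be scripted on the JMeter server; the application hostname, the DR load balancer IP and the hosts file path are placeholders, not values from the test environment.

```python
"""Minimal sketch, assuming a Unix-like JMeter host: repoint the application
hostname at the DR load balancer by rewriting the hosts file."""

HOSTS_FILE = "/etc/hosts"                  # assumed hosts file location
APP_HOSTNAME = "core-banking.example.com"  # hypothetical application FQDN
DR_LB_IP = "10.1.2.10"                     # hypothetical DR load balancer IP

def point_app_to_dr():
    with open(HOSTS_FILE) as f:
        lines = f.readlines()
    # Drop any existing mapping for the application hostname (ignoring comments),
    # then append an entry that resolves it to the DR load balancer.
    kept = [line for line in lines if APP_HOSTNAME not in line.split("#")[0]]
    kept.append(f"{DR_LB_IP}\t{APP_HOSTNAME}\n")
    with open(HOSTS_FILE, "w") as f:
        f.writelines(kept)

if __name__ == "__main__":
    point_app_to_dr()  # requires privileges to write the hosts file
```

Reverting to the live site is the same operation with the live load balancer IP in place of the DR one.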
The JMeter results are not important in this particular test, as the errors depended only on how long we waited during the manual switchover. All transactions before and after the switchover were successful.
All failures took place during the switchover, as shown by the JMeter TPS diagram.
The exact number of missing records is not important in this test. The longer we chose to wait before the manual switchover, the longer the online traffic would keep failing.
Note that the adjusted total in this case should be about the same as the total: although there is a business reason why some transactions would fail because of a missing customer, in this case we know that communications were down, so all the failures were caused by that.
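As an illustration of how the raw JMeter output supports these observations, the sketch below counts the failed samples and how many of them fall inside the switchover window. It assumes the results were saved in JMeter's default CSV (.jtl) format with a header row; the file name and window timestamps are placeholders.

```python
import csv
from datetime import datetime

RESULTS_FILE = "switchover_results.jtl"      # hypothetical CSV-format results file
WINDOW_START = datetime(2018, 5, 14, 13, 20) # placeholder: switchover start
WINDOW_END   = datetime(2018, 5, 14, 13, 26) # placeholder: switchover end

total = failures = failures_in_window = 0
with open(RESULTS_FILE, newline="") as f:
    for row in csv.DictReader(f):
        total += 1
        if row["success"].lower() != "true":  # default JMeter success column
            failures += 1
            # timeStamp is epoch milliseconds in the default JMeter CSV output
            ts = datetime.fromtimestamp(int(row["timeStamp"]) / 1000)
            if WINDOW_START <= ts <= WINDOW_END:
                failures_in_window += 1

print(f"samples: {total}, failed: {failures}, "
      f"failed during switchover window: {failures_in_window}")
```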
Since the traffic is diverted from the live site to the DR site, the resource utilisation diagrams are complementary: on the left side we have the live servers and on the right the DR ones.
For brevity, and because of the live vs DR topology differences, only the primary cluster server from each layer is compared. The red line on the left diagram marks the live site switchover and the green line on the right marks the DR switchover event.
Load balancers
Web servers
Application servers
DB servers
12.2 Site failover
Site failover test summary:
• Simulate the failure: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute the failover (a scripted version of this step is sketched after this list).
• Online traffic is diverted by updating the "hosts" file on the JMeter server to point to the DR load balancer, as in the switchover test.
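Where the failure simulation is driven from a script rather than typed at the console, the abort could look like the following sketch. It wraps the srvctl abort quoted above and adds a status check to confirm the database is reported as down before the failover is initiated; the status check is an assumption, not a documented step of the test, and the script is assumed to run where srvctl is on the path with the Oracle environment set.

```python
import subprocess
import time

DB_UNIQUE_NAME = "t24db"  # database name as used in the test summary above

def run(cmd):
    # Echo the command and print its output, so the step is visible in the test log.
    print("+", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout, result.stderr)
    return result

# Simulate the failure: abort the database.
run(["srvctl", "stop", "database", "-d", DB_UNIQUE_NAME, "-o", "abort"])

# Wait a couple of minutes, as in the test procedure, before failing over.
time.sleep(120)

# Confirm the database is reported as stopped before initiating the failover.
run(["srvctl", "status", "database", "-d", DB_UNIQUE_NAME])
```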
The JMeter results are not important in this particular test, as the errors depended only on how long we waited during the failover. The important element of this test is that the failover took place and normal operation continued with the help of the DR site.
The exact number of missing records is not important in this test. The longer we chose to wait before the failover, the longer the online traffic would keep failing.
Note that the adjusted total should be about the same as the total. Although there is a business reason why some transactions would fail if a customer were missing, in this case we know that almost all the failures were caused by communications being down.
This test is quite similar to the switchover with regard to resource utilisation. Again, as the traffic is diverted from the live site to the DR site, the resource utilisation diagrams are complementary: on the left side we have the live servers and on the right the DR servers.
For brevity, and because of the live vs DR topology differences, only the primary cluster members from each layer are compared. The red line on the left diagram marks the failover event on the live site and the green line on the right marks the failover event on the DR site.
Load balancers
Web Servers
Application servers
COB was started in servlet mode with two agents running on each of the three live site nodes. At 15.93% progress through the System Wide Stage, the switchover was initiated without stopping COB.
After the switchover finished, we raised, in servlet mode, two COB agents on each of the two application servers of the DR site. COB continued normally and completed with no errors.
The downtime for the switchover was about 6 minutes. Even though the actual switchover took approximately 3 minutes, the database needed at least 2 minutes to get into a functioning state.
When running COB, only the Application and Data layers are utilised, so that is what we concentrate on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from these two layers are compared.
Application Servers
Database Servers
13.2 Site failover
The objective of this test was to execute a failover during the COB. The expected result was that the COB
would be able to finish normally on the DR site and with no errors.
COB was started in servlet mode with two agents running on each of the three live site nodes. At 72% progress through the System Wide Stage, the failure was simulated without stopping COB. After the failover finished, we raised, in servlet mode, two COB agents on each of the two application servers of the DR site. COB continued normally and completed with no errors.
The image below shows COB resuming on the DR site after a successful failover process:
When running COB, only the Application and Data layers are utilised, so that is what we concentrate on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from these two layers are compared.
Application Servers
Both scalability tests (i.e. adding a node to the Web layer and to the App layer) were executed as part of the same test run. The test was set up with one instance of JMeter pumping transactions with 10 concurrent users, each one executing 200 full test cycles. One test cycle is a user login, creation of a customer, a foreign and a local account, and so on, up to the logout.
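The load described above (a single JMeter instance, 10 concurrent users, 200 full test cycles each) would typically be started in non-GUI mode. The following sketch shows one way to drive such an invocation from Python; the test plan file name and the two property names are assumptions about how the plan is parameterised, not artefacts from this test.

```python
import subprocess

# Assumed test plan and property names; the actual plan used in the test
# may hard-code the thread and loop counts instead of reading properties.
cmd = [
    "jmeter",
    "-n",                        # non-GUI mode
    "-t", "t24_full_cycle.jmx",  # hypothetical test plan file
    "-l", "scalability.jtl",     # results file for later analysis
    "-Jusers=10",                # 10 concurrent users
    "-Jloops=200",               # 200 full test cycles per user
]
subprocess.run(cmd, check=True)
```

The nodes were added to the running clusters at the following times: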
Node added    Time
Web 3         17:01:26
App 3         17:06:08
As we observed in previous tests, a graceful shutdown of a server does not cause any missing DB records. In this instance we do not have a shutdown but the addition of a server to the cluster; however, it involves no sudden failure, forced restart or anything else unexpected, so it falls under the same principle.
IHS was able to pick up the new node, online traffic was balanced among all three nodes and there was no impact on the existing traffic.
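One simple way to confirm that traffic really is being spread across all three web nodes is to sample a number of requests and tally which backend served each one. The sketch below assumes a hypothetical X-Backend-Server response header that identifies the serving node and a placeholder application URL; in the test itself the distribution was judged from the network utilisation diagrams instead.

```python
from collections import Counter
from urllib.request import urlopen

URL = "http://core-banking.example.com/"  # hypothetical application URL
SAMPLES = 300

counts = Counter()
for _ in range(SAMPLES):
    with urlopen(URL) as resp:
        # X-Backend-Server is a hypothetical header exposing the serving node.
        counts[resp.headers.get("X-Backend-Server", "unknown")] += 1

for node, hits in counts.most_common():
    print(f"{node}: {hits} requests")
```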
The Network activity demonstrates the load distribution and the behaviour of the cluster at the time of the
new server addition. The high peaks and low valleys in the following diagrams are part of the normal
operation during the test and they take place even before the node addition, so the reader should not
interpret them as consequences of the node addition.
This diagram shows the activity of the Web 2 node. For convenience there are two marks: the left one is the Web 3 node addition and the right one is the App 3 node addition, which is discussed in 14.2 Adding an App layer node to the existing cluster. We get a similar picture from the Web 1 network activity diagram, but for clarity it is not presented here.
The following diagram is the network activity from the Web 3 node that was actually added at 17:01:26:
The load from the web layer was uniformly distributed on all app servers in the app layer and there was no
impact on the existing traffic.
As with the test above, we present the network traffic diagram as the most revealing one. The following is the diagram for App 2, which is part of the cluster for the entire duration of the test:
As before, the high peaks and low valleys are part of the normal operation during the test and occur even before the node addition, so they should not be interpreted as consequences of the node addition.
A short while after being added to the Application layer cluster, the new node starts picking up load and contributes as an equal member.
15 Glossary
Term        Definition
Appx        App1, App2 and App3 are abbreviations for Application Server 1, 2 and 3 respectively.
ASX         AS1, AS2 and AS3 are abbreviations for the WebSphere Application Server on Server 1, 2 and 3 respectively, on either the Web layer or the App layer.
BrowserWeb  Temenos Web UI, used for accessing Temenos Core Banking.
DB          Database
DBX         DB1, DB2 and DB3 are abbreviations for Database Server 1, 2 and 3 respectively.
DR          Disaster Recovery
HA          High Availability
IHSX        IHS1 and IHS2 are abbreviations for IBM HTTP Server 1 and 2 respectively.
NAX         NA1, NA2 and NA3 are abbreviations for the WebSphere Node Agent on Server 1, 2 and 3 respectively, on either the Web layer or the App layer.
QM          Queue Manager
T24         T24 was the initial name for Temenos's Core Banking solution.
TSM         Core Banking Service Manager. This is the manager process of all tSA (see above) for a particular Application Server instance.
UI          User Interface
VM          Virtual Machine
WebX        Web1, Web2 and Web3 are abbreviations for Web Server 1, 2 and 3 respectively.