
IBM Stack 4 Reference Architecture

HADR Test Report 1.1 August 2018

Created by JumpStart



Contents

1 About this test report
1.1 Trademark
1.2 Legal
1.3 Document history
1.4 Contributors
2 Introduction
3 Executive summary
3.1 HA tests with online traffic
3.2 HA tests with COB
3.3 DR tests with online traffic
3.4 DR Tests with COB
3.5 Scalability testing
4 Solution deployment
4.1 Solution description
4.1.1 WebSphere terminology
4.1.2 Architecture naming conventions
4.2 Architecture diagram
4.2.1 Processes in test series
4.3 HA design considerations
4.4 DR design considerations
4.5 Software used
5 Testing approach
5.1 Test data
5.2 Tools
5.3 HA tests with online traffic
5.3.1 Test traffic generation
5.3.2 Test validation
JMeter error count
Test execution
5.4 HA tests with COB
5.5 DR Tests with online traffic
5.6 DR tests with COB
6 Baseline tests
6.1 Baseline test: COB against fresh database
6.2 Baseline test: COB on DB with added transactions
7 Application layer HA Tests
7.1 Kill AS processes on the App layer
7.1.1 JMeter transaction summary
7.1.2 System resources usage
7.2 Graceful shutdown and start of AS Processes on App layer
7.2.1 JMeter transactions summary
7.2.2 DB records summary
7.2.3 System resources usage
7.3 Restart App Layer VM Nodes
7.3.1 Results summary
7.3.2 JMeter transaction summary
7.3.3 DB records summary
7.3.4 System resources usage
8 Web layer HA tests
8.1 Kill of AS process Web layer
8.1.1 JMeter transaction summary
8.1.2 DB records summary
8.1.3 System resources usage
8.2 Graceful shutdown and start of AS Processes on Web layer
8.2.1 JMeter transactions summary
8.2.2 DB records summary
8.2.3 System resources usage
8.3 Deployment Manager failure on App and Web layers
8.3.1 Result summary
8.3.2 System resources usage
8.4 Node Agent failure on App and Web layers
8.4.1 Test summary
8.4.2 App layer system resources usage
8.4.3 Web Layer system resources usage
8.5 Shutdown and start of IHS processes on Web layer
8.5.1 Kill IHS1 procedure
8.5.2 Kill IHS2 procedure
8.5.3 Results summary
8.5.4 DB records summary
8.5.5 System resources usage
8.6 Graceful restart of IHS
8.6.1 IHS1 graceful shutdown procedure
8.6.2 IHS2 graceful shutdown procedure
8.6.3 JMeter transaction summary
8.6.4 System resources usage
8.7 Restart of Web layer VM Nodes
8.7.1 Result summary
8.7.2 JMeter transaction summary
8.7.3 DB Records summary
8.7.4 System resources usage
9 Messaging layer HA tests
9.1 IBM MQ failures
9.1.1 Kill Queue Manager and MQ VM restart procedure
9.1.2 Result summary
9.1.3 Records count
9.1.4 JMeter aggregate summary
9.1.5 JMeter transaction summary
9.1.6 System resources usage
10 Data layer HA tests
10.1 DB node failure
10.1.1 Result summary
10.1.2 Record count
10.1.3 Enterprise Manager DB nodes view
10.1.4 Error summary
10.1.5 System resources usage
11 COB HA tests
11.1 COB under failure condition
11.1.1 Kill procedure
11.1.2 Result summary
11.1.3 System resources usage
Failure of AppDM
Failure of AppNA 1, 2 and 3
Failure of AppAS 1, 2 and 3
11.2 COB with database VM restart
11.2.1 System resources usage
12 DR tests with online traffic
12.1 Site switchover
12.1.1 Switchover procedure
12.1.2 Result summary
12.1.3 Transactions per second
12.1.4 DB Records summary
12.1.5 System resources usage
12.2 Site failover
12.2.1 Failover procedure
12.2.2 Result summary
12.2.3 Transactions per second
12.2.4 DB records summary
12.2.5 System resources usage
13 DR tests with COB
13.1 Site switchover
13.1.1 System resources usage
13.2 Site failover
13.2.1 Test procedure
13.2.2 Test summary
13.2.3 COB monitor after failover
13.2.4 System resources usage
14 Scalability testing: adding physical nodes
14.1 Adding a Web layer node to the existing cluster
14.1.1 Test result
14.1.2 System resources usage
14.2 Adding an App layer node to the existing cluster
14.2.1 Test result
14.2.2 System resources usage
15 Glossary


1 About this test report


1.1 Trademark
TEMENOS T24 is a registered trademark of the TEMENOS GROUP and referred to as ‘T24’.

IBM is a registered trademark of IBM Corporation and/or its affiliates. Other names may be trademarks of
their respective owners.

1.2 Legal
© Copyright 2018 Temenos Headquarters SA. All rights reserved.

The information in this guide relates to TEMENOS information, products and services. It also includes
information, data and keys developed by other parties.

While all reasonable attempts have been made to ensure accuracy, currency and reliability of the content in
this guide, all information is provided "as is".

There is no guarantee as to the completeness, accuracy, timeliness or the results obtained from the use of this
information. No warranty of any kind is given, expressed or implied, including, but not limited to warranties
of performance, merchantability and fitness for a particular purpose.

In no event will TEMENOS be liable to you or anyone else for any decision made or action taken in reliance
on the information in this document or for any consequential, special or similar damages, even if advised of
the possibility of such damages.

TEMENOS does not accept any responsibility for any errors or omissions, or for the results obtained from the
use of this information. Information obtained from this guide should not be used as a substitute for
consultation with TEMENOS.

References and links to external sites and documentation are provided as a service. TEMENOS is not
endorsing any provider of products or services by facilitating access to these sites or documentation from this
guide.


The content of this guide is protected by copyright and trademark law. Apart from fair dealing for the
purposes of private study, research, criticism or review, as permitted under copyright law, no part may be
reproduced or reused for any commercial purposes whatsoever without the prior written permission of the
copyright owner. All trademarks, logos and other marks shown in this guide are the property of their
respective owners.

1.3 Document history

Version  Date         Change         Author
1.0      June 2018    First release  Kaydee Dzvuke and Christos Tsirkas
1.1      August 2018  Final review   Nanda Badrappan


1.4 Contributors
Temenos

Name Role

Nanda Badrappan Project Manager

Simon Henman Project Manager

Christos Tsirkas Specialist

Kaydee Dzvuke Specialist

Mohand Oussena Performance Architect

Sheeraz Junejo Solution Architect

Yanxin Zhao Database Developer

Raviteja Penki Specialist

IBM

Name Role

Adam Deleeuw Technical Consultant

Ian Marshall UK Business Partner, Solutions Hub

Jag Jhajj Business Solutions Architect


2 Introduction
This IBM Stack 4 Reference Architecture HADR Test Report presents the results of the High Availability (HA), Disaster Recovery (DR) and scalability testing that Temenos carried out on its Core Banking System.

We tested the architecture using MQ connectivity between the web and application (App) layers. The
software was deployed in line with the R17 Stack 4 for AIX / WebSphere. The stack is supported by Temenos
for all post R16 AMR releases, up to and including R17 AMR.


3 Executive summary
We carried out high availability, disaster recovery and scalability testing on the Temenos Core Banking
system. The tested architecture is n-tier clustered with manual failover to DR. It comprises four layers:

- Web
- Message
- Application
- Data

3.1 HA tests with online traffic


The tests show that the solution is highly available and recovers within a few seconds with only a few errors. The error counts shown in the table relate to a total of 5000 transactions. The errors reported by JMeter correspond to those a real end user would experience while using the system.

Cluster load balancing was found to function well during all failure scenarios.

The numbers in the table are totals. For example, in the Host reboot entry, the 11 JMeter errors are the total
errors caused by all three Virtual Machine (VM) reboots.

The infrastructure behaved as expected under all online failure scenarios. These tests involve the most demanding and non-graceful events, and almost always result in the loss of in-flight transactions.


For each test, the JMeter errors, DB missing errors, average downtime (seconds) and availability percentage (assuming 100 events per year) are listed.

Web layer

IBM HTTP load balancer restart
    Kill IBM HTTP Server (IHS):  JMeter errors 1, DB missing errors 1, average downtime 0 secs, availability 99.9999%
    Graceful shutdown of one of the IHS:  JMeter errors 0, DB missing errors 0, average downtime 0 secs, availability 100%

Deployment Manager and Node Agent failures
    Kill 1 Deployment Manager (DM) process:  JMeter errors 0, DB missing errors 0, average downtime 0 secs, availability 100%
    Kill and restart, one by one, all Node Agents (NA):  JMeter errors 0, DB missing errors 0, average downtime 0 secs, availability 100%

Host reboot
    Reboot all Web layer hosts, each after the previous host has recovered and all processes are running:  JMeter errors 12, DB missing errors 5, average downtime 162 secs, availability 99.9997%

App layer

Application server failure
    Kill all Application Server (AS) processes one by one:  JMeter errors 1, DB missing errors 1, average downtime 7 secs, availability 99.9999%
    Graceful shutdown of AS:  JMeter errors 0, DB missing errors 0, average downtime 220 secs, availability 100%
    Kill Deployment Manager (DM):  JMeter errors 0, DB missing errors 0, average downtime 0 secs, availability 100%
    Kill NA, restart it and repeat on all NAs:  JMeter errors 0, DB missing errors 0, average downtime 0 secs, availability 100%
    Reboot all App layer hosts, each after the previous host has recovered and all processes are running:  JMeter errors 7, DB missing errors 5, average downtime 170 secs, availability 99.9998%

Data layer

Node shutdown
    Graceful DB node shutdown:  JMeter errors 45, DB missing errors 45, average downtime 210 secs, availability 99.9990%

Messaging layer

Kill Queue Manager (QM) and failover
    Kill Queue Manager:  JMeter errors 30, DB missing errors 30, average downtime 17 secs, availability 99.9993%

Restart QM host
    Reboot one QM host:  JMeter errors 1, DB missing errors 1, average downtime 31 secs, availability 99.9999%

The availability percentage is calculated by assuming one hundred such events over a year and 15
transactions per second (tps) average throughput, as we measured it during the test.

We used the following formula:
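The formula itself appears as an image in the source document and is not reproduced here. Reconstructed from the reported values (and consistent with every row in the table above), it is, to a close approximation:

Availability % = 100 × (1 − (JMeter errors × 100 events per year) / (15 tps × 31,536,000 seconds per year))

For example, the Web layer host reboot row gives 100 × (1 − (12 × 100) / (15 × 31,536,000)) ≈ 99.9997%.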

3.2 HA tests with COB

Close of Business (COB) is the Temenos End of Day process: it is multithreaded, transactional and service driven, and it does not take the system offline while it runs. The following tests were designed to find out how the system behaves under the failure scenarios when executing COB.

COB was started in servlet mode within WebSphere, with two COB agents (tSA) per server, resulting in 6 worker threads in total. The service management in Temenos Core Banking is transactional and extremely flexible by design. As expected, all the tests finished without any errors.

In all cases, the uncommitted transaction blocks that resulted from the failure scenarios were rolled back in the DB and execution resumed on whatever resources were still available.


HA test description: COB under failure condition

    Kill all AS, DM, NA and IHS one by one, but only when the previous kill has been restored:  COB duration 31 minutes
    Shut down one of the RAC nodes and bring it back up:  COB duration 31 minutes

These COB times should be compared with the baseline tests with 6 tSAs, which took 30 minutes. The duration differences were not significant because:

- The downtime duration was not long enough.
- Resource utilisation did not approach 100%.

3.3 DR tests with online traffic

In these test scenarios, the system switches execution from the Live site to the DR site. This is caused by either scheduled maintenance (switchover) or a catastrophic event (failover).

Downtime in the tables is the period during which the service is unavailable. Downtime includes an additional voluntary delay of about 2 minutes between killing the DB and triggering the failover. The test results were as expected, with failed transactions occurring only for the duration of the system downtime.

In the following two tables, we list the availability percentage, which is calculated by assuming 100 such
events over the period of a year.

The Recovery Time Objective (RTO) is the duration of time, and a service level, within which a business process must be restored after a disaster in order to avoid unacceptable consequences associated with a break in continuity. The RTO is listed in the following two tables; it is essentially the time between the first and last errors in JMeter.


Site switchover with online traffic

    DR test description:  Online traffic on live site
    Test details:  Switchover from live site to DR site
    DB failover downtime:  2 mins 29 secs
    Load balancer switchover downtime:  ~0 secs
    RTO:  ~270 secs
    Availability:  99.91%

Site failover with online traffic

    DR test description:  Online traffic on live site
    Test details:  Shut down the live DB and execute a manual failover
    DB failover downtime:  2 mins 22 secs
    Load balancer failover downtime:  ~0 secs
    RTO:  ~242 secs
    Availability:  99.92%

3.4 DR Tests with COB

These tests were designed to find out how the system behaves under a switchover or a failover while running the COB. As in 3.2 HA tests with COB, there were no failed transactions and the COB finished successfully both after a switchover and after a failover to DR.

Site failover during COB

    DR test description:  COB on Live site
    Test details:  Start COB, shut down the Live DB and execute a manual failover
    COB duration:  31 mins
    DB failover duration:  1 min 47 secs


Site switchover during COB

    DR test description:  COB on Live site
    Test details:  Start COB on the Live DB and execute a manual switchover
    COB duration:  37 mins
    DB switchover duration:  2 min 34 secs

3.5 Scalability testing

The scalability tests checked the elasticity of the infrastructure by adding a new Web node and a new Application (App) node. Both tests were successful: as expected, each new node joined the cluster and received traffic according to the load-balancing rules.

Adding physical nodes

SC test description: Online traffic on live site

    Add a new Web layer node to the existing cluster:  Traffic was distributed to the new Web Application Server (AS) as part of the round-robin load balancing process.
    Add a new App layer node to the existing cluster:  The new App node received traffic generated from the Web layer nodes.


4 Solution deployment
4.1 Solution description

Our High Availability solution is a 4-tier architecture, comprising:

- A Web layer.
- An App layer.
- A messaging layer.
- A Data layer.

On the Live site, the App layer and the Web layer form two separate clusters, each containing three application servers (AS). Two of the Web layer nodes also have IBM HTTP Server (IHS) configured to forward the HTTP requests in a round-robin fashion across all three Web layer nodes. Session replication is enabled so that messages are not lost if one of the nodes fails.

An external Apache Load Balancer (LB) is configured on a separate physical machine to forward incoming HTTP requests to both IHS instances. Traffic received:

- In the live LB is serviced by the live site.
- In the DR LB is serviced by the DR site, provided the DB in that site is active.

The switch between sites was simulated at the level of the JMeter host, by changing the IP mapping in the Windows "hosts" file.
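For illustration only, the change on the JMeter Windows VM might look like the following hosts file entries; the IP addresses and hostname shown here are hypothetical and are not taken from the test environment:

# C:\Windows\System32\drivers\etc\hosts on the JMeter VM
# 10.10.1.10   t24-lb.example.local   # live site Apache LB (commented out during the switch)
10.10.2.10   t24-lb.example.local     # DR site Apache LB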

The messaging layer contains two physical hosts, each with a Queue Manager configured in active/standby
mode so that when the active one fails the standby automatically becomes active. The failed QM will need to
be restarted so that it becomes standby.
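As a minimal sketch only (the report does not document the exact MQ administration steps used), the instance roles of a multi-instance queue manager named QM1 - the name used in the kill commands later in this report - could be checked, and the failed instance restarted as the new standby, with the standard IBM MQ control commands:

dspmq -x -m QM1    # show the running instances of QM1 and which one is active or standby
strmqm -x QM1      # restart the failed instance; it rejoins as the standby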

The same setup also applies to the DR site, except that the Web and App layers have two nodes instead of three in their respective clusters.


The Data layer of the DR site is serviced by a two-node Oracle RAC database (DB). This DB is connected to the Live site through Oracle's Data Guard technology, and an infrastructure database has been added to host the schema required by Oracle. The RAC database contains the Temenos Core Banking schema only and is kept in sync with the live site database using Oracle Data Guard.

4.1.1 WebSphere terminology


Term Definition

Deployment Manager (DM) This is the administrative process used to provide a centralized
management view and control for all elements in a WebSphere
Application Server distributed cell, including the management of
clusters.

The DM is responsible for the contents of the repositories on each of the nodes. This is managed through communication with the node agent processes on each node of the cell.

In our infrastructure, we had two DMs set up for live and two for DR. In both sites we had one DM for the Web layer cluster and one for the Application layer (App) cluster. Both of these DMs were set up on the primary server of each cluster (i.e. Web1 and App1).

Node Agent (NA) An NA manages all processes on a WebSphere node by communicating


with the DM to coordinate and synchronize the configuration. An NA
performs management operations on behalf of the DM. Essentially, the
NA represents an individual node in the cluster.

In our tested deployment, we have one NA per application server


(discussed below).

Application Server (AS) The AS is the primary component of WebSphere. The server runs a Java
Virtual Machine (JVM), providing the runtime environment for the
Temenos code. In essence, the AS provides containers that specialize in
enabling the execution of Temenos libraries and components.

In the tested deployment, we had one AS per node and one node per
host server.


4.1.2 Architecture naming conventions


Architecture Naming convention

Web layer hosts Referred to as Web followed by an increment: Web1, Web2, and Web3.

Application layer hosts Referred to as App followed by an increment: App1, App2 and App3.

Data layer hosts Referred to as DB followed by an increment: DB1, DB2 and DB3.

Message layer hosts Referred to as MQ followed by an increment: MQ1 and MQ2.

IBM HTTP Server (IHS) Referred to as IHS followed by an increment: IHS1 and IHS2.

For ease of reference, we may refer to a WebSphere process followed by the host increment. For example,
AS1 would mean AS on host #1. The DR site follows the same conventions as above.

4.2 Architecture diagram


4.2.1 Processes in test series

The following processes were running in the live site as part of our test series.

Live site Process

Load Balancer (LB) Apache server

Web1 DM, NA, AS (T24Browser deployed)  and IHS

Web2 NA, AS (T24Browser deployed) and IHS

Web3 NA and AS (T24Browser deployed)

MQ1 QM active

MQ2 QM passive

App1 DM, NA and AS (T24 & TAFJ deployed)

App2 DM, NA and AS (T24 & TAFJ deployed)

App3 DM, NA and AS (T24 & TAFJ deployed)

DB1 Oracle RAC

DB2 Oracle RAC

DB3 Oracle RAC

The following processes were running in the DR site as part of our test series.

DR site Process

Load Balancer (LB) Apache server

Web1 DM, NA, AS (T24Browser deployed) and IHS

Web2 NA, AS (T24Browser deployed) and IHS

MQ1 QM active

MQ2 QM passive

App1 DM, NA and AS

App2 DM, NA and AS

DB1 Oracle RAC

DB2 Oracle RAC

4.3 HA design considerations

Architecture Design consideration

External load balancer As the recommended highly available hardware load balancer was not available, two Apache servers for the Live and DR sites were configured on two separate physical hosts.

For the online traffic to reach either the live site or the DR site, the hosts file on the Windows VM needed to be updated with the desired URL/IP of the machine where the relevant Apache server was installed.

Web layer Temenos BrowserWeb was deployed on all three Web layer cluster members. Online requests coming from the load balancer were distributed across all the Web layer nodes using the IHS HTTP load balancer. Session replication was enabled in IHS and no additional modifications were needed in the BrowserWeb parameters file.

App layer The Temenos Core Banking (T24) and Application Framework (TAFJ) libraries were installed on all 3 App layer cluster members. We placed both T24 and TAFJ runtime libraries on each cluster member, as opposed to separate shared storage. Two IP addresses are provided for the two active/standby Queue Managers.

Message layer Queue Managers were configured in active/passive mode. Each QM runs on its own physical host.

Data layer HA at the Data layer was achieved through a combination of a RAC database with three nodes, an App layer that uses a URL with SCAN addresses to point to the database, and a generic data source type.
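For illustration only (the hostname, port and service name below are hypothetical and not those of the test environment), a generic JDBC data source URL that points at the RAC SCAN address takes a form such as:

jdbc:oracle:thin:@//rac-scan.example.local:1521/T24SVC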

4.4 DR design considerations
The live and DR sites are configured in Active/Standby mode.


An infrastructure database has been added to host the schema required by Oracle technology. The RAC database contains the Temenos Core Banking schema only. The database on the DR site is kept in sync with the live site database using Oracle's Data Guard technology.

4.5 Software used
Temenos

Software Version

T24 R17

TAFJ R17_SP2

TAFJ Java Functions PB201704 03/21/2017

IBM

Software Version

WebSphere 8.5.5.13

IBM HTTP Server (IHS) 8.5.5.12

IBM MQ 8.0.0.7

IBM Java Developer Kit (JDK) 1.7.0

Apache

Software Version

Apache HTTP Server 2.4

Oracle

Software Version

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0


5 Testing approach
The reference architecture exercise focused on the following three areas, each with a set of specific test cases
to be carried out:

- Availability testing.
- Disaster recovery testing.
- Scalability testing.

5.1 Test data

We used custom JMeter scripts to commit transactions through the Web User Interface (UI) of Temenos Core Banking. The total number of script cycles was 1000.

See also 5.3 HA tests with online traffic.

5.2 Tools

JMeter (load generation and response validation) and nmon (system resource monitoring).
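As an illustration only (the exact capture settings used during the tests are not recorded in this report), nmon data of the kind shown in the resource usage sections can be collected on AIX with a recording command such as:

nmon -f -s 10 -c 360    # write snapshots to a .nmon file every 10 seconds, 360 snapshots in total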

5.3 HA tests with online traffic

5.3.1 Test traffic generation

Traffic is generated by JMeter scripts running on a Windows VM.

The requests are sent to the Apache load balancer, which forwards them to the IHS instances on the two Web layer nodes. Fifty users are configured: five JMeter instances each execute 10 concurrent threads (one thread per user). A sample file feeds 1000 JMeter script cycles for processing.

Each user iterates 20 times, which means that the 50 users execute the 1000 loops, drawing data from the sample file.

Each thread executes sessions that carry out the following tasks:


1. Go to the login page.

2. Log in.

3. Create a customer.

4. Create two accounts (foreign and local) for the customer.

5. Open till.

6. Make a deposit on the local account.

7. Make a deposit on the foreign account.

8. Get the account balance (ACCT.BAL.TODAY).

9. Make two statement requests (STMT.ENT.BOOK).

10. Log off.

A constant throughput timer is used to limit throughput to 5 Transactions per Second (TPS). Each JMeter
testing cycle was executed in its own session.

JMeter has been configured to stop the session if a failure occurs during execution and to start a new one.

5.3.2 Test validation

The JMeter scripts have robust response assertions. In addition, at the end of every test run, the following
JQL scripts are executed against the database to count the total number of records that have been inserted:

COUNT FBNK.CUSTOMER WITH CUSTOMER.NO LIKE 820...


COUNT FBNK.ACCOUNT WITH ACCOUNT.NO LIKE 14...
COUNT FBNK.TELLER WITH @ID LIKE TT171148...
COUNT FBNK.TELLER WITH @ID LIKE TT171147...

The COUNT queries are designed to return all the records that were inserted by the
JMeter scripts (and only these records).


JMeter error count

Every thread on JMeter executes the requests sequentially. If, during a failure test, the login page fails, then all subsequent transactions in that session will fail. To avoid registering these additional errors, which are a consequence of the previous step, JMeter has been configured to stop the session and start a new one. The errors logged by JMeter therefore represent errors caused by the failure test.

Errors reported by JMeter reflect what a real end user would see.

Test execution

While traffic is running, the following commands are executed:

IHS kill

date;ps -ef | grep httpd | grep -v  Xvnc | grep -v grep | awk '/http/ {print $2}'
date;i=$(ps -ef | grep httpd | grep -v  Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i;
do kill -9 "$x";done

IHS graceful shutdown

apachectl -k graceful

Deployment Manager kill

date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'


date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')

Node Agent kill

date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'


date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')

App layer Application Server kill

date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'


date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')


Web Layer Application Server kill

date;ps -ef | grep java | awk '/WebSrv/ {print $2,$(NF-1), $(NF)}'


date;kill -9 $(ps -ef | grep java | awk '/WebSrv/ {print $2}')

Queue Manager kill

date; ps -ef | grep "QM1 -x" | grep -v NF | grep  -v grep | awk '/QM1/ {print $2,$(NF-1), $(NF)}'
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep  -v grep | awk '/QM1/ {print $2}')

DB Node shutdown

While traffic is running, the following command is used to shut down the DB service.

su - oracle

sqlplus / as sysdba

shutdown transactional;

Graceful shutdown of AS

While traffic is running, use WebSphere console to gracefully shut down and then start the application
servers one at a time.
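Equivalently (a sketch only; the tests used the administrative console), each application server can be stopped and started from the command line with the standard WebSphere profile scripts, for example:

<PROFILE_HOME>/bin/stopServer.sh AppSrv01
<PROFILE_HOME>/bin/startServer.sh AppSrv01

where <PROFILE_HOME> is the WebSphere profile directory of the node hosting AppSrv01.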

Restart of VMs

While traffic is running, use the reboot command to restart the relevant box.

5.4 HA tests with COB

While COB is running, kill the AS processes, the Admin server, the Node Agents and IHS one by one, but only when the previous kill has been restored. A DB node is also shut down during COB.


5.5 DR Tests with online traffic


To run DR tests with online traffic:

1. Generate traffic with JMeter and let it run for 5 minutes.

2. Switchover or failover the DB, depending on the test (see the DGMGRL sketch after this list):

a. If switchover test: execute switchover.

b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.

3. Update the Windows host file to DR load balancer and then disable routing to live site load balancer.
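If the Oracle Data Guard broker is in use (an assumption; the report does not state how the role transition was driven), the switchover and failover in step 2 could be executed from DGMGRL along the following lines, where t24db_dr is a hypothetical name for the standby database:

DGMGRL> SWITCHOVER TO 't24db_dr';
DGMGRL> FAILOVER TO 't24db_dr';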

5.6 DR tests with COB


To run DR tests with COB:

1. Start COB on the LIVE site and wait until it gets to the Application stage.

2. Switchover or failover DB, depending on the test.

a. If switchover test: stop COB on live site first and then execute switchover.

b. If failover test: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then execute failover.

3. Continue COB on DR server.

a. Set Temenos Service Manager (TSM) and COB services to START.

b. Execute START.TSM from servlet.


6 Baseline tests
Running baseline tests involved:

1. Running a COB with no transactions added.

2. Going through 5000 JMeter cycles to populate the DB with records.

3. Creating a DB restore point which could be used for other COB tests.

4. Running the COB again and keeping measured performance as a baseline.

6.1 Baseline test: COB against fresh database

COB was started in servlet mode with two COB agents running on each of the three App layer nodes.

Start time End time Baseline duration Comments

09:36:44 10:07:46 31 mins 02 secs Times extracted from COMO files.

6.2 Baseline test: COB on DB with added transactions

Before running COB, 5000 JMeter script cycles injected transactions into the system. COB was started in servlet mode with two COB agents running on each of the three App layer nodes.

As the table shows, after the 5000 JMeter script cycles COB took just over two minutes longer to complete. No errors were recorded.

Start time End time Baseline duration Comments

11:03:07 11:36:14 33 mins 14 secs Times extracted from COMO files.


7 Application layer HA Tests


7.1 Kill AS processes on the App layer

During the test, all AS processes on the App layer were killed one at a time, giving each enough time to restart before killing the next. The expected behaviour for this test was as follows:

- NA should restart the AS automatically.
- Traffic will be routed to the remaining nodes until the failed AS has fully restarted.
- Traffic will be balanced between all nodes again once the failed AS has fully recovered.
- Minimal traffic disturbance is expected.

KillAppAS.processes

APP PR1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:22:13 GMT 2018
8388872 eg15ph09Node01 AppSrv01
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:24:38 GMT 2018
Process back up at: 10:24:41
=================== APP PR2
[eg15ph10:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:27:48 GMT 2018
5374440 eg15ph10Node01 AppSrv02
[eg15ph10:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:27:56 GMT 2018
Process back up at: 10:28:01
================================== APP PR3
[eg15ph11:root] / # date;ps -ef | grep java | awk '/AppSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:29:42 GMT 2018
8913256 eg15ph11Node01 AppSrv03
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/AppSrv0/ {print $2}')
Tue Feb 20 10:29:47 GMT 2018
Process back up at: 10:29:48

Kill AS process Time Recovery time JMeter errors Adjusted errors Downtime (Approx. secs)

AppSrv01 10:24:38 10:24:41 5 1 3

AppSrv02 10:27:48 10:28:01 0 0 13

AppSrv03 10:29:42 10:29:48 0 0 6


7.1.1 JMeter transaction summary

The image below shows a series of JMeter errors. Killing the AS in App1 server caused the Customer error,
which in turn triggered the Account and Cash deposit errors. The killing of AS in App2 and App3 servers
produced no errors.

The errors after 9:27 are due to the restart of the AS in the Web Servers, which is discussed separately in 8.1
Kill of AS process Web layer.

Transaction Expected Actual Missing Total Adjusted total

CUSTOMER 6000 5999 1

ACCOUNT 12000 11998 2 5 1

CASH DEPOSITS 12000 11998 2

Because one customer failed, the missing accounts comprise two accounts (local and foreign) and two cash
deposits (local and foreign). That means the adjusted total is 1.


7.1.2 System resources usage

Application Server 1 behaviour

At 10:24 the AS process was killed. This led to low CPU activity. The process was automatically restarted,
hence the increased CPU activity, due to initialisation.

The above diagram demonstrates what is happening to the Application Server.


After the killing of the AS process, network activity slows down to almost zero. Immediately after the restart
of the process, there is increased network activity due to message inflow and the initialisation of queues.

Similar behaviour is observed in the Application Servers 2 and 3.


7.2 Graceful shutdown and start of AS Processes on App layer

While JMeter was running, the app servers were individually stopped and then restarted from the WebSphere console. The expected behaviour for this test was as follows:

- Traffic will be routed to the remaining nodes until the stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once the stopped AS has fully recovered.
- No traffic disturbance is expected.

GracefulShutDownAppLayerAS.txt

Stopping the App server from the WebSphere Web console


=====================================================
Then checking the start and up times using this command
on each App layer VM
====================================================
cd $NODE_HOME/logs/nodeagent/
grep "Detected server AppSrv" * | cut -f 2,11,12,13 -d " "
12:47:09:195 server AppSrv01 stopped
12:51:53:605 server AppSrv01 started
12:52:22:680 server AppSrv02 stopped
12:55:22:631 server AppSrv02 started
12:58:13:919 server AppSrv03 stopped
13:01:15:025 server AppSrv03 started

Graceful shutdown AS    Time    Recovery time    JMeter errors    Adjusted errors    Downtime (Approx. minutes)

AppSrv01 12:47:09:195 12:51:53:605 0 0 5

AppSrv02 12:52:22:680 12:55:22:631 0 0 3

AppSrv03 12:58:13:919 13:01:15:025 0 0 3

Because this was a graceful kill, the app server had to wait for any transactions to be processed before it shut
down, which is why no errors were recorded.


7.2.1 JMeter transactions summary

The graceful shutdowns occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being ramped up properly.

7.2.2 DB records summary


Transaction Expected Actual Missing Total Adjusted total

CUSTOMER 6000 6000 0

ACCOUNT 12000 12000 0 0 0

CASH DEPOSITS 12000 12000 0

A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.


7.2.3 System resources usage

The system resources usage is similar to that shown in 7.1.2 System resources usage. The only difference is
that due to the graceful shutdown, there is no automatic restarting of AS, which means there's time for the
activity to settle down to near-zero levels.

Application Server 1 behaviour


Application Server 2 behaviour


Application Server 3 behaviour

7.3 Restart App Layer VM Nodes

While JMeter was running, all App layer hosts were rebooted one by one. The expected behaviour was:

- Some disturbance is expected.
- The cluster should recover to a normal state once the rebooted host has recovered.

App layer host    Processes affected    JMeter errors    Adjusted errors    Time

Host 1    DM, NA and AS    8    3    15:48:51

Host 2    NA and AS    6    2    15:52:54

Host 3    NA and AS    4    2    15:56:25

7.3.1 Results summary

The reboot of the App layer hosts while injecting transactions through JMeter resulted in some loss of data.


7.3.2 JMeter transaction summary

The image below shows the three application server reboot events. The disturbance at the beginning of the test is due to the sudden activity from multiple threads.

We corrected the stress method in subsequent tests by adding a two-minute ramp-up period, giving WebSphere some time to initialise its connection pools.


7.3.3 DB records summary

Because 3 customers failed, we would expect 6 account and 6 cash deposit records to be missing. Instead, an additional 2 account and 2 cash deposit records were missing. The 2 extra missing cash deposit records are explained by the 2 extra missing accounts. Therefore, the adjusted total is 3 + 2 + 0 = 5.

Transaction Expected Actual Missing Total Adjusted total

CUSTOMER 6000 6000 3

ACCOUNT 12000 11992 8 19 5

CASH DEPOSITS 12000 11992 8

7.3.4 System resources usage

We have captured uninterrupted resource utilisation diagrams that encompass the VM restart event. Although the impact was visible in other diagrams as well, we only list the two most interesting ones: disk and network activity.


Application Server 1

The actual restart was executed at 15:48:51 and completed at 15:48:52. The server was operationally ready again at 15:51:30. The initialisation of the various resources of the VM can be clearly seen in the large spike in disk activity.

As expected, the opposite behaviour is observed in network activity. While the VM is restarting or initialising resources, communication decreases; after that we expect an increase above the pre-restart level, as the listeners try to consume all remaining messages that have not timed out.


Similar behaviour was exhibited by the other two Application Servers.

Application Server 2

The restart was executed at 15:52:54 and finished at 15:52:55. The server was again operationally ready at
15:55:51. The behaviour is similar to Application server 1.


Application Server 3

The restart was executed at 15:56:26 and finished at 15:56:27. The Server was again operationally ready at
15:59:25. The behaviour is similar to Application Server 1 and 2.


8 Web layer HA tests


8.1 Kill of AS process Web layer

During this test we killed all three AS processes, one at a time, allowing the killed process to recover before moving on to the next one. The expected behaviour for this test was as follows:

- NA should restart the AS automatically.
- Traffic will be routed to the remaining nodes until the failed AS has fully restarted.
- Traffic will be balanced between all nodes again once the failed AS has fully recovered.
- Minimal traffic disturbance is expected.

KillWebAS.processes

================================== WAS PR1


[eg15ph06:root] / # date;ps -ef | grep java | awk '/WebSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:31:53 GMT 2018
8192258 eg15ph06Node01 WebSrv01
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/WebSrv0/ {print $2}')
Tue Feb 20 10:32:13 GMT 2018
Process back up at:10:32:15
ERROR 8 session timeouts for user DEMOTEST16 during its current jmeter loop.
================================== WAS PR2
[eg15ph07:root] / # date;ps -ef | grep java | awk '/WebSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:35:18 GMT 2018
8061438 eg15ph07Node01 WebSrv02
[eg15ph07:root] / # date;kill -9 $(ps -ef | grep java | awk '/WebSrv0/ {print $2}')
Tue Feb 20 10:35:32 GMT 2018
Process back up at:10:35:34
================================== WAS PR3
[eg15ph08:root] / # date;ps -ef | grep java | awk '/WebSrv0/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 10:37:01 GMT 2018
5767622 eg15ph08Node01 WebSrv03
[eg15ph08:root] / # date;kill -9 $(ps -ef | grep java | awk '/WebSrv0/ {print $2}')
Tue Feb 20 10:37:04 GMT 2018
Process back up at: 10:37:06

Kill Application Server process    Time    Recovery time    JMeter errors    Adjusted errors    Downtime (Approx. secs)

WebSrv01    10:32:13    10:32:15    10    2    2

WebSrv02    10:35:32    10:35:34    1    1    2

WebSrv03    10:37:04    10:37:06    1    1    2

8.1.1 JMeter transaction summary

There were some session timeout errors recorded in JMeter as the Web layer AS process went down.

The highlighted failed-transaction spikes represent the two failed customer creations, one of which is shown in the screenshot above. These were due to the killed AS on Web1. The last two spikes are due to the killed AS on the Web2 and Web3 servers.


8.1.2 DB records summary


Transaction Expected Actual Missing Total Adjusted total

CUSTOMER 6000 5998 2

ACCOUNT 12000 11996 4 11 3

CASH DEPOSITS 12000 11995 5

From the above table, since 2 customers failed, we expected 4 account and 4 cash deposit records to be missing. Instead, there was an additional 1 cash deposit missing. That means the adjusted total is 2 + 0 + 1 = 3.

8.1.3 System resources usage

Restarting the Web Server did not have a visible impact on most resource utilisation diagrams. The CPU disturbance was of the same order of magnitude as other random events during normal operation, because the servers are sized to handle much larger traffic loads and the CPU can easily handle all tasks.

Network utilisation is not a useful indicator either, because the calls keep coming to the VM whether the Web Server is up or not. There are peaks in the diagram, but none significant enough to indicate an extraordinary event.


The disk diagrams, on the other hand, are a different case: there the restart of the AS process can be seen quite clearly. Below are the disk diagrams, along with the CPU and network diagrams, as an interesting comparison between Web and Application Server behaviour. See also 7 Application layer HA Tests.

Web Server 1


Web Server 2


Web Server 3


8.2 Graceful shutdown and start of AS Processes on Web layer

While JMeter was running, the Web layer application servers were individually stopped and then restarted from the WebSphere console. The expected behaviour for this test was as follows:

- Traffic will be routed to the remaining nodes until the stopped AS has been restarted manually.
- Traffic will be balanced between all nodes once the stopped AS has fully recovered.
- No traffic disturbance is expected.

GracefulShutDownWebLayerAS.txt

Stopping the App server from the WebSphere Web console


=====================================================
Then checking the start and up times using this command
on each Web layer VM
====================================================
cd $NODE_HOME/logs/nodeagent/
grep "Detected server AppSrv" * | cut -f 2,11,12,13 -d " "
13:04:07:871 server WebSrv01 stopped
13:07:19:926 server WebSrv01 started
13:09:08:705 server WebSrv02 stopped
13:11:43:097 server WebSrv02 started
13:13:14:274 server WebSrv03 stopped
13:18:28:837 server WebSrv03 started


Graceful shutdown AS    Time    Recovery time    JMeter errors    Adjusted errors    Downtime (Approx. mins)

WebSrv01    13:04:07:871    13:07:19:926    0    0    3

WebSrv02    13:09:08:705    13:11:43:097    0    0    3

WebSrv03    13:13:14:274    13:18:28:837    0    0    4

A graceful shutdown of the Web layer AS processes did not cause any loss of transactions. The screenshot below shows a healthy JMeter results tree for the duration of the test.

8.2.1 JMeter transactions summary

The graceful shutdowns occurred during the highlighted period in the image below. The small disturbance at the start of the test occurred outside the testing window and is due to the JMeter thread execution not being ramped up properly.


8.2.2 DB records summary


Transaction Expected Actual Missing Total Adjusted total

CUSTOMER 6000 6000 0

ACCOUNT 12000 12000 0 0 0

CASH DEPOSITS 12000 12000 0

A graceful restart resulted in no data loss, as all pending transactions were completed first and incoming transactions were routed to the remaining nodes.

8.2.3 System resources usage

The system resources usage here is similar to that shown in 8.1.3 System resources usage. The only difference is that, due to the graceful shutdown, there is no automatic restart of the AS, so there is time for the activity to settle down to near-zero levels.

The network is not affected, unlike in the Application Server shutdown, because of the continuous requests coming from the injector. In the case of the Application Server, the messages are pulled from MQ rather than pushed by an injector.

Note the low CPU utilisation (5-10%), a result of the over-sized capacity of the Web Server VMs.


Web Server 1

Web Server 2


Web Server 3

8.3 Deployment Manager failure on App and Web layers

The expected behaviour for this test was:

- The admin console will be down.
- No traffic disturbance should occur and there should be no impact on the existing application servers.
- The admin console should become available again once the DM has fully recovered.

dmkill.txt

Test-AT3a times and commands


date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
app -dm============================
[eg15ph09:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:44:22 GMT 2018
8651064 eg15ph09CellManager01 dmgr
9634080 dmgr -dmgr
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')


Tue Feb 20 14:44:28 GMT 2018

was-dm=================================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:46:49 GMT 2018
9699660 dmgr -dmgr
10158454 eg15ph06CellManager01 dmgr
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Feb 20 14:46:53 GMT 2018

Kill process Time

App Layer DM 14:44:28

Web Layer DM 14:46:53

8.3.1 Result summary

As expected, the WebSphere console was not accessible while the Deployment Manager process was down, and there was no traffic disturbance.


8.3.2 System resources usage

The Deployment Manager is not responsible for any traffic, so the network is not affected. CPU and disk activity, however, are affected, since killing and restarting the process entails the initialisation of resources and an update of state from the servers in the cluster.

Web Server 1


Application Server 1

8.4 Node Agent failure on App and Web layers

The test was executed on all NAs in this sequence:

- Kill the NA process on one node.
- Wait a couple of minutes and restart it manually.
- Do the same on the other nodes.

The expected behaviour was:

- No traffic disturbance should occur, and there should be no impact on the existing ASs or the Admin server.
- The admin console should stay available.

Kill_NA_on_App_n_Web_layer.txt


Test-AT3b times and commands


date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh

=========app pr1
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:50:16 GMT 2018
9765202 eg15ph09Node01 nodeagent
[eg15ph09:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:50:27 GMT 2018
[eg15ph09:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:51:24 GMT 2018
[eg15ph09:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:51:51 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent


ADMU3200I: Server launched. Waiting for initialization status.


ADMU3000I: Server nodeagent open for e-business; process id is 8061268
exit code: 0
[eg15ph09:root] / #
====================app pr2==========
date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:53:58 GMT 2018
9306558 eg15ph10Node01 nodeagent
[eg15ph10:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:54:16 GMT 2018
[eg15ph10:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:54:24 GMT 2018
[eg15ph10:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:55:04 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9830806
exit code: 0
[eg15ph10:root] / #

============app pr 3==========================
date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:15 GMT 2018
9240874 eg15ph11Node01 nodeagent
[eg15ph11:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 14:56:32 GMT 2018
[eg15ph11:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 14:56:38 GMT 2018
[eg15ph11:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 14:57:50 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 6226370

=was pr 1====================
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:34 GMT 2018
5898516 eg15ph06Node01 nodeagent
[eg15ph06:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:00:36 GMT 2018
[eg15ph06:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:00:44 GMT 2018
[eg15ph06:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:01:55 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 11272492
exit code: 0

was pr 2===================
[eg15ph07:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'


Tue Feb 20 15:02:39 GMT 2018


8388910 eg15ph07Node01 nodeagent
[eg15ph07:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:02:42 GMT 2018
[eg15ph07:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:03:35 GMT 2018
[eg15ph07:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:04:12 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9503056
exit code: 0
[eg15ph07:root] / #

=======waspr 3============

[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:05:23 GMT 2018
10027350 eg15ph08Node01 nodeagent
[eg15ph08:root] / # date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Feb 20 15:05:52 GMT 2018
[eg15ph08:root] / # date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Feb 20 15:06:00 GMT 2018
[eg15ph08:root] / #
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
[eg15ph08:root] / # date;/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/stna.sh
Tue Feb 20 15:07:29 GMT 2018
launching server using startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 8978700
exit code: 0
[eg15ph08:root] / #

Kill process   Time       JMeter errors   Adjusted errors
AppNA1         14:50:27   0               0
AppNA2         14:54:16   0               0
AppNA3         14:56:32   0               0
WebNA1         15:00:36   0               0
WebNA2         15:02:42   0               0
WebNA3         15:05:52   0               0

8.4.1 Test summary

There was no data loss resulting from killing the node agents.

8.4.2 App layer system resources usage

Application Server 1

The disk utilisation diagram for Application Server 1 is significantly different from Servers 2 and 3. This is
because Server 1 also runs the Deployment Manager.

Application Server 2

Notice the difference in the disk utilisation profile in comparison to Application Server 1.

Application Server 3

8.4.3 Web Layer system resources usage

Web Server 1

The disk utilisation diagram for Web Server 1 is significantly different from Servers 2 and 3. This is because
Server 1 is running the Deployment Manager as well.

Web Server 2

Notice the difference in the disk utilisation profile, in comparison to Web Server 1.

Web Server 3

8.5 Shutdown and start of IHS processes on Web layer


For this test, IHS was configured to restart automatically in case of failure.

With IHS configured to restart itself upon failure and session replication enabled in IHS, JMeter was started
and the following behaviour was expected:

- Traffic will be routed to the remaining IHS node until the failed IHS has fully recovered.

- Traffic will be balanced between the two IHS nodes once the failed IHS has fully recovered.

- WebSphere app servers will remain balanced.

- There should be no traffic disturbance when session replication is in place.
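
The mechanism used to restart IHS automatically is not shown in the captures that follow. A minimal watchdog along these lines could provide it; the installation path and the apachectl invocation match the transcripts below, while the loop itself and its interval are assumptions:

#!/bin/ksh
# Minimal IHS watchdog sketch: restart IHS if its parent httpd process disappears.
# IHS_HOME matches the installation path used throughout this report.
IHS_HOME=/u01/3rdParty/As/IBM/HTTPServer
while true; do
    # Any surviving httpd from this installation? (exclude the grep itself)
    if ! ps -ef | grep "$IHS_HOME/bin/httpd" | grep -v grep > /dev/null; then
        date
        echo "httpd not running - restarting IHS"
        $IHS_HOME/bin/apachectl -k start
    fi
    sleep 10
done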

8.5.1 Kill IHS1 procedure

kill-shutdown-webserver1.txt

[eg15ph06:root] / # date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:23:23 GMT 2018
6291944
7995682
8454518
9109848
9896296
10223938
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / # date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:23:51 GMT 2018
[eg15ph06:root] / # ps -ef | grep http
root 6619546 1 0 14:08:00 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
nobody 6750532 8061240 0 14:25:55 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 7274872 8061240 0 14:25:58 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 8061240 1 0 14:25:53 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 8257840 8061240 0 14:25:53 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d

/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf


root 9109850 10748376 0 14:26:30 pts/0 0:00 grep http
nobody 9175510 8061240 0 14:25:54 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 10223942 8061240 0 14:25:54 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / # cd cd /u01/3rdParty/As/IBM/HTTPServer/bin/
ksh: cd: 0403-011 The specified substitution is not valid for this command.
[eg15ph06:root] / # cd /u01/3rdParty/As/IBM/HTTPServer/bin/
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #

Time JMeter errors Adjusted errors

14:23:51 7 1

8.5.2 Kill IHS2 procedure

kill-shutdown-webserver2.txt

[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf

t24user 6947098 9634156 0 14:36:23 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d


/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 7602520 1 0 14:07:58 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
t24user 8323366 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 8716738 9568694 0 14:36:37 pts/0 0:00 grep http
t24user 9634156 1 0 14:32:29 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph07:t24user] /bin $

Time JMeter errors Adjusted errors

14:30:53 0 0

8.5.3 Results summary

The expected result was no loss of transactions, but we saw a proxy error during a customer creation, which
resulted in the loss of one customer and its corresponding accounts and cash deposits downstream.

8.5.4 DB records summary


Transaction   Expected   Actual   Missing
CUSTOMER      6000       5999     1
ACCOUNT       12000      11998    2
DEPOSIT       12000      11998    2
Total missing: 5          Adjusted total: 1

As a result of one customer not being created, there was a domino effect of errors on the remaining
transactions that depended on that missing customer.
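
The Expected, Actual and Missing figures in these DB record summaries are counts of the records present in the database after the run. A minimal sketch of such a count, assuming placeholder table names (the actual Temenos schema names are not shown in this report):

# Count the records written during the run (the table names below are placeholders).
sqlplus -s / as sysdba <<EOF
select count(*) from customer;
select count(*) from account;
select count(*) from deposit;
EOF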

8.5.5 System resources usage

The sizing is such that the Web Servers are never fully stressed, so we cannot draw firm conclusions about
CPU and disk utilisation when affecting such a lightweight component as IHS. The network utilisation
diagrams give the most accurate picture.

IHS1

The failure event takes place in the circled portion of the above diagram, which clearly shows how traffic
drops to zero and then returns to normal. Note that the difference in network activity between IHS 1 and
IHS 2 is due to the fact that they are housed on the VMs of Web Servers 1 and 2 respectively; alongside the
IHS activity there is NA activity on both, plus DM activity on Server 1.

IHS2

Similar to IHS 1 above, the circled period shows the drop in network activity. After the restart, normal
operation resumes.

8.6 Graceful restart of IHS


While JMeter was running, IHS was shut down gracefully. The expected behaviour was that:

- Traffic would be routed to the remaining IHS until the failed host was back.

- WebSphere app servers would remain balanced.

- Traffic would be balanced between the two IHS instances once the failed IHS had fully recovered.

- There should be no traffic disturbance when session replication was in place.

8.6.1 IHS1 graceful shutdown procedure

graceful-shutdown-webserver1.txt

[eg15ph06:root] / # date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:23:23 GMT 2018


6291944
7995682
8454518
9109848
9896296
10223938
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / # date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk '/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:23:51 GMT 2018
[eg15ph06:root] / # ps -ef | grep http
root 6619546 1 0 14:08:00 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
nobody 6750532 8061240 0 14:25:55 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 7274872 8061240 0 14:25:58 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 8061240 1 0 14:25:53 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 8257840 8061240 0 14:25:53 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 9109850 10748376 0 14:26:30 pts/0 0:00 grep http
nobody 9175510 8061240 0 14:25:54 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 10223942 8061240 0 14:25:54 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / #
[eg15ph06:root] / # cd cd /u01/3rdParty/As/IBM/HTTPServer/bin/
ksh: cd: 0403-011 The specified substitution is not valid for this command.
[eg15ph06:root] / # cd /u01/3rdParty/As/IBM/HTTPServer/bin/
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #
[eg15ph06:root] /bin #
[eg15ph06:root] /bin # date; apachectl -k graceful
[eg15ph06:root] /bin # date; apachectl -k graceful
Wed Feb 14 14:34:09 GMT 2018
[eg15ph06:root] /bin # ps -ef | grep http
root 6619546 1 0 14:08:00 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
nobody 6750534 8061240 0 14:34:16 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 7274874 10748376 0 14:34:19 pts/0 0:00 grep http
nobody 7995748 8061240 0 14:34:11 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 8061240 1 0 14:25:53 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
nobody 8257842 8061240 0 14:34:10 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf

nobody 9175512 8061240 0 14:34:10 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d


/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph06:root] /bin #

Shutdown process Time

IHS1 14:34:09

8.6.2 IHS2 graceful shutdown procedure

graceful-shutdown-webserver2.txt

[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do echo "$x";done
Wed Feb 14 14:30:22 GMT 2018
4653514
6422986
6947320
8323352
8716696
9634142
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date;i=$(ps -ef | grep httpd | grep -v Xvnc | grep -v grep | awk
'/http/ {print $2}');for x in $i; do kill -9 "$x";done
Wed Feb 14 14:30:53 GMT 2018
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $
[eg15ph07:t24user] /Temenos $ date; /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
Wed Feb 14 14:32:29 GMT 2018
[eg15ph07:t24user] /Temenos $ cd /u01/3rdParty/As/IBM/HTTPServer/bin
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ date; apachectl -k graceful
Wed Feb 14 14:36:23 GMT 2018
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $
[eg15ph07:t24user] /bin $ ps -ef | grep http
t24user 6422990 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 6947098 9634156 0 14:36:23 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
root 7602520 1 0 14:07:58 - 0:00 Xvnc :1 -desktop X -httpd
/opt/freeware/vnc/classes -auth //.Xauthority -geometry 1024x768 -depth 8 -rfbwait 120000 -rfbauth
//.vnc/passwd -rfbport 5901 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi/,/u
sr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
t24user 8323366 9634156 0 14:36:24 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
t24user 8716738 9568694 0 14:36:37 pts/0 0:00 grep http
t24user 9634156 1 0 14:32:29 - 0:00 /u01/3rdParty/As/IBM/HTTPServer/bin/httpd -d
/u01/3rdParty/As/IBM/HTTPServer -k start -f /u01/3rdParty/As/IBM/HTTPServer/conf/httpd.conf
[eg15ph07:t24user] /bin $

Shutdown process Time

IHS2 14:36:23

8.6.3 JMeter transaction summary

When IHS was gracefully shut down, no data loss was experienced. The initial disturbance at the beginning
of the test is, as in a few other tests, the result of not ramping up the injected load from JMeter.

Transaction   Expected   Actual   Missing
CUSTOMER      6000       6000     0
ACCOUNT       12000      12000    0
DEPOSIT       12000      12000    0
Total missing: 0          Adjusted total: 0

8.6.4 System resources usage

This test does not show any difference in workload: as IHS is shut down gracefully, it is able to rebalance the
traffic seamlessly to the remaining IHS. Since both IHS instances are housed on the same VMs as the Web
Servers, any slight difference in CPU and disk activity is masked by the other processes. The network activity
is not impacted because the messages are rerouted to IHS 2 and then balanced between the Web Servers
normally.

The following diagrams demonstrate this. As in 8.5 Shutdown and start of IHS processes on Web layer, the
DM on Web Server 1 masks the activity due purely to message exchange; the network activity of Web
Servers 2 and 3 is much more similar.

Web Server 1

Web Server 2

Web Server 3

8.7 Restart of Web layer VM Nodes


While JMeter was running, all the web layer hosts were rebooted one at a time, and the expected behaviour
was:

- Some disturbance.

- The cluster should recover to a normal state after each rebooted host has recovered.

Web layer hosts Process affected JMeter errors JMeter adjusted Time

Host 1 DM, NA, AS, IHS 25 10 16:00:10

Host 2 NA and AS, IHS 3 2 16:05:15

Host 3 NA and AS Not rebooted
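
Before moving on to the next host, the WebSphere processes on the recovered node can be confirmed to be back up; a minimal check, assuming the same profile layout as in the transcripts earlier in this chapter:

# Report the status of the node agent and application servers on the recovered host.
# Add -username/-password if administrative security is enabled.
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/bin/serverStatus.sh -all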

8.7.1 Result summary

Rebooting the Web layer nodes while injecting transactions through JMeter resulted in proxy errors being
reported and some loss of data.

8.7.2 JMeter transaction summary

In the following image we can see two purple spikes that represent two errors, followed by a third one after a
while. These are caused by the restart of the Web2 node. Notice that the third error is not caused by the
actual restart but by one of the two previously failed transactions.

8.7.3 DB Records summary


Transaction     Expected   Actual   Missing
CUSTOMER        6000       5997     3
ACCOUNT         12000      11992    8
CASH DEPOSITS   12000      11992    8
Total missing: 19           Adjusted total: 5

Given that 3 customers failed, we expect 6 missing account records and 6 missing cash deposit records to
follow. Instead, there were an additional 2 account records and 2 cash deposit records missing. The 2 extra
missing cash deposit records are explained by the 2 missing accounts. Therefore, the adjusted total is 3+2+0=5.

8.7.4 System resources usage

This test was carried out in conjunction with the restart of the App layer VM nodes, as mentioned in 8.7 Restart of
Web layer VM Nodes. During the test, the behaviour was as expected for all servers that were restarted.

Just before the restart of the last Web Server, all the transactions had already been injected, so the test was de facto
over. We decided not to repeat the test just to see Web Server 3 restarting, as we had already proven the
resilience of the system. The most relevant resource utilisation diagrams for Web Servers 1 and 2 follow.

Web Server 1

Initiated the restart at 16:00:10. Stopped at 16:00:11 and restarted at 16:02:47.

Web Server 2

Initiated the restart at 16:05:15. Stopped at 16:05:16 and restarted at 16:07:53.

The CPU utilisation of Web Servers 2 and 3 was typically half that of Web Server 1. This was the case in all
our tests, because Web Server 1 runs the DM on behalf of the cluster.

The network activity observed after the restart is not typical; we would expect it to bounce right back up to
the previous levels. The reason is that the test was drawing to a close: the transactions were already ramping
down and the parallel JMeter sessions were slowly reaching zero, according to the test rules.

9 Messaging layer HA tests


9.1 IBM MQ failures
The procedure for this test was as follows:

1. Kill the QM process on host 1. Expected behaviour:

   - The standby QM on host 2 will be started.
   - All messages will be handled by the QM on host 2.
   - Some failures are expected if marooned messages time out.

2. Start the QM process on host 1. Expected behaviour:

   - The QM on host 1 starts in standby mode.
   - No disturbance is expected.

3. Kill the QM process on host 2. Expected behaviour:

   - The standby QM on host 1 will take over.
   - All messages will be handled by the QM on host 1.
   - Some failures are expected if marooned messages time out.

4. Start the QM process on host 2. Expected behaviour:

   - The QM on host 2 starts in standby mode.
   - No disturbance is expected.

5. Reboot host 1. The expected behaviour is similar to killing the QM on host 1.

Downtime is the time taken by the standby to become active. For ease of reference, the QM on host 1 is
referred to as QM 1 and the QM on host 2 as QM 2.
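
A simple way to measure the takeover time, assuming the queue manager name used in the captures below, is to poll its status once per second and read off the timestamps at which it changes; the loop itself is an assumption, not part of the recorded procedure:

# Timestamp the dspmq status of the multi-instance queue manager once per second.
while true; do
    echo "$(date) $(dspmq -m QM1)"
    sleep 1
done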

9.1.1 Kill Queue Manager and MQ VM restart procedure

killing-QM1.txt

===========MQ Status of both host before kill


[eg15ph04:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph04:root] / #
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running as standby)
[eg15ph05:root] / #
===============killing QM1 ======
dspmq -m QM1
[eg15ph04:mqm] /mqm # ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print
$2,$(NF-1), $(NF)}'
7930338 -u mqm
[eg15ph04:mqm] /mqm $
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
===========Status QM1and MQ2 after killing QM1=====================================
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #

[eg15ph04:root] / # dspmq -m QM1


QMNAME(QM1) STATUS(Running elsewhere)
[eg15ph04:root] / #

JMeter errors Adjusted errors Downtimes (secs)

6 2 9

killing-QM2.txt

===========MQ1 restart=====================================
[eg15ph04:root] / #strmqm -x QM1
[eg15ph04:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running as standby)
[eg15ph04:root] /
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #
========Kill MQ 2=====================
[eg15ph05:mqm] /mqm # ps -ef | grep "QM1 -x" | grep -v NF | grep -v grep | awk '/QM1/ {print
$2,$(NF-1), $(NF)}'
7930338 -u mqm
[eg15ph05:mqm] /mqm #
date;kill -9 $(ps -ef | grep "QM1 -x" |grep -v NF | grep -v grep | awk '/QM1/ {print $2}')
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running elsewhere)
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph05:root] / #

JMeter errors Adjusted errors Downtimes (secs)

44 28 26

rebooting-VM-(QM1).txt

======================Rebooting Host 1 =======================


[eg15ph05:root] / #strmqm -x QM1
eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running as standby)
[eg15ph04:root] / # dspmq -m QM1
QMNAME(QM1) STATUS(Running)
[eg15ph04:root] / #
[eg15ph04:root] / #reboot
[eg15ph05:root] / # dspmq -m QM1
QMNAME(QM1)

JMeter errors Adjusted errors Downtimes (secs)

2 1 31

9.1.2 Result summary

We experienced data loss after killing the Queue Manager. This would have happened during the switchover
to the standby Queue Manager. Below is a sample of the errors as seen in JMeter.

9.1.3 Records count

The following tables show the data loss for the different transactions done through JMeter.

We observed that failover from QM on host 2 (QM 2) to QM on host 1 (QM 1) took longer and produced
more errors than failover from QM 1 to QM 2.

Failover from QM 1 to QM 2

Transaction   Expected   Actual   Missing
CUSTOMER      6000       5999     1
ACCOUNT       12000      11998    2
DEPOSITS      12000      11997    3
Total missing: 6          Adjusted total: 2

Failover from QM 2 to QM 1

Transaction   Expected   Actual   Missing
CUSTOMER      6000       5998     2
ACCOUNT       12000      11990    10
DEPOSITS      12000      11970    30
Total missing: 42         Adjusted total: 28

Restart VM 1 (host 1)

Transaction   Expected   Actual   Missing
CUSTOMER      6000       6000     0
ACCOUNT       12000      11999    1
DEPOSITS      12000      11999    1
Total missing: 2          Adjusted total: 1

9.1.4 JMeter aggregate summary

9.1.5 JMeter transaction summary

This diagram is indicative of the disturbance in the transactions. At the time of the QM 1 failure, two
transactions fail, which happened to be en route through QM 1. Afterwards we see ripple effects, because one
of the failing transactions happened to be a customer creation.

As there is no logic in the JMeter script to run transactions conditionally, all subsequent Account and Cash
Deposit transactions for this missing customer fail because of the Temenos Core Banking application logic.

9.1.6 System resources usage

QM 1 was killed at 11:04:07. The switch from the primary QM to the standby QM can be seen in the
following two network utilisation diagrams for QM 1 and QM 2.

QM 1

The red line indicates the time of failure of QM 1. As expected all network activity drops to near zero.

QM 2

Immediately after the QM 1 failure, QM 2 picked up the load. The Web Servers were not disrupted for long:
Web1 shows a hiccup at the time of the event, and Web2 and Web3 show similar behaviour.

Web1

Similar to Web Servers, App servers have a small disruption at the time of the event.

App2

The effects are delayed a little as we move lower in the architecture. If the DB is impacted at all, the effect is
mostly lost in the normal runtime fluctuations.

DB1

Reverting to the first QM has a similar effect. Now QM 2 is killed so that QM 1 will take over.

QM 2

QM 2 was killed at 10:37:35. The red line signifies this event.

QM 1

Immediately after QM 2 goes down, QM 1 takes over again. The Web Servers show the usual disruption at
the time of the event.

Web1

We see similar network utilisation diagrams for Web2 and Web3. The Application Servers show a disruption
similar to that of the Web Servers; for example, observe App1:

This time the DB servers are impacted by the QM failover. In the previous test, when QM 1 failed over to
QM 2, there was little or no disruption. In addition, we do not see any time shift of the disruption as we
move lower in the architecture.

DB1

10 Data layer HA tests


10.1 DB node failure
In preliminary tests we found that, after the restart of the node with the simulated failure, the load was not
distributed equally between the servers, for the following two reasons:

- The load was not high enough: the DB servers were sized so that each one alone was more than capable
of handling one JMeter instance.

- The messages from a JMeter thread tended to be routed to the same DB node, as long as there was
capacity to serve them.

To overcome this effect and to prove the rebalancing of the DB nodes, we used 5 instances of JMeter with 10
threads each. Three of them were started from the beginning, and an additional instance was started after the
restart of each of the two DB nodes. The test script was as follows (a sketch of the shutdown and startup
commands used in steps 3-4 and 6-7 is given after the list):

1. Start Oracle Console.

2. Start injection with 3 JMeter instance (10 threads each).

3. Execute the following command on DB 1:

Shutdown transactional

4. Start DB 1.

5. Start injection with one additional JMeter instance (10 threads).

6. Execute the following command on DB 2:

Shutdown transactional

7. Start DB 2.

8. Start injection with one additional JMeter instance (10 threads).
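
The shutdown and startup in steps 3-4 and 6-7 were presumably issued from SQL*Plus on the targeted node, as in the COB failure transcripts later in this report; a minimal sketch of that sequence:

# Stop the local instance once in-flight transactions complete, then bring it back up.
sqlplus / as sysdba <<EOF
shutdown transactional;
startup;
EOF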

The test results were as expected:

- Inflight transactions that had been started on the restarted node are expected to fail.

- All traffic will be moved to the remaining nodes.

- Once the node has fully started, traffic will be balanced between the three nodes.

The timing of the DB node shutdowns was as follows:

DB node Stop time Command JMeter errors Adjusted errors Start time Command

Node 1 09:01 shutdown transactional 48 23 09:04 startup

Node 2 09:06 shutdown transactional 44 22 09:10 startup

10.1.1 Result summary

The DB nodes were taken out of service with only a small disturbance and, when started again, they
successfully rejoined the cluster. The total number of transactions missing from the DB was 45 across both
shutdowns.

10.1.2 Record count

The following table summarises the effect this test had on the application data.

Transaction   Expected   Actual   Missing
CUSTOMER      6000       5992     8
ACCOUNT       12000      11979    21
DEPOSITS      12000      11947    53
Total missing: 82         Adjusted total: 45

Given that 8 customers failed, we expect 16 missing accounts to follow. Since there were 21 missing accounts,
we had an additional 21-16=5 failed account transactions. Because 21 accounts are missing, we expect 21
missing deposits as well; therefore 53-21=32 missing deposits cannot be explained by the missing accounts. So
the failed transactions actually caused by the test were 8+5+32=45, and all the rest were due to the
application logic and the JMeter test script design.

10.1.3 Enterprise Manager DB nodes view

Before DB 1 goes down

In this view we can see all 3 DB nodes actively serving the application.

After DB 1 goes down

The two remaining nodes take up the entire load.

After DB 1 is back up and DB 2 down

Here we can see that DB 1 is contributing to the cluster equally with DB 3. DB 2 is missing, but this does not
stop the service.

10.1.4 Error summary

Below is an example error, recorded in JMeter.

10.1.5 System resources usage

During the test, the DB behaved well and balanced the load correctly. The DB capacity was quite large, so the
stress never reached high levels.

DB 1

This is the full screenshot of the Enterprise Manager (EM) Console for the DB 1 load, taken after the restart
of DB 1 (9:01 - 9:04). Observe that the node starts picking up load.

At 9:06 DB 2 goes down, with no adverse effects on the DB 1 node. After DB 2 comes back up again at 9:10,
operation is normal. The increase that takes place at around 9:14 is purely operational and not a consequence
of the DB 2 restart.

DB 2

This screenshot is from DB 2, after its restart (9:06-9:10). The utilisation diagram is similar to that of DB 1,
apart from the load metrics missing before the restart.

11 COB HA tests
Using the baseline database, failure tests were performed during the COB. The test involved:

- Killing the Application layer AS processes, the Admin server, the Node Agents and IHS one by one, with
each kill performed only after the previous failure had been restored.

- Shutting down one of the RAC nodes.

Expected result:

- COB is expected to take longer when app servers in the app layer or DB nodes are killed.

- These failures should not stop COB from progressing and finishing successfully.

- When COB is finished, there should be no difference in database performance compared to the baseline
test.

11.1 COB under failure condition


COB was started in servlet mode with two COB agents running on each one of the three application layer
nodes.

11.1.1 Kill procedure

cob_under_failure_condition.txt

Test-AT8 #1 > COB Failure


APPPR1
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:47:37 GMT 2018
10486206 eg15ph09CellManager01 dmgr
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/dmgr/ {print $2}')
Tue Jan 16 07:47:48 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:47:52 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:47:56 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:48:02 GMT 2018
[eg15ph09:t24user] /Temenos $ $DMGR_HOME/bin/startManager.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Dmgr01/logs/dmgr/startServer.log

ADMU3100I: Reading configuration for server: dmgr


ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server dmgr open for e-business; process id is 10223888
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/dmgr/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:48:47 GMT 2018
10223888 eg15ph09CellManager01 dmgr
[eg15ph09:t24user] /Temenos $
------

[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:19 GMT 2018
9503156 eg15ph09Node01 nodeagent
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:49:27 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:31 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:49:39 GMT 2018
[eg15ph09:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9765370
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:50:16 GMT 2018
9765370 eg15ph09Node01 nodeagent
------------

[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:07 GMT 2018
6750702 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:51:15 GMT 2018
[eg15ph09:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:51:18 GMT 2018
10879470 eg15ph09Node01 AppSrv01
[eg15ph09:t24user] /Temenos $

=================
APPPR2:

[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:13 GMT 2018
8716644 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:52:19 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:23 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:24 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:26 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:52:50 GMT 2018
[eg15ph10:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file

/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 10354998
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:27 GMT 2018
10354998 eg15ph10Node01 nodeagent
[eg15ph10:t24user] /Temenos $
----------

[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:53:54 GMT 2018
9634140 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:54:00 GMT 2018
[eg15ph10:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:04 GMT 2018
6291928 eg15ph10Node01 AppSrv02
[eg15ph10:t24user] /Temenos $
=====================
APPPR3:
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:54:55 GMT 2018
5439912 eg15ph11Node01 nodeagent
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/nodeagent/ {print $2}')
Tue Jan 16 07:55:01 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:12 GMT 2018
[eg15ph11:t24user] /Temenos $ $NODE_HOME/bin/startNode.sh
ADMU0116I: Tool information is being logged in file
/u01/3rdParty/As/IBM/WebSphere/AppServer/profiles/Custom01/logs/nodeagent/startServer.log
ADMU3100I: Reading configuration for server: nodeagent
ADMU3200I: Server launched. Waiting for initialization status.
ADMU3000I: Server nodeagent open for e-business; process id is 9503016
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/nodeagent/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:55:51 GMT 2018
9503016 eg15ph11Node01 nodeagent
----------------------------------
[eg15ph11:t24user] /Temenos $
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:00 GMT 2018
7864616 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $ date;kill -9 $(ps -ef | grep java | awk '/AppSrv/ {print $2}')
Tue Jan 16 07:56:08 GMT 2018
[eg15ph11:t24user] /Temenos $ date;ps -ef | grep java | awk '/AppSrv/ {print $2,$(NF-1), $(NF)}'
Tue Jan 16 07:56:11 GMT 2018
7864618 eg15ph11Node01 AppSrv03
[eg15ph11:t24user] /Temenos $

=================================
DBPR1 : ~ couple of minutes after APPPR3.
[LIVE-DB3 root@eg15ph18 /]# su - oracle

[LIVE-DB3 oracle@eg15ph18 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jan 16 07:57:00 2018

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL> select 1 from dual


2 ;

1
----------
1

SQL> shutdown abort;


ORACLE instance shut down.
SQL>
SP2-0042: unknown command "" - rest of line ignored.
SQL> select 1 from dual;
select 1 from dual
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

SQL> select 1 from dual;


select 1 from dual
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

SQL> startup;
ORACLE instance started.

Total System Global Area 1.0255E+10 bytes


Fixed Size 2764064 bytes
Variable Size 1811940064 bytes
Database Buffers 8153726976 bytes
Redo Buffers 286781440 bytes
Database mounted.
Database opened.
SQL> select 1 from dual;

1
----------
1

SQL>

Started all TSM again!

=======================
DBPR2 : ~8:35 GMT

[LIVE-DB1 root@eg15ph16 /]# su - oracle


[LIVE-DB1 oracle@eg15ph16 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jan 16 08:34:29 2018

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL> select 1 from dual;

1
----------
1

SQL> shutdown abort;


ORACLE instance shut down.
SQL> select 1 from dual;
select 1 from dual
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

SQL> startup;
ORACLE instance started.

Total System Global Area 1.0255E+10 bytes


Fixed Size 3550496 bytes
Variable Size 2013266656 bytes
Database Buffers 7952400384 bytes
Redo Buffers 285995008 bytes
Database mounted.
Database opened.
SQL> select 1 from dual;

1
----------
1

SQL>

===================================
DBPR3: ~8:35 GMT

[LIVE-DB2 root@eg15ph17 /]# su - oracle


[LIVE-DB2 oracle@eg15ph17 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.1.0 Production on Tue Jan 16 08:34:34 2018

Copyright (c) 1982, 2013, Oracle. All rights reserved.

Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Advanced Analytics and Real Application Testing options

SQL> select 1 from dual;

1
----------
1

SQL> shutdown abort;


ORACLE instance shut down.
SQL> select 1 from dual;
select 1 from dual
*
ERROR at line 1:
ORA-01034: ORACLE not available
Process ID: 0
Session ID: 0 Serial number: 0

SQL> startup;
ORACLE instance started.

Total System Global Area 1.0255E+10 bytes


Fixed Size 2764064 bytes
Variable Size 1811940064 bytes
Database Buffers 8153726976 bytes
Redo Buffers 286781440 bytes
Database mounted.
Database opened.
SQL> select 1 from dual;

1
----------
1

SQL>

Kill process Kill time Up time

AppDM 11:27:43 11:28:23

AppNA1 11:29:24 11:31:41

AppAS1 11:34:12 11:34:14 (TSM @ 11:36:58)

AppNA2 11:32:01 11:32:29

AppAS2 11:40:22 11:40:32 (TSM @ 11:41:41)

AppNA3 11:32:55 11:33:24

AppAS3 11:46:15 11:46:22 (TSM @ 11:47:25)

The table above shows the time at which each process was killed and the time it was back up and running.
COB is driven by threads running inside WebSphere at the application layer.

When the Application Server process is killed and automatically restarted, the Temenos Service Manager
(TSM) needs to be manually restarted as well, or COB will have no workers to make progress. That is
captured in the table above by the time at which TSM was started (i.e. "TSM @ HH:MM:SS").

11.1.2 Result summary


COB start time   COB end time   Duration          Baseline duration   Comments
11:26:24         11:57:52       31 mins 28 secs   31 mins 02 secs     Times extracted from the como file.

All Core Banking services, and indeed COB itself, are very resilient and flexible. Agents can be started
automatically or on demand with no disturbance to the process.

As expected, COB finished successfully with no errors and no differences in the record count, although it
took slightly longer because of the time lost restarting the agents.

11.1.3 System resources usage

During this test the load was not coming from the Web Server / MQ; it was generated by the tSA agents
running inside the WebSphere JVM processes. As a consequence, the DM and NA do not have any impact
on the operation of the COB.

Failure of AppDM

The COB is not affected at all, which means that CPU, memory and network are not affected either. For
example, note in the following diagram the CPU activity for the duration of the DM failure:

There is a spike, but it is hardly related to an extraordinary event; rather, it is part of the normal ups and
downs in COB activity as it switches between CPU-intensive and DB-intensive tasks.

The only diagram that signifies an exceptional event is the disk activity. COB is not very intensive on the disk
of the application server, so re-initialising the DM produces disk activity that stands out:

Failure of AppNA 1,2 and 3

Similarly to the DM failure, an NA failure does not affect COB in the slightest. Moreover, disk activity is not
significant either, because the NA is by nature not a heavy process. The only impact on the system is
somewhat increased process switching during the failure. The following diagram shows the process switches
for App 1.

The blue circle shows the NA failure. The left spike (orange circle) is due to the DM failure; the right spike
(orange circle) is due to the AS failure, which is discussed in Failure of AppAS 1, 2 and 3.

Failure of NA in App 2 exhibits similar symptoms:

The blue circle shows the NA failure; the orange one is due to the AS failure discussed in Failure of AppAS 1,
2 and 3. Finally, we can now easily recognise the failure of the NA in App 3:

The blue circle shows the NA failure and the orange one, as usual, the AS failure.

Failure of AppAS 1, 2 and 3

Killing the AS kills the JVM and all of its threads, which means that all COB services on that server die
unexpectedly. This directly affects the execution of COB, which is interrupted on the affected server, and all
uncommitted transaction blocks from the unexpectedly terminated COB services are rolled back in the DB.
The following graphs show the resource impact on App 1:

The circled area is the part of the diagram that shows the AS failure. We can see that the Temenos Core
Banking services die and, after some brief CPU activity due to the automatic restart and initialisation of the
AS, the server stops contributing to the COB service. Note that COB carries on normally because of the
agents running on the rest of the servers in the cluster. After the service manager (TSM) is restarted, we do
not observe immediate activity, because TSM takes some time to bring the required COB agents back up.

Network utilisation tells the same story:

A similar picture is observed in App 2 when the server goes down:

Finally, the App 3 behaviour:

The effects on the DB are not severe, since in each failure scenario only a third of the agents are affected.
CPU is not affected much, because the remaining COB agents continue to compete for faster execution. For
example, the following is the CPU diagram of DB 1; the picture on DB 2 and DB 3 is similar. Note the
insignificant CPU fluctuations during the three AS failure events:

Network is much more affected, since all transaction inflow from 1 of the 3 servers is interrupted. All three
DB servers have similar network utilisation diagrams:

In all three DB servers there is an impact on network traffic, but nothing significant regarding CPU and
memory.

11.2 COB with database VM restart


Restarting a DB VM did not have any impact on the running COB, as incomplete jobs resumed processing
on the remaining nodes. In this test there were two scenarios:

1. DB 1 goes down and stays down for the duration of the COB.

2. DB 1 goes down and is restarted before the COB finishes.

Test   COB start time   COB end time   Duration          Baseline duration
1      12:37:26         13:04:28       27 mins 02 secs   31 mins 02 secs
2      13:26:34         13:55:24       28 mins 50 secs   31 mins 02 secs

We did not expect to see COB finishing earlier than the baseline test. When COB is executed, the ordering of
the COB jobs is determined at the start of COB. This means that when COB is run several times, even against
an identically restored DB under similar conditions, the duration is expected to vary by a few minutes.

The difference of 2 minutes for test #2 can easily be explained by this variation in COB duration. The
difference of 4 minutes for test #1 is more significant. Since one DB node was down for most of the duration
of this test, some of the time difference could be attributed to the decreased cluster communication overhead
between the DB nodes; this is an assumption which could not be verified.
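
In both scenarios the DB node was taken out by restarting its VM. An equivalent, more targeted way to take a single RAC instance offline for this kind of test, assuming the database unique name used in the DR tests and a placeholder instance name, would be:

# Stop one RAC instance abruptly and bring it back (the instance name is a placeholder).
srvctl stop instance -d t24db -i t24db2 -o abort
srvctl start instance -d t24db -i t24db2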

11.2.1 System resources usage

Both tests (shutdown and restart) were successful. The resource utilisation for the active DB nodes does not
change noticeably when the targeted node goes offline.

DB node 2 goes offline during COB

Both the CPU and the network diagrams show the failure event clearly. The CPU utilisation of DB 1 is
shown in the following graph:

The red vertical line shows the time of DB 2 going down. DB 2 on the other hand loses most CPU activity
after the time of the event:

We do not need to show the CPU utilisation diagram for all the Application servers; all of them are similar to
the following (App 1):

In all diagrams, both DB and App, we notice a dip in activity about two minutes after DB 2 goes down. This
is probably due to the rollback of the transactions submitted to DB 2, which temporarily blocks the normal
progress of COB.

DB node 2 restarts during COB

Both the CPU and the network diagrams show the failure event clearly. The CPU utilisation of DB 1 is
shown in the following graph:

The red line represents the time DB 2 goes offline and the green line the time it goes back online. DB 2 on the
other hand loses most CPU activity after the time of the event, until the restart:

There is no reason to show the CPU utilisation diagram of all Application servers; all of them are similar to
the following (App 1):

The peak in activity at around 13:38 occurs too long after the shutdown of DB 2 to be caused by it. Most
probably it is purely operational (for example, a multithreaded, CPU-intensive job). This fluctuation is not
reflected in the DB nodes.

12 DR tests with online traffic


12.1 Site switchover
- Generate traffic with JMeter and let it run for 10 minutes.

- Execute the switchover.

- Online traffic is diverted by updating the hosts file on the JMeter server to point to the DR load
balancer.
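
A minimal illustration of the diversion on a UNIX-style JMeter host; the host name and addresses are placeholders, since the real names are not shown in this report:

# /etc/hosts on the JMeter server - the application host name is repointed
# from the live load balancer to the DR load balancer (all values are placeholders).
# 10.0.1.10   t24app.example.com   <- live entry, commented out for the test
10.0.2.10   t24app.example.com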

12.1.1 Switchover procedure


Switchover start Switchover finish Downtime (sec) Total service unavailability (sec)

14:40:16 14:42:45 149 ~270

12.1.2 Result summary

The JMeter results are not important in this particular test, as the errors were dependent only on the duration
of the waiting time during the manual switchover. All transactions before and after the switchover were
successful.

12.1.3 Transactions per second

All failures took place during the switchover, as shown by the JMeter TPS diagram.


12.1.4 DB Records summary

The exact number of missing records is not important in this test; the longer we chose to wait before
performing the manual switchover, the longer the online traffic would keep failing.

Note that the adjusted total in this case should be about the same as the total. Although there is a business
reason why some transactions would fail when a customer record is missing, in this case we know that
communications were down, so effectively all failures were caused by that.

Transaction   Expected   Actual   Missing

CUSTOMER      6000       5116     884

ACCOUNT       12000      10232    1768

DEPOSITS      12000      10229    1771

Total missing: 4423 (884 + 1768 + 1771); adjusted total: ~4423

12.1.5 System resources usage

Since the traffic is diverted from the live site to the DR site, the resource utilisation diagrams are
complementary; on the left side are the live servers and on the right the DR ones.

For brevity, and because of the live vs DR topology differences, only the primary cluster server from each
layer is compared. The red line on the left-hand diagram marks the switchover on the live site and the green
line on the right-hand diagram marks the switchover event on the DR site.


Load balancers

Web servers


Application servers

DB servers

12.2 Site failover

Site failover test summary:


• Generate traffic with JMeter and let it run for 10 minutes.

• Simulate the failure: run srvctl stop database -d t24db -o abort, wait a couple of minutes and then
execute the failover.

• Divert online traffic by updating the hosts file on the JMeter server to point to the DR Load Balancer.

12.2.1 Failover procedure


srvctl stop database -d t24db -o abort

Failover start   Failover finish   Downtime (sec)   Total service unavailability (sec)

17:11:20         17:13:42          142              242
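
For reference, after the abort above the promotion of the DR database would typically be performed through the
Oracle Data Guard broker; the commands below are an illustrative sketch only, with t24db_dr as a placeholder
standby name rather than the name used in the test environment:

srvctl status database -d t24db       # confirm the live site instances are down
dgmgrl sys@t24db_dr
DGMGRL> FAILOVER TO 't24db_dr';       # promote the DR standby to primary
# once the original primary has been repaired, it can be reinstated as the new standby:
DGMGRL> REINSTATE DATABASE 't24db';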

12.2.2 Result summary

The JMeter error counts are not important in this particular test, as the number of errors depended only on
how long we waited before executing the failover. The important outcome is that the failover took place and
normal operation continued on the DR site.

12.2.3 Transactions per second

After the failover, normal operations resumed.


12.2.4 DB records summary

The exact number of missing records is not important in this test; the longer we chose to wait before
executing the failover, the longer the online traffic would keep failing.

Note that the adjusted total should be about the same as the total. Although there is a business reason why
some transactions would fail when a customer record is missing, in this case we know that almost all the
failures were caused by the communications being down.

Transaction   Expected   Actual   Missing

CUSTOMER      6000       5880     120

ACCOUNT       12000      11664    336

DEPOSITS      12000      11540    460

Total missing: 916 (120 + 336 + 460); adjusted total: ~916

12.2.5 System resources usage

This test is quite similar to the switchover test in terms of resource utilisation. Again, as the traffic is
diverted from the live site to the DR site, the resource utilisation diagrams are complementary: on the left
side are the live servers and on the right the DR servers.


For the sake of brevity, and because of the live vs DR topology difference, only the primary cluster members
from each layer are compared. The red line on the left-hand diagram marks the failover event on the live site
and the green line on the right-hand diagram marks the corresponding event on the DR site.

Load balancers


Web Servers

Application servers


13 DR tests with COB


13.1 Site switchover

The objective of this test was to execute a switchover during COB. The expected result was that COB would be
able to finish normally, with no errors, on the DR site.

COB was started in servlet mode with two agents running on each of the three live site nodes. At 15.93%
through the System Wide Stage, the switchover was initiated without stopping COB.

After the switchover finished, we started two COB agents in servlet mode on each of the two application
servers of the DR site. COB continued normally and completed with no errors.

COB start   COB stop   Stage         TSM/COB   Switchover start   Switchover end   COB restart   COB finished

10:10       10:16      System wide   15.93%    10:17:56           10:20:30         10:22         10:47

The COB downtime for the switchover was about 6 minutes. Although the switchover itself took approximately
3 minutes, the database needed at least a further 2 minutes to reach a fully functioning state.


13.1.1 System resources usage

When running COB, only the Application and Data layers are utilised, so these are the layers we concentrate
on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from
these two layers are compared.

Application Servers

Database Servers


13.2 Site failover

The objective of this test was to execute a failover during COB. The expected result was that COB would be
able to finish normally, with no errors, on the DR site.

13.2.1 Test procedure

COB was started in servlet mode with two agents running on each of the three live site nodes. At 72% through
the System Wide Stage, the failure was simulated without stopping COB. After the failover finished, we started
two COB agents in servlet mode on each of the two application servers of the DR site. COB continued normally
and completed with no errors.

13.2.2 Test summary


COB start   Stage         TSM/COB   Failover start   Failover end   COB restart   COB finished

14:05       System wide   72%       14:21:20         14:24:21       14:25         14:35:59


13.2.3 COB monitor after failover

The image below shows COB resuming on the DR site after a successful failover process:

13.2.4 System resources usage

When running COB, only the Application and Data layers are utilised, so these are the layers we concentrate
on. For brevity, and because of the live vs DR topology difference, only the primary cluster servers from
these two layers are compared.

Application Servers


14 Scalability testing: adding physical nodes


Due to limited resources, this test was performed by first taking down one node from both the Web layer and
the App layer, so that the architecture looked as though only two nodes per layer were configured. JMeter
scripts were then run to inject transactions, and the standby Web and App layer nodes were later brought back
online.
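
As a sketch of how a node can be taken out of the cluster gracefully before the test (the profile paths and
server names below are placeholders, not the actual values used in the test environment):

# on the Web 3 / App 3 machine
/opt/IBM/WebSphere/AppServer/profiles/Node03/bin/stopServer.sh server1   # stop the cluster member
/opt/IBM/WebSphere/AppServer/profiles/Node03/bin/stopNode.sh             # stop the node agent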

Both scalability tests (i.e. adding a node to the Web layer and adding a node to the App layer) were executed
within the same test run. The test was set up with one JMeter instance pumping transactions with 10 concurrent
users, each executing 200 full test cycles. One test cycle is a user login, creation of a customer, a foreign
and a local account, and so on, through to logout.
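
As an illustration of this load profile only (the test plan file and property names are assumptions, not taken
from the actual test assets), such a run can be launched in JMeter non-GUI mode with the user and loop counts
passed as properties:

# 10 concurrent users, each running 200 full test cycles
jmeter -n -t t24_online.jmx -Jusers=10 -Jloops=200 -l scalability_results.jtl
# inside the Thread Group, the corresponding fields would read:
#   Number of Threads (users): ${__P(users,10)}
#   Loop Count:                ${__P(loops,200)}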

Node added When

Web 3 17:01:26

App 3 17:06:08

As observed in previous tests, graceful shutdowns of servers did not result in any missing DB records. In this
case there is no shutdown but the addition of a server to the cluster; it involves no sudden failure, forced
restart or anything else unexpected, so the same principle applies.

14.1 Adding a Web layer node to the existing cluster

14.1.1 Test result

IHS picked up the new node, online traffic was balanced across all three nodes, and there was no impact on the
existing traffic.
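
Bringing the standby node back amounts to restarting its node agent and its cluster member; a minimal sketch,
with placeholder profile paths and server names:

# on the Web 3 machine
/opt/IBM/WebSphere/AppServer/profiles/WebNode03/bin/startNode.sh            # node agent rejoins the cell
/opt/IBM/WebSphere/AppServer/profiles/WebNode03/bin/startServer.sh server1  # start the cluster member
# the IHS plug-in resumes routing to this member once it reports as started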

14.1.2 System resources usage

The network activity demonstrates the load distribution and the behaviour of the cluster at the time the new
server is added. The high peaks and low valleys in the following diagrams are part of normal operation during
the test and occur even before the node addition, so they should not be interpreted as consequences of it.


This diagram shows the network activity of the Web 2 node. For convenience there are two marks: the left one
is the Web 3 node addition and the right one is the App 3 node addition discussed in 14.2 Adding an App layer
node to the existing cluster. The Web 1 network activity diagram gives a similar picture, but for clarity it
is not presented here.

The following diagram is the network activity from the Web 3 node that was actually added at 17:01:26:


14.2 Adding an App layer node to the existing cluster

14.2.1 Test result

The load from the Web layer was uniformly distributed across all servers in the App layer and there was no
impact on the existing traffic.
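
One simple way to confirm that the new member has registered with the cell and is running is a wsadmin query
from the Deployment Manager; the profile path, node and server names below are placeholders:

/opt/IBM/WebSphere/AppServer/profiles/Dmgr01/bin/wsadmin.sh -lang jython \
  -c "print AdminControl.completeObjectName('type=Server,node=AppNode03,name=server1,*')"
# a non-empty MBean name means the server is started and visible to the cell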

14.2.2 System resources usage

As with the test above, we present the network traffic diagram as the most revealing one. The following is the
diagram for App 2, which is part of the cluster for the whole duration of the test:

As before, the high peaks and low valleys are part of normal operation during the test and occur even before
the node addition, so they should not be interpreted as consequences of it.

Following is the App 3 network activity diagram:


A short while after being added to the Application layer cluster, the new node starts picking up load and
contributing as an equal member.


15 Glossary
Term            Definition

AppX            App1, App2 and App3 are abbreviations for Application Server 1, 2 and 3 respectively.

AS              WebSphere Application Server

ASX             AS1, AS2 and AS3 are abbreviations for the WebSphere Application Server on Server 1, 2 and 3 respectively, on either the Web layer or App layer.

BrowserWeb      Temenos Web UI, used for accessing the Temenos Core Banking.

COB             Close of Business

DB              Database

DBX             DB1, DB2 and DB3 are abbreviations for Database Server 1, 2 and 3 respectively.

DM              WebSphere Deployment Manager

DR              Disaster Recovery

EM              Oracle Enterprise Manager

HA              High Availability

IHS             IBM HTTP Server

IHSX            IHS1 and IHS2 are abbreviations for IBM HTTP Server 1 and 2 respectively.

JDK             Java Development Kit

MQ              Abbreviation for WebSphere MQ (now IBM MQ). See WebSphere MQ below.

NA              WebSphere Node Agent

NAX             NA1, NA2 and NA3 are abbreviations for the WebSphere Node Agent on Server 1, 2 and 3 respectively, on either the Web layer or App layer.

Oracle 12c      Oracle Database 12c is an enterprise-class database from Oracle. Its features include pluggable databases and multitenant architecture.

QM              Queue Manager

QMX             QM1 and QM2 refer to the Queue Manager running on MQ host 1 and MQ host 2 respectively.

T24             T24 was the initial name for Temenos's Core Banking solution.

T24Browser      Web User Interface for Temenos Core Banking.

TAFJ            Temenos Application Framework Java

TPS             Transactions per second

tSA             Core Banking Service Agent. This is a worker thread that can be assigned jobs by the TSM (see below) to run in the background.

TSM             Core Banking Service Manager. This is the manager process of all tSAs (see above) for a particular Application Server instance.

UI              User Interface

VM              Virtual Machine

WebX            Web1, Web2 and Web3 are abbreviations for Web Server 1, 2 and 3 respectively.

WebSphere       WebSphere Application Server (WAS) is an IBM web application server product. It is a software framework and middleware that hosts Java based web applications.

WebSphere MQ    Renamed IBM MQ in 2014. IBM's enterprise messaging solution allows independent and potentially non-concurrent applications in a distributed system to communicate securely with each other.
