
CS294-1 (RADS) Fall 2006

10/13/06
Andrew Dahl
Jeremy Schiff
Jesse Trutna
Lab 4: Failover with HAProxy
Methods
In this lab we combined our best web farm configuration from Lab 3 with HAProxy to
facilitate both load balancing and failover. This required adding HAProxy to the VM running
Lighttpd, as well as a dedicated failover machine running a set of backup dispatchers,
Memcached, and a MySQL server. The final configuration consisted of the following:
VM51:
Lighttpd
HAProxy
VM52 & VM53:
3x Dispatchers per server
VM54 & VM55:
MySQL server
Memcached
VM56 & VM57:
Load generators
VM50 (Backup):
3x Dispatchers
MySQL server
Memcached
(See appendix for the web farm configuration diagram.)
To test the load balancing and failover mechanisms of HAProxy we started by loading our
web farm with researchindex_load and successively bringing down each of the main dispatch and
database servers. After it was established that HAProxy was running correctly, we ran a baseline
test using the same methods as in Lab 3: researchindex_load was run on two load servers
simultaneously with a variable number of concurrent users: 1, 5, 10, 25, 50, 100, 500, 1000 (per
load server). This run was then repeated twice, first to simulate the failure of a dispatch server and
then to simulate the failure of a database server, each server being brought down after the
25-concurrent-user run.
Optimizations
Initially we had trouble getting HAProxy to work correctly and were also getting a large
number of 500 errors from researchindex_load when trying to generate traffic. After roughly ten
hours of tweaking, the system appeared to operate in a relatively stable fashion. The
following changes were made:
Added garbage collection to the dispatchers located in public/dispatch.fcgi on all
dispatch servers. This was accomplished by changing the following:
RailsFCGIHandler.process! → RailsFCGIHandler.process! nil, 10
This appeared to solve many of the 500 errors and kept dispatchers alive. In fact, the
opposite problem then arose: occasionally, under heavy load, the dispatchers would
become unresponsive and unkillable. (A sketch of the resulting dispatch.fcgi appears at
the end of this section.)
Turned off Ajax in the researchindex_load script (a feature commented as "#partially
working") by making the following change:
cfg_forms.use_ajax = true → cfg_forms.use_ajax = false
Tweaked HAProxy settings, especially timeout intervals and the number of connection
retries. A copy of our HAProxy configuration file is presented in the appendix.
Removed a call in researchindex_load to parse what it defined as a partial page. This
method was not defined in the file and generated an error when called. We don't think this
method call contributed to the 500 errors in most cases, but it was removed
for safety.
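For reference, the garbage-collection change above leaves dispatch.fcgi looking roughly like the sketch below. This is a minimal sketch assuming the stock Rails FastCGI dispatcher layout; the nil argument keeps the default log location and the second argument is the number of requests between garbage-collection passes.

    #!/usr/bin/env ruby
    # public/dispatch.fcgi -- minimal sketch, assuming the stock Rails layout
    require File.dirname(__FILE__) + "/../config/environment"
    require 'fcgi_handler'

    # nil keeps the default log location; 10 runs the garbage collector
    # after every 10 requests so long-lived dispatchers do not bloat
    RailsFCGIHandler.process! nil, 10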
Results
HAProxy test
1) Normal operation
2)
This picture depicts our complete web farm configuration with a load of 100
concurrent users. In GREEN you see the primary dispatchers, Memcached and
MySQL servers; in BLUE their corresponding backups.
3) 1x Dispatch server and 1x MySQL/Memcached server down
Here we have taken down a dispatch server and one MySQL/Memcached server.
You can see that the excess load has been taken over by the remaining
servers. Note, however, that the backup servers (in BLUE) are not yet used! They
will only be put into commission when all normal servers are down (this is how
HAProxy works; a configuration sketch illustrating this follows the walkthrough).
4) All normal servers down (except Lighttpd/HAProxy server)
Now all normal servers are down except the server running Lighttpd and
HAProxy. As you can clearly see from the picture, all load has now been shifted
from the servers that are down (in RED) to the backup servers (in BLUE). We
should note that researchindex_load is now generating errors due to timeouts as
a result of the backup servers being overloaded.
5) Return to normal operation
Finally, we bring all servers back up and see that load has correctly shifted back
from the backups to the normal servers. As can also be seen above, a
noticeable delay was present between the detection of a server failure or
resumption and the redirection of traffic. Additional configuration helped, but
never entirely resolved this problem. It is also possible that the HAProxy
reporting tool itself was not updating its statistics in real time and hence
contributed to these apparent delays, but this was difficult to
confirm or deny in practice. Neglecting these small delays, load was distributed
appropriately.
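To illustrate the backup behavior described in step 3, the sketch below shows how a backup server and health checks are declared in an HAProxy listen section. It is purely illustrative: the hostnames, port, timeouts, and check intervals are assumptions, not our actual settings (our real configuration file is in the appendix).

    listen mysql 0.0.0.0:3306
        mode tcp
        balance roundrobin
        retries 3
        contimeout  5000
        clitimeout 50000
        srvtimeout 50000
        # primary database servers, health-checked every 2 seconds
        server vm54 vm54:3306 check inter 2000 rise 2 fall 3
        server vm55 vm55:3306 check inter 2000 rise 2 fall 3
        # marked "backup": only used once every non-backup server is down
        server vm50 vm50:3306 check inter 2000 rise 2 fall 3 backup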
Baseline run
[Figure: # of Concurrent Users vs Response Time (as seen by user). x-axis: # of concurrent users (1-1000); y-axis: response time; series: VM56, VM57]
HAProxy does not seem to incur any additional overhead, at least not a
measurable one. In fact, compared to Lab 3, the response times seem to be
slightly better. We might attribute this to HAProxy performing better
load balancing than Lighttpd and/or to some of the changes mentioned
in the Optimizations section. In particular, the removal of the Ajax calls from
researchindex_load might contribute greatly to reducing the per-user load on the
server. Due to the difficulty of debugging some of the application problems we
were experiencing, we were not able to determine the exact impact of this
change.
[Figure: # of Concurrent Users vs # of Errors. x-axis: # of concurrent users; y-axis: # of errors; series: VM56, VM57]
There seems to be nothing unusual here. As expected from previous labs, errors
increase roughly linearly once an initial load threshold is reached and the
number of concurrent users grows. In addition, both failure cases exhibited
nearly identical error graphs and were therefore excluded. It is interesting that all the
error graphs matched so closely; in future work, it would be illuminating to
have a better understanding and classification system for errors, since this is not
the expected behavior.
[Figure: # of Concurrent Users and VM vs Processing Time (Seen by Dispatchers). x-axis: VM (50, 52, 53) & # of concurrent users (1-1000); y-axis: processing time; series: Render Time, Controller Time, Database Time]
This graph shows, as discussed in previous labs, that the database is where
most of the processing time is spent. This indicates that either reducing the
number of queries made or optimizing the current ones could give significant
performance improvements.
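Since Memcached is already part of the farm, one way to cut query volume would be to cache hot query results. The sketch below is only an illustration under assumed names: the Document model, the citation_count column, the cache key, and the expiry are hypothetical, and the memcache-client host strings are assumptions even though vm54 and vm55 did run Memcached.

    require 'memcache'

    # Hypothetical cache handle; host:port strings are assumptions
    CACHE = MemCache.new(['vm54:11211', 'vm55:11211'])

    # Fetch a frequently requested list from Memcached, falling back to MySQL
    def popular_documents
      docs = CACHE.get('popular_documents')
      if docs.nil?
        docs = Document.find(:all, :order => 'citation_count DESC', :limit => 20)
        CACHE.set('popular_documents', docs, 300)   # cache for five minutes
      end
      docs
    end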
Failure Runs
(Summary)
[Figure: # of Concurrent Users vs Response Time (as seen by user). x-axis: # of concurrent users (1-1000); y-axis: response time; series: Kill Database, Kill Dispatcher, Baseline, Kill Database (Adjusted)]
[Figure: # of Concurrent Users vs Response Time (as seen by user), detail of the 10-100 user range. x-axis: # of concurrent users; y-axis: response time; series: Kill Database, Kill Dispatcher, Baseline, Kill Database (Adjusted)]
In each of the failure runs, the failing server was killed after the
completion of the 25-user run, and the 50-user run was then started. As can
be seen in the above graphs, both failure cases show an approximately
proportional increase in response time over the 25-100 user range, as
expected. Unexpected results for the database and dispatcher failure
cases are explored further in the following sections. Due to the latency of
the shutdown command, it is possible that some requests from the 50-user
run were dispatched to the failing server.
Failover run (Dispatcher)
[Figure: # of Concurrent Users vs Response Time (seen by user). x-axis: # of concurrent users (1-1000); y-axis: response time; series: vm56, vm57]
Graph 1:
In this run we kill a dispatch server right before the 50-concurrent-user run
(50 users per load server). At this point, response time increases proportionally
compared to baseline. At around 100 concurrent users we see the response time
drop off slightly and then increase again after 500 users. This happens because
the single remaining dispatch server is being overloaded and starts to drop
requests, as is evident in the error graph and processing times shown and
described below.
[Figure: # of Concurrent Users vs Processing Time (seen by dispatchers). x-axis: VM (50, 52, 53) and # of concurrent users (1-1000); y-axis: processing time; series: View, Controller, Database]
Looking at the breakdown in processing time, we see that it grows until around 50
concurrent users and then, as a dispatcher is killed, decreases steadily.
This is likely caused by the dispatcher dropping requests or returning error
messages. It is also important to note that, for some reason, vm53 came back
online after being shut down, possibly because someone accidentally restarted the
VM; it then started servicing requests again. This is almost certainly the cause of
the dip in service time seen around the 500-user mark in the response time chart.
Failover run
(Database)
[Figure: # of Concurrent Users vs Response Time (seen by user). x-axis: # of concurrent users (1-1000); y-axis: response time; series: VM56, VM57]
The results for the database/Memcached failure case indicate trends similar to
the dispatcher failure case over the 25-100 user region, with a proportional
increase in response time. The overall times are much higher, as shown
in the summary graph. This was likely caused by resource
contention, as it is possible that another team began using these machines
while our tests were running. In addition, this test was run immediately after the
dispatcher failure test, and it is possible that some of the dispatcher/HAProxy
connections had not timed out yet, resulting in extremely long processing times
until the old connections were flushed. The overall trends remain similar and
match well if adjusted for the slowdown.
[Figure: VM & # of Concurrent Users vs Processing Time (Seen by dispatchers). x-axis: VM (50, 52, 53) & # of concurrent users (1-1000); y-axis: processing time; series: Render, Controller, Database]
Conclusions
In this lab we successfully set up HAProxy and a backup server to act as a failover for the
dispatch, MySQL, and Memcached servers. From this exercise we learned a number of
things:
1. HAProxy does not seem to add substantial overhead (if any) compared to our web farm
configuration from Lab 3. It might actually reduce response times due to better load
balancing.
2. Failover with HAProxy works, as demonstrated in the "HAProxy test" section. We did,
however, find that it takes HAProxy a considerable amount of time to properly rebalance load
when servers go up and down. This could be on the order of minutes. Of course, this could be
a configuration error; however, even with hours of tweaking HAProxy we could not seem to
improve the failover times (see the note on health-check parameters following this list).
3. In both cases of server failure, response time increases roughly linearly after the initial
failure. Once significant load is applied, the small processing time for errors and anomalies
in the experiment parameters create unexpected results.
4. In time and load sensitive distributed systems such as web farms, error pathology and
debugging systems are absolutely critical. Without access to correlated metrics, it is
extremely difficult to draw strong conclusions or isolate errors. In this report, the errors
reported by researchindex_load did not change over the various runs, even at saturation
levels. This is a strong indicator that a more detailed analysis of the specific errors that
occurred is needed.
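Regarding point 2, HAProxy's failure-detection latency is governed by its health-check parameters: a server is marked down after fall consecutive failed checks spaced inter milliseconds apart, and marked up again after rise consecutive successes. With the hypothetical directive below (hostname and port are assumptions), a failure would be detected in roughly six seconds, so delays on the order of minutes more likely come from in-flight connections waiting out long timeouts, or from the statistics-reporting lag noted in the Results section.

    server vm52 vm52:8000 check inter 2000 rise 2 fall 3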
Appendix
Web Farm Configuration
HAProxy Configuration
<to be included upon availability of server>
