
BIG-IP LTM Architecture and High Availability
Architecture Design Considerations
Not Meshed
This example illustrates standard connectivity. Here Switch 1 (SW01) and Switch 2
(SW02) are configured as regular device ports (Not Meshed).

1. In this example, Spanning Tree is not configured.


Port b (SW02) is an individual network segment.

Port a is VLAN 0 and Port b is VLAN 1.


From the perspective of each BIG-IP, each of these networks (a for BIG-IP1
and b for BIG-IP2) is available.
From the BIG-IP perspective, only the Active unit is replying to Address Resolution
Protocol (ARP) requests for Virtual Servers, Source (or Secure) Network Address
Translation (SNATs), Network Address Translation (NAT) and Floating Self IPs.
BIG-IP1 is the Active Local Traffic Manager (LTM).
Assuming a failover from BIG-IP1 to BIG-IP2, BIG-IP2 would send out a Gratuitous
ARP to force the switch to begin forwarding incoming traffic on Port b. Failover
happens without the switches needing to recalculate STP.
It is recommended that MAC (Media Access Control) Masquerade be configured on the
BIG-IP to eliminate any potential ARP cache issues. It is also recommended that VLAN
Fail-Safe be enabled so LTM can failover in the event of a switch failure.
In this design, the BIG-IP with Serial Failover enabled offers a very fast failover time
and, coupled with session mirroring, allows traffic to continue without failing. In some
applications and topology configurations, the application may experience a 1-3 second
pause before continuing. This is acceptable for most protocols.

When BIG-IP1 fails over to BIG-IP2, BIG-IP2 begins forwarding traffic almost instantly
through Port b on SW02.

However, if SW01 goes down for any reason, traffic will not fail over to SW02 unless
VLAN Fail-Safe is enabled on LTM. Also note that if any devices on the connected
segments do not honor the Gratuitous ARP and update their caches, MAC Masquerade
must be configured on all connected VLANs.
Summary
While this is the recommended configuration, an additional consideration is to use a
minimum of two ports per switch in an aggregated channel, for example one negotiated
with Link Aggregation Control Protocol (LACP), to mitigate individual port or cable
failures.
Disadvantage(s)
1) Some Network Administrators prefer meshed environments.
2) MSTP failover is relatively quick if set up correctly, which can create design
disagreements between architects.
3) LTM is fully responsible for failover so configuration specifics and monitoring
are critical.
Advantage(s)
1) This design is the most common deployment. It uses LTM failover mechanisms.
2) It provides the most reliable failover process with predictable traffic flows.
3) It provides a simple troubleshooting environment.
4) LTM will fail over with network or switch failure.
Meshed
This example illustrates a Spanning Tree with Cisco Catalyst where SW01 is configured
as the root bridge (Fully Meshed).

1. In this example, the root bridge determines that Port b (SW02) is in blocking. This
means that Port a is forwarding from the switch perspective, and is talking to
both BIG-IP1 and BIG-IP2.
From the perspective of each BIG-IP, Port a (BIG-IP1) is available and Port b
(BIG-IP2) is not available.
Also, from the BIG-IP perspective, the only unit that is passing traffic is the active
one, which means production traffic passes through switch Port a.
Note: Although no production traffic is passing through the standby unit (BIG-IP2),
Bridge Protocol Data Unit (BPDU) traffic will be forwarded through the interfaces as
required to maintain proper STP configuration.
Turning off STP pass through on BIG-IP1 and BIG-IP2 will cause the switches to change
Port b from blocking to forwarding, immediately causing bridge loops and network
failure due to broadcast floods. It should also be noted that BPDUs do not pass across
more than one 802.1q tagged VLAN as BPDUs are not 802.1q compliant.
2. Assume a failover from BIG-IP1 to BIG-IP2. When this happens, BIG-IP2 sends
out a Gratuitous ARP to force the switch to begin forwarding incoming traffic on
Port b instead of Port a.
From the switch perspective, we assume that neither of the ports is down and in
forwarding mode, so the failover happens without the switches needing to
recalculate STP.
The BIG-IP with the hardware failover offers a very fast failover time, and, coupled with
session mirroring, causes the traffic to continue without failing. In some applications and
topology configurations, the application may experience a 1-3 second pause before
continuing, which is acceptable for most protocols.
One point to consider: suppose that after failover has occurred and traffic is going
through BIG-IP2, BIG-IP1 loses its link on Port a. This will cause the switch to
recalculate STP so that Port b is placed in blocking and Port a is placed in forwarding.
However, because no production traffic is going through BIG-IP1, this does not interrupt
the normal flow of traffic.
With STP active on both Cisco switches, and with SW01 configured as the root bridge,
the links between SW01 and SW02 become the root path and SW02 Port b changes its
status to blocking mode.
During a BIG-IP failover, there will be a seamless transition from BIG-IP1 to BIG-IP2, or
from ports a to b.
Note: Port b on SW02 is in blocking mode and because Spanning Tree is not triggered,
failover happens instantly.
If there is a link failure on a and/or b, Spanning Tree will begin to recalculate. STP
requires recalculation time, which by default is 30 seconds, (15 seconds for listening, and
15 seconds more for learning). There may also be some additional seconds of delay due
to ARP packets sent by BIG-IP to discover Layer 2 information before it starts
forwarding packets.
These processes can take up to 40+ seconds to complete. Forty+ seconds of traffic delay
will definitely impact application timers used to keep sessions open, therefore requiring
clients to restart sessions.
Summary
Under certain conditions, seamless port failover is not available with this topology.
Disadvantage(s)
1) Fully meshed is not a common deployment.
2) Traffic flows are difficult to predict, since the Network Protocol is controlling
how traffic passes regardless of the LTM condition.
3) Fully meshed creates complex troubleshooting environments.
4) Adds an extra layer of complexity
Best Practices for Ensuring Consistent Fail-Over

The F5 BIG-IP Local Traffic Manager provides many options which can be configured to
optimize failure detection and recovery. However, there are a few basic tenets that should
be observed in most, if not all, environments.

1. Always create both primary and secondary failover addresses.

a. The secondary addresses will be used in the event that the primary addresses
are unavailable for any reason.

2. The failover addresses should be on separate VLANs.

a. An appropriate naming convention also helps to identify the assigned VLANs,
i.e., PeerNet1 for the primary and PeerNet2 for the secondary.

3. Failover addresses do not need to be public IP addresses, nor do the networks need to
be expansive. A 30-bit mask will suffice for these point-to-point connections.

For example:

4. Additional fault tolerance can be achieved by trunking two interfaces to create the
peering networks, and configuring Link Aggregation Control Protocol (LACP).
Link Aggregation Control Protocol (LACP) Configuration Tips

1. The interfaces that you specify for an assigned trunk must operate at the same media
speed, and must be set at full-duplex mode, and any interface that you assign to a trunk
must be an untagged interface.

2. Active mode (default) will cause the BIG-IP to periodically issue LACP control
packets at the interval specified by the configured timeout value (Short = 1 second, and
Long = 30 seconds).

a. We generally recommend that you leave the mode set to Active. However, if
you set the LACP mode to Passive, do so only on one peer system. If you set both
systems to Passive mode, no control packets will be sent.

3. F5 Networks has enhanced the 802.3ad specification for LACP by adding a Link
Selection Policy option.

a. When you set the link selection policy to Auto, the system then aggregates any
links that have the same media properties and are connected to the same peer as
the reference link.

b. When you set the link selection policy to Maximum Bandwidth, the BIG-IP
system aggregates the subset of member links that provide the maximum amount
of bandwidth to the trunk.
5. It is generally recommended that you allow the default services on your PeerNet(s).
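
As an illustration of the 30-bit mask mentioned in item 3, the following minimal Python sketch lists the two usable addresses in each peering network. The PeerNet1 and PeerNet2 names come from the naming convention above; the 10.10.x.x networks are hypothetical placeholders, not values from this document.

```python
import ipaddress

def peer_addresses(network: str) -> tuple:
    """Return the two usable host addresses of a /30 point-to-point network."""
    net = ipaddress.ip_network(network)
    if net.prefixlen != 30:
        raise ValueError("expected a 30-bit mask for a point-to-point peering network")
    first, second = net.hosts()  # a /30 has exactly two usable hosts
    return str(first), str(second)

# Hypothetical private networks, one per failover VLAN.
peer_nets = {"PeerNet1": "10.10.1.0/30", "PeerNet2": "10.10.2.0/30"}

for vlan, network in peer_nets.items():
    unit1, unit2 = peer_addresses(network)
    print(f"{vlan}: unit 1 = {unit1}, unit 2 = {unit2}")
```

Each /30 leaves exactly one address per unit, which is all a point-to-point failover link needs.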

Serial vs. Network Failover Options

In most circumstances, serial failover can be used to provide fast recovery in the event of
a configured failover trigger. However, if the distance between the BIG-IP LTM units is
greater than 50 feet, you cannot use a serial failover cable. In these circumstances,
Network Failover is your only option.

Network failover uses the network to communicate the status of the units. An internal
timer allows the LTM to tolerate problems that would prevent paired units from
communicating their status to each other for a brief period of time (for example, a small
number of dropped packets or a brief network interruption). The timer is configured via
an internal database key (Failover.NetTimeoutSec), and controls the amount of time a
standby unit will wait before going active if communication from the active unit stops.
a. Failover.NetTimeoutSec = the number of seconds the unit will wait for an
update from the Active unit, measured from its last received update. The default
setting is three seconds (use bigpipe db Failover.NetTimeoutSec <#> to alter the value).

Serial cable failover is based on heartbeat detection, where voltage is continuously sent
from one BIG-IP Controller to another. If one BIG-IP stops responding, failover to the
peer occurs in less than one second, so serial failover provides the fastest recovery
in the event of system failure.

Network failover is also based on heartbeat detection, but instead of using the serial
cable, heartbeat packets are sent over the internal network on port 1028. If one BIG-IP
stops responding, failover to the peer occurs in approximately 5 seconds by default.

If the BIG-IP is configured with both serial cable failover and network-based failover,
then the serial cable signal and the network heartbeat must both fail before the standby
BIG-IP will become active.
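
The rule described above (every configured heartbeat channel must be lost before the standby unit takes over) can be sketched as follows. This is only an illustrative model in Python, not the BIG-IP implementation; the StandbyView type and its field names are hypothetical, and the three-second default is the Failover.NetTimeoutSec default quoted earlier.

```python
from dataclasses import dataclass

@dataclass
class StandbyView:
    """What the standby unit currently observes (hypothetical model)."""
    serial_configured: bool
    serial_signal_present: bool
    network_configured: bool
    seconds_since_last_update: float

def standby_should_go_active(view: StandbyView, net_timeout_sec: float = 3.0) -> bool:
    """Return True only when every configured heartbeat channel has failed."""
    lost = []
    if view.serial_configured:
        lost.append(not view.serial_signal_present)
    if view.network_configured:
        lost.append(view.seconds_since_last_update > net_timeout_sec)
    # With no channel configured there is nothing to trigger a failover.
    return bool(lost) and all(lost)

# Serial signal lost but network heartbeats still arriving: stay in standby.
print(standby_should_go_active(StandbyView(True, False, True, 0.5)))
# Both channels lost: go active.
print(standby_should_go_active(StandbyView(True, False, True, 4.0)))
```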

A potential problem with network failover is that network problems may cause both LTM
units to enter into active mode. To avoid this serious issue, we recommend that if you use
network failover, you crossover one interface on each unit to perform only failover
communications. The self IPs on this interface will require Custom Port Lockdown.
Network Failover and the Management Interface

F5 Networks has never recommended that the management interface be used for peering
(configsync), network failover, or state mirroring functions. While it is not recommended,
note that you can configure the primary and secondary failover addresses to use the
management IP address in BIG-IP versions 9.1.1 and earlier only.

Beginning with BIG-IP versions 9.1.2 and 9.2, however, you can no longer configure the
primary and secondary failover addresses used for peering, network failover, and state
mirroring, to use the IP addresses used for the management interface.

Active/Active vs. Active/Standby

Your choice of redundancy mode can have a great deal of impact on operations, as well
as ongoing LTM management options. In the default Active/Standby mode, LTM unit 1
is in an active state, and unit 2 is in a standby state. With this configuration, failover
causes the following to occur:

a. Unit 2 switches to an active state.

b. Unit 2 begins processing the connections that would normally be processed by its
peer.
Unlike an active/standby configuration, which is designed strictly to ensure no
interruption of service in the event that a BIG-IP LTM system becomes unavailable, an
active/active configuration has an additional benefit. An active/active configuration
allows the two units to simultaneously manage traffic, thereby improving overall
performance. However, there are several caveats to running in this redundancy mode.

A common active-active configuration is one in which each unit processes connections
for different virtual servers. For example, you can configure unit 1 to process traffic for
virtual servers A and B, and configure unit 2 to process traffic for virtual servers C and D.
If unit 1 becomes unavailable, unit 2 begins processing traffic for all four virtual servers.

Here is an active-active configuration, first as it behaves normally, and then after failover
has occurred:
The figure above shows an active-active configuration in which units 1 and 2 are both in
active states. With this configuration, failover causes the following to occur:

a. Unit 2 (already in an active state) begins processing the connections that would
normally be processed by unit 1.

b. Unit 2 continues processing its own connections, in addition to those of unit 1.

When unit 1 becomes available again, you must manually initiate failback (System >
High Availability > Failback), which, in this case, means that the currently-active unit
drops all connections that it is managing on behalf of its peer, and continues to operate in
an active state, processing its own connections. Failback will cause a service impact to
the targeted virtual servers, unless connection mirroring is enabled.

Active/Active and Secure Network Address Translation (SNAT)

Each BIG-IP system in an active-active configuration has a unit ID, either 1 or 2. When
you define a local traffic management object, such as a virtual server, self IP or a SNAT,
you must associate that object with a specific unit of the active-active redundant pair.
When failover occurs, these associations of objects to unit IDs allow the surviving unit to
process connections correctly for itself and the failed unit.

If you do not associate an object with a specific unit ID in an active-active redundant pair,
the redundant system uses 1 as the default unit ID. However, you cannot associate a
default SNAT with a unit ID, because the default SNAT is not compatible with an
active/active system. Instead, you must configure SNAT automap on each individual unit.
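
To make the unit ID association concrete, here is a small Python model of which objects a unit serves before and after failover. It is a sketch of the behavior described above, not BIG-IP code; the object names (vs_A and so on) are hypothetical, and None stands for an object with no explicit unit ID.

```python
DEFAULT_UNIT_ID = 1  # the text states that unassociated objects default to unit 1

def objects_served(objects: dict, unit: int, peer_failed: bool) -> list:
    """List the objects a unit processes, given each object's unit ID."""
    served = []
    for name, unit_id in objects.items():
        owner = DEFAULT_UNIT_ID if unit_id is None else unit_id
        # A unit serves its own objects, plus its peer's objects after failover.
        if owner == unit or peer_failed:
            served.append(name)
    return sorted(served)

objects = {"vs_A": 1, "vs_B": 1, "vs_C": 2, "vs_D": 2, "vs_E": None}

print(objects_served(objects, unit=2, peer_failed=False))  # normal operation
print(objects_served(objects, unit=2, peer_failed=True))   # after unit 1 fails
```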

Converting Between Redundancy Modes

One of the major caveats to running in active/active redundancy mode is that it is
possible to ramp up traffic to over one hundred percent of the capacity of a single BIG-IP
LTM unit. If this is done and a hardware failover occurs, the remaining unit will be
over-subscribed and a service impact will follow. Since F5 Networks recommends
reinstalling the BIG-IP system software from CD-ROM when reconfiguring redundant pair
units, it is very important to consider the redundancy mode in the early phases of your
BIG-IP implementation.
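
The over-subscription caveat is simple arithmetic, sketched below in Python. Loads are expressed as fractions of a single unit's capacity; the sample figures are invented for illustration only.

```python
def survives_failover(unit1_load: float, unit2_load: float, capacity: float = 1.0) -> bool:
    """True if one unit could carry the combined load of both units."""
    return unit1_load + unit2_load <= capacity

# Hypothetical loads: 40% + 50% fits on one unit, 60% + 55% does not.
print(survives_failover(0.40, 0.50))
print(survives_failover(0.60, 0.55))
```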

Due to the large number of settings that must be changed using both the Configuration
utility and the command line when you convert a unit from active-standby to
active/active, or from active-active to active-standby, reloading and reconfiguring both
units is generally an easier and more reliable process than converting from one redundant
mode to another.
Best Practice: Out-of-band remote management of BIG-IP devices

BIG-IP platforms referenced in this document feature SCCP (Switch Card Control
Processor), a separate subsystem that controls the F5 switch hardware. Because SCCP
allows console access to the host subsystem that is equivalent to the serial console
connection, it is now possible to fully manage and troubleshoot BIG-IP devices without
physically connecting to a serial console.

Because almost any administrative and troubleshooting task can be accomplished through
SCCP, F5 highly recommends that customers enable remote network access to the SCCP.
Console access to the BIG-IP host subsystem is needed to perform upgrades while
observing their progress, troubleshoot and recover from system failures, run End-User
Diagnostics, power cycle the host subsystem, and so on.

The SCCP shares the management port with the host subsystem. Therefore, in order to
access SCCP via the network, it must have an IP address on the same subnet as the
management interface in the host subsystem. For instance, if the management interface
of the host subsystem is 192.168.245.10, you should assign 192.168.245.11 to the SCCP
itself. The SCCP cannot share an IP address between itself and the host subsystem. The
instructions below illustrate how to set up IP connectivity on the SCCP and access the
subsystem console remotely.
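
Before working through the procedures, the addressing rule just described (same subnet as the host management interface, but a different address) can be checked with a short Python sketch. The function name is hypothetical; the addresses are the ones used in this section, and the 255.255.255.0 mask is taken from the sample configuration screens.

```python
import ipaddress

def check_sccp_address(mgmt_ip: str, sccp_ip: str, netmask: str) -> list:
    """Return a list of problems with the planned SCCP address (empty if none)."""
    mgmt_if = ipaddress.ip_interface(f"{mgmt_ip}/{netmask}")
    sccp = ipaddress.ip_address(sccp_ip)
    problems = []
    if sccp == mgmt_if.ip:
        problems.append("SCCP cannot share the host management IP address")
    if sccp not in mgmt_if.network:
        problems.append(f"SCCP address is not on the management subnet {mgmt_if.network}")
    return problems

print(check_sccp_address("192.168.245.10", "192.168.245.11", "255.255.255.0"))
print(check_sccp_address("192.168.245.10", "192.168.246.11", "255.255.255.0"))
```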

Setting the SCCP IP address from the serial console

1. Establish a console connection to the BIG-IP platform.

2. Display the SCCP menu by simultaneously pressing and holding the Shift, ESC,
and 9 keys.

3. Type N to select the SCCP network configurator option.

4. Enter the requested information as prompted. Please remember that all settings
below are completely independent from the management port settings on the host
subsystem. An example configuration screen is shown below:

Use DHCP? n
Host name (optional): sccp1.mycompany.com
IP address (required): 192.168.245.11
Network mask (required): 255.255.255.0
Broadcast IP address (optional): 192.168.245.255
Default gateway IP address (optional): 192.168.245.1
Nameserver IP address (optional): 192.168.1.10

5. Test the SCCP login by opening an SSH session to the IP address that you
configured. Access SCCP using the management port and the same login
credentials that you use for the BIG-IP root account.

6. Now you can establish console access from the SCCP to the host subsystem by
typing the following command from the SCCP command prompt:

hostconsh

Setting the SCCP IP address remotely

F5 recommends that customers always set up the SCCP's IP address from the serial console.
However, in rare cases when serial connectivity to the BIG-IP platform is not possible,
you can set up the SCCP IP address through the connectivity to the host subsystem.

1. SSH into the BIG-IP host subsystem as root.

2. Connect through SSH to the SCCP by typing the following command:

ssh sccp

An sccp# prompt will appear similar to the following:

Last login: Mon Jan 01 01:23:45 2006 from host


Welcome to the F5 Networks SCCP!
sccp#

3. Invoke the network configuration utility by typing the following commands:

cd /etc
netconfig

4. Follow the prompts to configure the SCCP IP connectivity. Please
remember that all settings below are completely independent from
the management port settings on the host subsystem. An example
configuration screen is shown below:

SCCP Linux Management Network Configurator

Use DHCP? n
Host name (optional): sccp1.mycompany.com
IP address (required): 192.168.245.11
Network mask (required): 255.255.255.0
Broadcast IP address (optional): 192.168.245.255
Default gateway IP address (optional): 192.168.245.1
Nameserver IP address (optional): 192.168.2.10
Nameserver IP address (optional):

5. After running the netconfig utility, F5 highly recommends that you reboot the
entire platform to verify that the SCCP IP connectivity is properly configured and
remains available after a power cycle of the platform. If it is impossible to
reboot the platform, skip this step and follow step 6. To reboot the platform,
exit the SCCP SSH session back to the BIG-IP host subsystem and
type the following commands:

touch /.sccp_hard_reboot
reboot

6. Please only follow this step if you were unable to perform step 5. Run the
following command to initialize the network interface on the SCCP:
/etc/rc.network

7. Test the SCCP login by opening an SSH session to the IP address that you
configured. Access SCCP using the management port and the same login
credentials that you use for the BIG-IP root account.

8. Now you can establish console access from the SCCP to the host subsystem by
typing the following command from the SCCP command prompt:

hostconsh

MAC Masquerading

For active/standby systems, you can create a media access control (MAC) masquerade
address that can be shared between the BIG-IP units. Doing so can minimize the impact
of a BIG-IP LTM failover event by responding with a consistent MAC address from the
newly active device.

When a BIG-IP becomes active, one of the first things it does is perform an interface
reset, dropping carrier for an instant and then bringing the link back up and sending a
gratuitous ARP with all the virtual IPs for which it is now active. One potential problem
in this instance may be the default switchport configuration of a connected layer 2/3 device.

For instance, by default legacy Cisco 65xx and 55xx ethernet switchports are configured
to perform several possible functions at startup:

a. Spanning-tree holddown
b. Etherchannel auto-negotiation
c. Trunking autonegotiation
d. Speed/Duplex autonegotiation

All of the above happens before the switch passes any traffic. If even one of these
functions is enabled (they are all on by default), the imposed delay might be enough to
drop the gratuitous ARP which announces the new active BIG-IP.

In modern versions of CatOS there is an option to turn on spanning tree "portfast", and
turn off etherchannel and trunking negotiation, potentially eliminating the delay at port
initialization and, thus, the need for MAC masquerading.
When you configure MAC masquerade, you must use a MAC address that is unique to
your network. If it is not, a conflict will result. The only way to guarantee a unique MAC
address is to register as a vendor with the IEEE; however, you can easily find MAC
address ranges that are unused.

F5 Networks recommends that you construct a MAC masquerade address using the
vendor code 40:00:00. This code does not appear to have been used by a vendor and F5
Networks is not aware of any cases where use of this vendor code resulted in a conflict.
However, several other vendor code options exist.

To construct the full MAC masquerade address, use the existing serial number portion of
the pre-assigned MAC address in combination with the false vendor code.

For example:

If the unit's pre-assigned MAC address is 00:01:D7:01:02:03, you would change the
vendor code and keep the serial number, producing the following address:

40:01:d7:01:02:03

This method produces a unique MAC address for a MAC masquerade.
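
The worked example above can be reproduced with a few lines of Python. Note that in that example only the first octet of the pre-assigned address changes (00 becomes 40), so this sketch does exactly that; the function name is hypothetical and this is not an F5 tool.

```python
def masquerade_mac(preassigned: str, first_octet: int = 0x40) -> str:
    """Build a MAC masquerade address by replacing the first octet.

    Follows the worked example in the text: 00:01:D7:01:02:03 -> 40:01:d7:01:02:03.
    """
    octets = preassigned.lower().split(":")
    if len(octets) != 6 or any(len(o) != 2 for o in octets):
        raise ValueError(f"not a colon-separated MAC address: {preassigned!r}")
    for octet in octets:
        int(octet, 16)  # raises ValueError if an octet is not hexadecimal
    octets[0] = f"{first_octet:02x}"
    return ":".join(octets)

print(masquerade_mac("00:01:D7:01:02:03"))
```

The same masquerade address would be configured on both units of the pair, so the choice of which unit's pre-assigned address to start from is up to the administrator.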

In some networks, enabling MAC masquerading will not be an option. If so, we
recommend that you increase the input buffer size or allow ARP updates to occur on the
connected devices. The procedure for changing this option varies between network
devices, so consult the manufacturer's documentation for your specific device for
information about altering this behavior.

For instance, with Cisco IOS, you can change the size of the input hold queue by using
the hold-queue configuration command on the interface that is attached to the network
with the BIG-IP LTM. For example: hold-queue 400 in

System Fail-Safe

System services have heartbeats. The BIG-IP system continually monitors service
heartbeats to determine whether the service is still running. For some services, if the
system does not detect a heartbeat, the system takes some action with respect to failover.
These services are:

MCPD (messaging and configuration)
TMM (traffic management)
BIGD (health monitors)
SOD (failover)
BCM56XXD (switch hardware driver)

In general, the default behaviors are recommended, since restarting a critical service such
as BIGD or BCM56XXD is sufficient and a system failover event is not required.

The Traffic Management Microkernel (TMM) service, however, is the process running on
the BIG-IP system that performs most traffic management for the product. As such, the
TMM service supports all system and networking components that the BIG-IP system
needs in order to process application and administrative traffic. It controls all system
interfaces except for the management interface, so if the heartbeat fails, it is
appropriate to both fail over and restart the service.

When the TMM service is running, make sure that you have defined a default route on
the BIG-IP LTM. Defining a default route prevents the high volumes of administrative
traffic generated by the BIG-IP system from using the management interface (except for
NTP), which is limited to 100 Mbps of throughput.

VLAN Fail-Safe

When you configure VLAN failsafe for a VLAN, the BIG-IP system monitors network
traffic on the VLAN. If the BIG-IP system detects a loss of network traffic on the VLAN,
the BIG-IP will attempt to generate VLAN failsafe traffic to nodes or the default router
accessible through the VLAN in the following manner:

After half of the VLAN failsafe timeout value has elapsed, the following actions
occur: an ARP request for the IP address of the oldest entry in the BIG-IP ARP table
is initiated, and an ICMPv6 neighbor discovery probe (only if entries exist in the
BIG-IP IPv6 neighbor cache) is initiated.

After 3/4 of the VLAN failsafe timeout value has elapsed, the following actions occur:
an ARP request for all IP addresses in the BIG-IP ARP table is initiated, an ICMPv6
neighbor discovery probe (only if entries exist in the BIG-IP IPv6 neighbor cache) is
initiated, and an ICMP echo request to 224.0.0.1 (multicast ping) is initiated.

The failover action is avoided if the BIG-IP system receives a response to the VLAN
failsafe traffic it generated.
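
The probe timing described above can be laid out as a simple schedule. The Python sketch below only models when each stage happens relative to the loss of traffic; it is not BIG-IP code, the function name is hypothetical, and the 90-second timeout is an illustrative value rather than a figure taken from this document.

```python
def failsafe_probe_schedule(timeout_sec: float) -> list:
    """Return (seconds after traffic loss, actions) for each VLAN failsafe stage."""
    return [
        (timeout_sec / 2, [
            "ARP request for the oldest ARP table entry",
            "ICMPv6 neighbor discovery probe (if neighbor cache has entries)",
        ]),
        (timeout_sec * 3 / 4, [
            "ARP request for all ARP table entries",
            "ICMPv6 neighbor discovery probe (if neighbor cache has entries)",
            "ICMP echo request to 224.0.0.1",
        ]),
        (timeout_sec, ["failover action, unless a response was received"]),
    ]

for offset, actions in failsafe_probe_schedule(90):  # 90 is illustrative only
    print(f"t+{offset:g}s:")
    for action in actions:
        print(f"  - {action}")
```

A short timeout compresses these stages, leaving little time for a response before the failover action, which is one way to read the recommendation to keep the default.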

F5 Networks strongly recommends using the default VLAN failsafe timeout. Setting the
timeout too low can cause system stability issues such as daemons restarting.

Beginning with version 9.2.5, you can configure the BIG-IP system to reset the timeout
counter when it receives any frame on the VLAN. However, be aware that 9.2.5 has a
known issue (fixed in 9.3), wherein certain types of outbound monitor traffic, such as
traffic from a gateway_icmp monitor, can cause the BIG-IP LTM system to reset the
VLAN failsafe timer, preventing failover from occurring.

To configure the BIG-IP system to reset the timeout counter when it receives any frame
on the VLAN, use the following bigpipe command:

b db Failover.VlanFailsafe.ResetTimerOnAnyFrame=true

Gateway Fail-Safe

Gateway failsafe allows you to configure redundancy between a failover pair of BIG-IP
systems that point to different gateways. The gateway failsafe option allows each BIG-IP
system to monitor the upstream gateway to which it is connected. If the gateway is
marked as DOWN, the BIG-IP system can fail over to its partner system to prevent further
disruption to traffic.

This functionality makes use of gateway pools for monitoring the upstream device.
Therefore, the gateway pools must be assigned to their corresponding units (via the Unit
ID), since the pool information will be synchronized between the systems.

In most environments, gateway failsafe is not required since the upstream device is
normally configured for high availability as well. Because of this, the gateway address is
normally an HSRP or VRRP address which will survive the failure of the gateway device.

An easy way to determine if this is the case is to look at the MAC address of the gateway:

VRRP routers use a common MAC address of the format 00:00:5e:00:01:xx. The last
octet is the VRID or the VRRP virtual router or group identifier, which provides for 255
virtual routers in a network.
HSRP routers also use a common MAC address on all media which supports HSRP
(except token ring). The HSRP format is 00:00:0c:07:ac:xx (the last octet is the HSRP
group number).
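
The MAC address check described above is easy to automate. The following Python sketch classifies a gateway MAC address using only the two formats given in the text; the function name is hypothetical and the sample addresses are made up.

```python
VRRP_PREFIX = [0x00, 0x00, 0x5E, 0x00, 0x01]  # 00:00:5e:00:01:xx, xx = VRID
HSRP_PREFIX = [0x00, 0x00, 0x0C, 0x07, 0xAC]  # 00:00:0c:07:ac:xx, xx = group number

def classify_gateway_mac(mac: str) -> tuple:
    """Return (protocol, id) where protocol is 'VRRP', 'HSRP', or 'unknown'."""
    octets = [int(part, 16) for part in mac.replace("-", ":").split(":")]
    if len(octets) != 6 or any(not 0 <= o <= 0xFF for o in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    if octets[:5] == VRRP_PREFIX:
        return ("VRRP", octets[5])
    if octets[:5] == HSRP_PREFIX:
        return ("HSRP", octets[5])
    return ("unknown", None)

print(classify_gateway_mac("00:00:5e:00:01:0a"))
print(classify_gateway_mac("00:00:0C:07:AC:01"))
print(classify_gateway_mac("00:01:d7:01:02:03"))
```

An "unknown" result does not prove the gateway is a single device; it only means the address does not match either of these two well-known formats.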

Spanning Tree (STP) and network failover considerations

When using network failover with the Spanning Tree Protocol (STP), the blocking
behavior of neighboring switches will prevent the network failover heartbeat packets
from reaching the BIG-IP peer system. As a result, both BIG-IP systems enter an active
state, and the Link Down Time on Failover timer can exacerbate this issue.

For network failover to function properly, the BIG-IP LTM systems must be able to
transmit high availability pair state information to each other. When a switch connected
to the BIG-IP LTM has its STP port state in learning mode, the port will not forward
frames, and the high availability pair traffic will be discarded. As a result, each BIG-IP
LTM system will think its peer is down. When this occurs, the network failover timeout
period may expire before the switch port transitions to forwarding mode, causing the
BIG-IP LTM systems to enter active/active mode. Once the ports transition to forwarding
state, one BIG-IP LTM should demote itself to standby mode and normal operation will
resume.

However, if the Link Down Time on Failover timer is set to a non-zero value, the
BIG-IP LTM system that demoted itself to standby mode will drop its link when the switch
port transitions to a forwarding state. As a result, the STP forward delay period (the time
the port spends in the listening and learning states) is restarted and the issue cycles
indefinitely.

You can alleviate this issue by performing one of the following configurations, which are
listed in their recommended order:

1. Use RSTP instead of STP. RSTP converges much faster than STP, so the port will
not remain in a blocking state for as long as it would under STP.

2. Do not use the Link Down Time on Failover timer. This setting is only required
when the switches neighboring the BIG-IP LTM system fail to update their
FDB/CAM tables when they receive traffic on a different port. The Link Down Time
on Failover timer is rarely required.
3. Increase the network failover timeout to more than 15 seconds (the default STP
forward delay timer). Padding the value by several seconds is recommended.
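
For option 3, the arithmetic can be sketched in Python as follows. The 15-second forward delay is the default quoted above, the 5-second padding is an assumed value for "several seconds", and the command string simply follows the bigpipe db Failover.NetTimeoutSec <#> form shown earlier in this document; the function names are hypothetical.

```python
STP_FORWARD_DELAY_SEC = 15  # default STP forward delay quoted in the text

def recommended_net_timeout(forward_delay_sec: int = STP_FORWARD_DELAY_SEC,
                            padding_sec: int = 5) -> int:
    """Network failover timeout that exceeds the forward delay by some padding."""
    if padding_sec <= 0:
        raise ValueError("padding must be positive so the timeout exceeds the delay")
    return forward_delay_sec + padding_sec

def bigpipe_timeout_command(timeout_sec: int) -> str:
    """Format the db key change using the command form given in this document."""
    return f"bigpipe db Failover.NetTimeoutSec {timeout_sec}"

timeout = recommended_net_timeout()
print(bigpipe_timeout_command(timeout))
```

A longer timeout also delays a legitimate network failover by the same amount, which is presumably why this option is listed last.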
