A common modern configuration for an open source layer 4 (TCP/UDP) load balancer on the Linux platform is keepalived backed by ipvsadm, especially in direct routing mode. This configuration provides built-in VIP failover through VRRP and the speed advantages of direct routing. We have set up multiple configurations of this environment for customers, primarily for syslog load balancing.
Keepalived and ipvsadm have many methods for distributing incoming connections (‘scheduling’). The three most common are Least Connection and Round Robin, both with weighted variants, and hashing based on either the source or destination address. The default is Weighted Least Connection.
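For example, a virtual service and its scheduler can be created by hand with ipvsadm (a sketch using the addresses from the pseudo-configuration below; in practice keepalived manages these entries for you):

  # Add a UDP virtual service using weighted least connection scheduling
  ipvsadm -A -u 192.168.0.10:514 -s wlc
  # Attach a real server in direct routing (gatewaying) mode with weight 1
  ipvsadm -a -u 192.168.0.10:514 -r 192.168.0.101:514 -g -w 1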
It is not uncommon for a single load balancer or cluster to handle multiple VIP addresses. For example, in a common syslog configuration there will be one VIP per environment, so that different backend processing can be applied on the same (or different) syslog servers.
There is a problem with ipvsadm that occurs at the intersection of scheduling method and VIP address selection. It is a known but poorly documented bug that causes the load balancer to continue sending traffic to a real server after that server has failed, with no external indication that traffic is still being forwarded. More specifically, the failure occurs when multiple VIP addresses on the same load balancer serve the same port and use a least connection scheduling method. Connections do not have to be configured as persistent for this to happen, although the behavior is also seen in persistent configurations.
Example pseudo-configuration:
VIP: 192.168.0.10:514 (udp)
Distribution: wlc (Weighted Least Connection)
Real Servers: 192.168.0.101, 192.168.0.102, 192.168.0.103

VIP: 192.168.0.11:514 (udp)
Distribution: wlc (Weighted Least Connection)
Real Servers: 192.168.0.101, 192.168.0.102, 192.168.0.103
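Expressed as keepalived configuration, the first virtual server would look roughly like the following sketch. Here lb_kind DR matches the direct routing mode described above, ops enables one-packet scheduling for UDP (visible as the ‘ops’ flag in the output below), and check_syslog.sh is a hypothetical health check script standing in for whatever check you actually use:

  virtual_server 192.168.0.10 514 {
      delay_loop 10
      lb_algo wlc
      lb_kind DR
      protocol UDP
      ops

      real_server 192.168.0.101 514 {
          weight 1
          MISC_CHECK {
              misc_path "/usr/local/bin/check_syslog.sh 192.168.0.101"
          }
      }
      # real_server blocks for .102 and .103 go here
  }
  # a second virtual_server block for 192.168.0.11 514 follows the same pattern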
Normal ipvsadm --list -n:
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  192.168.0.10:514 wlc ops
  -> 192.168.0.101:514            Route   1      0          0
  -> 192.168.0.102:514            Route   1      0          0
  -> 192.168.0.103:514            Route   1      0          0
UDP  192.168.0.11:514 wlc ops
  -> 192.168.0.101:514            Route   1      0          0
  -> 192.168.0.102:514            Route   1      0          0
  -> 192.168.0.103:514            Route   1      0          0
In this sample configuration there are two VIPs using weighted least connection scheduling to balance load across three real servers. If all three real servers are functioning normally, no issues occur. However, if any server fails (we will use .101), the problem appears. Once the real server stops responding to health checks, keepalived removes it from the pool and the ipvsadm output reflects the expected state.
Failed ipvsadm --list -n:
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  192.168.0.10:514 wlc ops
  -> 192.168.0.102:514            Route   1      0          0
  -> 192.168.0.103:514            Route   1      0          0
UDP  192.168.0.11:514 wlc ops
  -> 192.168.0.102:514            Route   1      0          0
  -> 192.168.0.103:514            Route   1      0          0
But at the kernel level, traffic from sources that were already connected to the failed .101 server will still be routed to that server, with no external indication short of examining tcpdump output. This traffic will continue to be routed to the failed server until the kernel garbage collection timeout for the cached routes (net.ipv4.route.gc_timeout) expires, which is 300 seconds by default. Only once that timeout is reached are the connections/routes expired and finally rerouted to a valid real server. Unfortunately, by that point there will be five minutes’ worth of lost data sent through the existing failed connection, resulting in missing logs from whatever syslog sources were connected to that server before it failed.
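You can observe this mismatch directly by comparing the kernel’s connection table with the service table (a sketch; eth0 and the exact entries will vary with your environment):

  # Current route cache garbage collection timeout, in seconds
  sysctl net.ipv4.route.gc_timeout
  # Dump the kernel's connection entries; stale ones still point at .101
  ipvsadm -L -n -c
  # Confirm packets are still being forwarded to the failed real server
  tcpdump -ni eth0 host 192.168.0.101 and port 514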
Since this failure mode is specific to the least connection scheduling methods, the solution is quite simple: switch to another scheduling method. For most configurations using Least Connection or Weighted Least Connection, the best substitute will be Round Robin or Weighted Round Robin. The round robin scheduling methods are not affected by this bug.
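In keepalived this is a one-line change per virtual_server block, and the same change can be applied to a running table with ipvsadm (sketched against the hypothetical configuration above):

  virtual_server 192.168.0.10 514 {
      lb_algo rr    # was: lb_algo wlc
      ...
  }

  # Or edit the live virtual service in place
  ipvsadm -E -u 192.168.0.10:514 -s rr

Weighted round robin (lb_algo wrr) is the drop-in choice if your real servers carry unequal weights.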
For smaller organizations or groups that don’t have the budget for dedicated hardware load balancers, open source software load balancers are still a great option. They just require some additional awareness of the nuances, poorly documented bugs, and pitfalls that lie along the way.
Would you like some assistance with your existing open source load balancer? Do you need help configuring a new one? Please feel free to contact us and we will be happy to help you out.
Doug Bell
Senior Infrastructure Consultant