Using haproxy as a PSC load balancer


When designing a vSphere 6.0 environment with multiple vCenter servers you will - in most cases - end up needing to deploy external Platform Services Controllers (PSCs). If you are unsure which topology to choose, take a look at the PSC Topology Decision Tree that VMware recently published. It will guide you to the topology that best suits your requirements.

Since the PSC hosts the critical Single Sign-On (SSO) component, a specific requirement is to make an external PSC highly available, so that you are still able to log on to vCenter even if one PSC fails. Currently the only supported way to implement a seamless automatic failover from a failing PSC to another one is to put multiple PSCs (of the same SSO domain and site) behind a load balancer. The process of properly configuring the load balancer and the vCenter servers behind it is quite complex, so most people refrain from it and just deploy a secondary PSC to which they manually re-point the vCenter servers if the primary one fails (as per KB2113917). But this is a manual process (although it can of course be automated, as William Lam explained in this post), and it requires a restart of all vCenter services, during which vCenter will be unavailable.

This is why I wanted to try out in the lab how complicated it really is to implement load balanced PSCs and how well they work. However, I did not have a supported load balancer available in the lab - currently only the Citrix NetScaler, the F5 BIG-IP and VMware's own NSX-v are officially supported for vSphere 6.0. All of these are quite expensive options, and none of them are quick and easy to deploy. So I decided to try my luck with the standard Open Source load balancer: haproxy. It turned out that this works very well and can be implemented quite quickly. Here is how:

The following guide assumes that you have two external PSCs already deployed in the same SSO domain and site. Follow the installation instructions in the vSphere docs to do this. There is also a nice walk-through available.


1. Deploy a Linux VM

To get started you need to deploy a Linux VM for installing haproxy. At first I searched for ready-made virtual appliances that include haproxy, but did not really find one. As we will see later haproxy is very easy to install, so I decided to just quickly deploy a minimal Debian Linux box for this purpose:
  • Create a new VM with the Guest OS Debian GNU/Linux 8 (64-bit) and minimal resources (1 vCPU, 1 GB RAM, 2 GB hard disk)
  • Download the Debian network installation CD for the amd64 architecture and boot your new VM from it to install Debian
  • For the software to install choose only the pre-selected SSH server and standard system utilities options
  • Configure networking appropriately and make sure that you have both a DNS and DNS PTR (Reverse lookup) entry available for the machine name
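You can quickly verify the forward and reverse DNS entries from any Linux shell. Here is a small hypothetical helper script for this - the FQDN haproxy.lab.local is just an example, replace it with your own machine name:

```shell
# check_dns NAME: verify that a hostname has a DNS entry and that the
# resolved IP address maps back to a name (reverse/PTR lookup).
check_dns() {
    name="$1"
    # Forward lookup: name -> first IP address
    ip=$(getent hosts "$name" | awk '{print $1; exit}')
    [ -n "$ip" ] || { echo "no DNS entry for $name"; return 1; }
    # Reverse lookup: IP -> name
    rname=$(getent hosts "$ip" | awk '{print $2; exit}')
    echo "$name -> $ip -> ${rname:-<no PTR entry>}"
}

# Example invocation (haproxy.lab.local is a placeholder):
# check_dns haproxy.lab.local
```

If the last line prints "<no PTR entry>" then fix your reverse lookup zone before continuing.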

2. Install the software

Log in to the haproxy server via ssh. By default you cannot remotely log in as the root user - log in as the standard user (that you created during the Debian installation) instead and switch to the root user by running

   su -l

Then install the haproxy software. Although it is not necessary for running haproxy, I also recommend installing VMware Tools. You can install both packages from the Debian repositories by running

  apt-get install open-vm-tools haproxy

That's it ... yes, really! Sometimes I wish that installing software on Windows would be as easy as this ...


3. Prepare the first PSC

Now it's time to carefully look at KB2113315, which describes how to configure VCSA based external PSCs for high availability. If you are using Windows based external PSCs then you need to look at KB2113085 instead. I cannot really think of a good reason to use Windows for external PSCs, so I will focus solely on VCSA based PSCs in the following steps.

I will also assume that you are using the default self-signed certificates for your environment and have not replaced them with custom ones. This will make the process much easier ... you will then only need to run through steps 1, 2, 5 and 8 of section B to prepare the first PSC:

Step 1:
  • Download the zip file VMware-psc-ha-6.0.0-2924684.zip that is attached to the KB article to your workstation.
Step 2:
  • Log in to the first PSC via ssh and create the directory /ha by running
       mkdir /ha
  • Copy the downloaded zip file to the first PSC using an scp tool like WinSCP. This is a real challenge ... I suggest that you look at my VCSA 6.0 tricks post to learn how to use WinSCP to exchange files with the VCSA/PSC.
  • After you have copied the zip file into the folder /ha unzip it with the commands
      cd /ha; unzip VMware-psc-ha-6.0.0-2924684.zip
Step 5:

Run these commands to generate the certificates needed for the load balancer:

 cd /ha
 python gen-lb-cert.py --primary-node --lb-fqdn=load_balanced_fqdn


Replace load_balanced_fqdn with the FQDN of the haproxy machine (e.g. something like haproxy.lab.local).
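After the script has finished you can verify that the generated certificate was really issued for the load balancer name by looking at its subject with openssl. This is just a small sketch (the helper name is mine, and /ha/lb.crt is where the script's output should end up):

```shell
# cert_subject FILE: print the subject of a PEM certificate, e.g. to check
# that the CN of the load balancer certificate matches the haproxy FQDN.
cert_subject() {
    openssl x509 -in "$1" -noout -subject
}

# On the first PSC after running gen-lb-cert.py:
# cert_subject /ha/lb.crt
```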

Step 8:

Copy the generated certificate and key files to the directory /ha/keys. As advised in the KB article, run

 mkdir /ha/keys
 cp /etc/vmware-sso/keys/* /ha/keys



4. Prepare the second PSC

Now it's time to run through section C of the KB article and prepare the second PSC. For this you need to transfer the complete directory /ha from the first PSC node to the second. There are probably many ways to do this ... here is what I did:
  • Zip the files on the first PSC by running
     cd /; zip -r ha.zip /ha
  • Use WinSCP to download the file /ha.zip to your workstation and from there upload it to the root directory of the second PSC.
  • On the second PSC unzip the file again by running
      cd /; unzip ha.zip
On the second PSC change to the created /ha directory now:

cd /ha

and run this command:

python gen-lb-cert.py --secondary-node --lb-fqdn=load_balanced_fqdn --lb-cert-folder=/ha --sso-serversign-folder=/ha/keys

Please note: This is one line. Again replace load_balanced_fqdn with the FQDN of the haproxy machine.


5. Configure haproxy

Section D of KB2113315 asks you to configure a compatible load balancer and points you to the relevant articles for the F5 BIG-IP and Citrix NetScaler.

We will configure haproxy instead ...

Extract the file ha.zip that you downloaded from the first PSC on your workstation in Step 4. We need the files lb.crt (the load balancer certificate) and lb-rsa.key (the corresponding unencrypted private key).

Log in to the haproxy VM, become root, and
  • change to the directory /etc/haproxy and save a copy of the original haproxy configuration file:
      cd /etc/haproxy; mv haproxy.cfg haproxy-orig.cfg
  • Create a new file named psc-frontend-443.pem using a text editor (like vi) and use copy-and-paste from your workstation to copy the contents of lb.crt followed by the contents of lb-rsa.key into it.
      vi /etc/haproxy/psc-frontend-443.pem

    The result should look like this:
Certificate/key file for haproxy (contents shortened with "...")
  • Create a new haproxy configuration file with
      vi /etc/haproxy/haproxy.cfg
    and copy the following contents into it
global
        log /dev/log    local0
        log /dev/log    local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin
        stats timeout 30s
        user haproxy
        group haproxy
        daemon

        # Default SSL material locations
        ca-base /etc/ssl/certs
        crt-base /etc/ssl/private

        # Default ciphers to use on SSL-enabled listening sockets.
        # For more information, see ciphers(1SSL). This list is from:
        #  https://hynek.me/articles/hardening-your-web-servers-ssl-ciphers/
        ssl-default-bind-ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS
        ssl-default-bind-options no-sslv3

defaults
        log     global
        mode    http
        option  httplog
        option  dontlognull
        timeout connect 5s
        timeout client  10s
        timeout server  10s
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

        retries 1

frontend psc-frontend-443
        bind :443 ssl crt /etc/haproxy/psc-frontend-443.pem
        mode http
        option http-keep-alive
        default_backend psc-backend-443

frontend psc-frontend-389
        bind :389
        mode tcp
        default_backend psc-backend-389

frontend psc-frontend-636
        bind :636
        mode tcp
        default_backend psc-backend-636

frontend psc-frontend-2012
        bind :2012
        mode tcp
        default_backend psc-backend-2012

frontend psc-frontend-2020
        bind :2020
        mode tcp
        default_backend psc-backend-2020

frontend psc-frontend-2014
        bind :2014
        mode tcp
        default_backend psc-backend-2014

backend psc-backend-443
        mode http
        stats enable
        stats uri /haproxy?stats
        stats realm haproxystats
        stats auth admin:admin
        stats refresh 5
        stats admin if TRUE
        option httpchk OPTIONS /websso/
        server psc001 192.168.40.80:443 ssl check inter 1000 verify none
        server psc002 192.168.40.81:443 ssl check inter 1000 backup verify none

backend psc-backend-389
        mode tcp
        server psc001 192.168.40.80:389 check inter 1000
        server psc002 192.168.40.81:389 check inter 1000 backup

backend psc-backend-636
        mode tcp
        server psc001 192.168.40.80:636 check inter 1000
        server psc002 192.168.40.81:636 check inter 1000 backup

backend psc-backend-2012
        mode tcp
        server psc001 192.168.40.80:2012 check inter 1000
        server psc002 192.168.40.81:2012 check inter 1000 backup

backend psc-backend-2020
        mode tcp
        server psc001 192.168.40.80:2020 check inter 1000
        server psc002 192.168.40.81:2020 check inter 1000 backup

backend psc-backend-2014
        mode tcp
        server psc001 192.168.40.80:2014 check inter 1000
        server psc002 192.168.40.81:2014 check inter 1000 backup


  • Finally restart the haproxy service by running
      service haproxy restart
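By the way, instead of copy-and-pasting the certificate and key you can also concatenate them directly on the haproxy VM, and it is a good idea to let haproxy syntax-check the configuration before restarting it. A small sketch (the /tmp paths are just examples for where you copied lb.crt and lb-rsa.key to):

```shell
# make_pem CERT KEY OUT: concatenate certificate and private key into the
# single PEM file that haproxy expects for "bind ... ssl crt ...".
make_pem() {
    cat "$1" "$2" > "$3" && chmod 600 "$3"
}

# Usage on the haproxy VM (paths are examples):
# make_pem /tmp/lb.crt /tmp/lb-rsa.key /etc/haproxy/psc-frontend-443.pem
# haproxy -c -f /etc/haproxy/haproxy.cfg && service haproxy restart
```

The `haproxy -c -f` call only checks the configuration file for validity, so a typo will no longer take the load balancer down on restart.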

So what the hell does all this mean? I will try to give you a basic understanding of how haproxy works and of the configuration directives it uses. To get the full picture and for possible adaptations you will of course need to refer to the detailed haproxy documentation.

I took the global and defaults sections from the original haproxy.cfg file that was installed with the package, adapted them slightly, and added a bunch of frontend and corresponding backend definitions.

What is a frontend?

A frontend is a port that haproxy listens on for incoming requests. Towards the vCenter servers haproxy needs to behave exactly like a regular PSC, so it needs to listen on all the ports that a normal PSC listens on.

On a frontend haproxy can forward plain tcp connections (mode tcp), but it can also act as an http(s) proxy (mode http): for the psc-frontend-443 it uses the load balancer certificate and private key to decrypt incoming https requests and forward them to the real PSCs. All other frontends just relay tcp connections for the ports 389, 636, 2012, 2014 and 2020.

All frontends forward incoming traffic to backends. In our case we have defined a single corresponding default_backend for every frontend.

What is a backend?

A backend is a group of servers (and network ports) to which the frontend traffic is forwarded. In our example each backend includes our two PSCs with the IP addresses 192.168.40.80 and 192.168.40.81. These IP addresses are the only thing that you need to adapt to make the configuration file work for your own environment!

For the second PSC we use the keyword backup to tell haproxy that it should always use the primary PSC (psc001) and fail over to the secondary (psc002, backup) only when the primary one fails.

How does haproxy detect server failures?

For tcp backends haproxy will by default try to connect to each of the servers on the configured port. By adding the keywords check inter 1000 I tell it to do this check every second (= 1000 ms). If the connect fails because the server (or only the service listening on that port) is down, then haproxy will mark the server as failed (only for this backend, not globally!).

In the case of the http backend I used the option httpchk, which tells haproxy to check the availability of the service with an http command. In our example I have configured it to use not an expensive full HTTP GET command, but a lightweight HTTP OPTIONS command (on the URL /websso/) for the check. The option ssl in the server lines instructs haproxy to use https (instead of plain http) for the check, and because of the verify none parameter it does not do any certificate checks on the target servers.

How can you monitor what haproxy is doing?

haproxy has a nice built-in web based dashboard that displays the status of the frontends, backends and backend servers as well as some traffic and connection statistics. In our configuration file it is enabled through the stats directives in the psc-backend-443 definition. There we define that the dashboard shall be accessible through the special URL /haproxy?stats and that it requires a login with the username admin and the password admin. Later in this post I will provide an example of what the dashboard looks like.
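The stats page can also be queried from scripts: appending ";csv" to the stats URL makes haproxy return the statistics in CSV format. Here is a small sketch that extracts the per-server status from that output (the FQDN is a placeholder, and admin:admin is the login from the configuration above):

```shell
# psc_status: read haproxy stats CSV on stdin and print the status of each
# backend server (CSV field 1 = proxy name, 2 = server name, 18 = status).
psc_status() {
    awk -F, 'NR > 1 && $2 != "FRONTEND" && $2 != "BACKEND" { print $1, $2, $18 }'
}

# Example (placeholder FQDN; -k skips certificate verification):
# curl -sk -u admin:admin "https://haproxy.lab.local/haproxy?stats;csv" | psc_status
```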


6. Update the Endpoint URLs to point to the load balancer

But watch out, we are not done yet with setting up our load balanced PSCs! We now need to run through section E of KB2113315 and update some URLs (that are stored in the PSCs' internal LDAP database) to point to the load balancer.

Connect to the first PSC, change to the /ha directory...

cd /ha

and run the command

python lstoolHA.py --hostname=psc_1_fqdn --lb-fqdn=load_balanced_fqdn --lb-cert-folder=/ha --user=Administrator@vsphere.local

Please note: This is one line! Replace psc_1_fqdn with the FQDN of the first PSC and load_balanced_fqdn with the FQDN of the haproxy server.

The command will prompt you for the password of the local SSO admin Administrator@vsphere.local. Be sure to provide it correctly: the command will happily run from beginning to end even with a wrong password and will then spit out some error messages that are easy to miss.
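To avoid missing those error messages you can capture the script's output and scan it afterwards. This is only a sketch with a hypothetical helper function - the log path is an example:

```shell
# check_log FILE: scan a captured command log for error indications.
# A wrong SSO password lets lstoolHA.py run to completion, so this is a
# cheap safety net to catch the error messages it prints along the way.
check_log() {
    if grep -qiE 'error|exception|fail' "$1"; then
        echo "errors found in $1 - re-check the SSO admin password!"
        return 1
    fi
    echo "no errors found in $1"
}

# Usage (placeholders as described above):
# python lstoolHA.py --hostname=psc_1_fqdn --lb-fqdn=load_balanced_fqdn \
#   --lb-cert-folder=/ha --user=Administrator@vsphere.local 2>&1 | tee /tmp/lstoolHA.log
# check_log /tmp/lstoolHA.log
```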


7. Re-point existing vCenter servers to the load balancer

The setup of the PSCs and load balancer is now complete. As mentioned in section F of KB2113315 you can now install a new vCenter server and point it to the load balancer for SSO configuration.

If you already have vCenter servers deployed that use one of the load balanced PSCs directly then you need to re-point them to the load balancer. Fortunately this is a straightforward process, see KB2113917 for instructions.


This is unsupported. Can we be sure that it works well though?

Well, we can certainly just test it ... Of course I did this in the lab and found that the vCenter servers using the load balancer functioned well and did not show any issues (compared to when they were using a real PSC directly). When I powered off the primary PSC, haproxy immediately detected this and switched over to the second PSC. This is what the haproxy monitoring dashboard looks like in this situation:

haproxy dashboard legend

haproxy status: active DOWN, backup UP
In vCenter I did not notice any interruptions - I could log in to new Web Client sessions without issue, and even sessions that were already running appeared to be unaffected or only needed a reload.

After I rebooted the first PSC and it came online again, haproxy immediately stopped using the backup PSC and routed all connections to the primary one again.

Great test results! But there are also some theoretical arguments that our setup is a good choice. In KB2112736, where the supported load balancers are listed, you will also find a list of requirements that a supported load balancer must fulfill. For vSphere 6.0 these are the following:
  • Ability to load balance multiple TCP ports
  • Ability to load balance multiple HTTP/S ports
Clearly haproxy meets these requirements!
  • Ability to have session affinity to a single node (i.e. Sticky Sessions)
haproxy supports sticky sessions, but in our example we do not use them. They are really not necessary here, because we only have two PSCs, and at any given time only one of them is used. So all connections will always stick to the one PSC that is in use.
  • Ability to have session affinity to the same PSC node across all configured ports (i.e. Match Across Services)
Well, this is really a weak point when using haproxy. There is no straightforward way to group backends in such a way that - if one service fails on a server - the server is also no longer used for any of the other backends' services. So when for some reason only the web service dies on the first PSC, haproxy will forward web requests to the second PSC, but continue to use the first PSC for all the tcp connections. This is, however, only a drawback of our example configuration, which I intentionally kept simple.

You can approximate the right behavior by checking the same service (e.g. the https port) for all backends: haproxy allows you to define server checks in a very flexible way, so you can also have it check the https service for all the tcp ports and declare e.g. the LDAP service on port 389 failed if the web service is not answering. But this would also mean that the tcp services are not checked at all anymore, and a failure e.g. on port 389 would not trigger a failover to the second PSC.

I have not really found a good and simple solution to this problem. If you are experienced in configuring haproxy and have a good idea then please chime in and post a comment!
  • Ability to control request timeout intervals
This is an easy requirement again, and haproxy can handle it. By the way, I have defined common connection timeouts for all frontends and backends with the timeout directives in the defaults section of the configuration file.


Advanced haproxy features

I already mentioned the dashboard, sticky sessions and flexible server checks. You can also have haproxy send out e-mail alerts when a server fails.
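Note that the e-mail alerting feature requires a newer haproxy version (1.6 or later) than the one shipped with Debian 8. The configuration would look roughly like the following sketch - the SMTP server and the addresses are of course just examples:

```
# Sketch only (requires haproxy 1.6+; addresses are examples)
mailers psc-mailers
        mailer smtp1 192.168.40.10:25

backend psc-backend-443
        # ... existing backend configuration ...
        email-alert mailers psc-mailers
        email-alert from haproxy@lab.local
        email-alert to admin@lab.local
        email-alert level notice
```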

Another feature that I found very useful is that you can manually mark servers as down (or put them in maintenance mode) through the dashboard. That comes in handy for planned maintenance, e.g. software updates, on a PSC node.

Additionally I want to point out that you can also cluster multiple haproxy servers to form a highly available load balancer. What good is a load balancer if it is itself a new Single Point Of Failure (SPOF)? So if you implement haproxy as a load balancer for a production environment, you should definitely make it redundant.

This kind of clustering is not built into haproxy itself, but can be easily implemented in combination with other Open Source solutions. There is e.g. a post that explains how to make haproxy highly available with the heartbeat software. I will probably try that next in the lab :-)
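To give a rough idea of what such a setup involves: with e.g. keepalived (which one of the commenters below also uses) you would run two identical haproxy VMs and let them share a floating virtual IP address that all clients connect to. A minimal sketch of a keepalived configuration - the interface name and the virtual IP are just examples:

```
# /etc/keepalived/keepalived.conf on the primary haproxy VM (sketch only;
# interface and virtual IP are examples - the backup VM would use
# state BACKUP and a lower priority)
vrrp_instance PSC_LB {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 101
    virtual_ipaddress {
        192.168.40.85/24
    }
}
```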


Wrap-up

haproxy is an excellent and widely used Open Source load balancer. I would really love to see VMware officially support it for load balancing PSCs (and other VMware product components), and I will file a corresponding feature request soon. Please consider doing the same if you would like to use affordable, easy-to-use Open Source software in your environment!



This post first appeared on the VMware Front Experience Blog and was written by Andreas Peetz. Follow him on Twitter to keep up to date with what he posts.



Comments:

  1. Great article, if I had time I would have tried it myself!
    Just one note for the sake of completeness: you could use the free edition of NetScaler, which has a 5 Mbps throughput limitation on load balancing - that should not be a problem in this scenario.
    Could be a viable solution for production.

  2. To change this behavior:

    "So when for some reason only the web service dies on the first PSC haproxy will forward web requests to the second PSC, but continue to use the first PSC for all the tcp connections."

    you should use "track" option for backends, also having separate frontend and backend blocks for tcp is useless in this configuration, use listen.
    So it'll look like:

    listen psc-389
    description psc LDAP port 389
    mode tcp
    bind *:389

    server psc001 192.168.40.80:389 track psc-backend-443/psc001
    server psc002 192.168.40.81:389 track psc-backend-443/psc002 backup

    Replies
    1. Hi Serg,
      thanks for sharing! I will look into this.
      Andreas

    2. Interesting read. I have set something up very similar, except I used nginx and their stream module (https://docs.nginx.com/nginx/admin-guide/load-balancer/tcp-udp-load-balancer/ ) instead of haproxy. I took it one step further and deployed a second nginx (load balancer) VM with the same config. Then I created a VIP by installing keepalived on the 2 nginx VMs, making them active/passive. Everything gets pointed to the VIP. If there are any issues, the VIP is moved to the other VM. Keepalived is simple to set up and provides the same HA mechanism as NetScalers and the other ADCs. Keepalived would also work with haproxy. I just prefer nginx, and now with the TCP/UDP stream module it can do anything. nginx + keepalived = free NetScalers ... Hopefully F5 acquiring nginx doesn't ruin it.

