The LocalDirector

 

The LocalDirector was conceived in late 1995 when I was having a meeting with Mr. Robert Andrews, Webmaster of Netscape Communications.  Robert had a problem with 10 or so of their customer’s servers.  The servers were using round robin DNS, and some of them were getting very busy (when an IP address was cached by a big ISP’s DNS system, for example) while other machines were doing almost nothing.  On the omnipresent white board at our offices at the Palo Alto Airport, I drew a picture of a box that had the servers on one side and the Internet on the other.  I had the idea of creating a layer 2 device (a bridge) that, if the software failed, would be transparent to traffic.  Then, on top of that layer 2 device, I proposed building a virtual machine at layer 3 and above.  This virtual machine would then map to all physical machines on the server side of the box.  The initial implementation allowed for 1024 physical machines to be mapped to a single virtual machine in the LocalDirector.


I remember the project taking 2 engineers (Brantley Coile and Richard Howes) to complete, and it was ready for first customer ship (FCS) in about 6 months.  On June 10, 1996 I was diagnosed with cancer.  In the 3 days I had between finding out I had cancer and the surgery, I did many things.  Ed Kozel (my manager at the time) told me to just drop everything.  My way of dropping the LocalDirector product was to call Brantley (who lived in Athens, GA) and ask him if he could relocate to our Palo Alto office for a short time while I went through my surgery and recovered.  He kindly agreed.  It is my recollection that the LocalDirector was only a small number of weeks from FCS at that point.  Brantley and Richard worked on that while I was out of commission.


So, as for features, the LocalDirector (LD) had the following at first customer ship:


- The LD's core functionality at FCS was Load Balancing and Fault Tolerance.

- Load Balancing was supported on up to 1024 physical machines.

- The LD operated at layers 3, 4 and above.

- The concept of the LD was to determine how busy a server was at both the system level (TCP response time) and at the application level (application response time).

- The original SDA (Session Distribution Algorithm) was built to measure these response times and produce a factor for each server.

- This factor for each server allowed a network of heterogeneous servers to be used (some fast, some slow) and the connections would be distributed evenly among all servers.

- If a physical server got into a resource deficient condition (a lack of memory for example), then the system and/or application response time would increase and therefore new connections would not be sent to that server until the condition was resolved and the response time improved.

- If half of the physical servers lost electricity, for example, then the system and application response times for those servers would go to infinity.  The LD would notice this condition and stop sending connections to them.  If this persisted, then each such server would be marked down.  The LD would then periodically try to contact each failed machine (once per minute, if I recall).  If a failed machine started responding again (say power returned), the LD would send just one connection to that server.  If it handled the task well, the LD would notice that (the SDA factor would decrease) and it would send another connection and so on, until the server's SDA factor roughly matched those of the other physical servers in the farm.

- The LD did all the work at layer 3 (IP) and above.  Layer 2 was done only to assure a transparent connection between the Internet router and the servers.
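
As an illustration only, the behavior described in the list above can be sketched in a few lines of Python.  This is not the LocalDirector's code and not the original SDA.  Every name here (RealServer, VirtualServer, pick, report, probe) is hypothetical, and the smoothing weight, the failure count before a server is marked down, the starting factor for a recovered server, the 1.5x threshold for leaving probation and the scoring rule (factor times outstanding connections plus one) are all assumptions chosen to keep the example small.  Only the 1024 limit comes from the text.

```python
from dataclasses import dataclass
from statistics import median

MAX_REAL_SERVERS = 1024   # limit given in the text for the initial implementation
ALPHA = 0.5               # assumed smoothing weight for new measurements
DOWN_AFTER = 3            # assumed number of consecutive failures before "down"
RECOVERY_FACTOR = 10.0    # assumed pessimistic factor for a recovered server


@dataclass
class RealServer:
    name: str
    factor: float = 1.0      # smoothed response time; lower means less busy
    active: int = 0          # connections currently outstanding
    failures: int = 0        # consecutive connections with no response
    down: bool = False
    probation: bool = False  # recovered, being eased back in one connection at a time


class VirtualServer:
    """One virtual machine mapped to many physical machines."""

    def __init__(self, names):
        if len(names) > MAX_REAL_SERVERS:
            raise ValueError("too many real servers for one virtual server")
        self.servers = [RealServer(n) for n in names]

    def _typical_factor(self):
        healthy = [s.factor for s in self.servers if not s.down and not s.probation]
        return median(healthy) if healthy else RECOVERY_FACTOR

    def pick(self):
        """Choose the server for a new connection, or None if none is usable."""
        # A recovering server receives a single trial connection at a time.
        for s in self.servers:
            if s.probation and not s.down and s.active == 0:
                s.active += 1
                return s
        candidates = [s for s in self.servers if not s.down and not s.probation]
        if not candidates:
            return None
        # Prefer the server with the lowest factor, weighted by outstanding work.
        best = min(candidates, key=lambda s: s.factor * (s.active + 1))
        best.active += 1
        return best

    def report(self, server, response_time):
        """Record a finished connection; response_time=None means no response."""
        server.active = max(0, server.active - 1)
        if response_time is None:
            server.failures += 1
            if server.failures >= DOWN_AFTER:
                server.down = True
                server.probation = False
            return
        server.failures = 0
        server.factor = ALPHA * response_time + (1 - ALPHA) * server.factor
        # Leave probation once the factor roughly matches the rest of the farm.
        if server.probation and server.factor <= 1.5 * self._typical_factor():
            server.probation = False

    def probe(self, server, responded):
        """Periodic check of a down server (once per minute in the text's recollection)."""
        if server.down and responded:
            server.down = False
            server.failures = 0
            server.probation = True
            server.factor = RECOVERY_FACTOR
```

In this sketch a slow server is picked less often because its factor is higher, a server that stops responding is marked down after a few failures, and a server that answers a probe again is given one connection at a time until its factor comes back in line with the others.  The sketch folds the system-level and application-level response times into a single number, which is a simplification of what the text describes.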


Many more SDAs were added to the LD as the years went by.


More information to come...