{"id":68,"date":"2023-03-24T14:31:00","date_gmt":"2023-03-24T22:31:00","guid":{"rendered":"https:\/\/www.dumpsterfirecomputing.com\/?p=68"},"modified":"2023-03-24T14:31:00","modified_gmt":"2023-03-24T22:31:00","slug":"tinkering-with-resilient-servers","status":"publish","type":"post","link":"https:\/\/www.dumpsterfirecomputing.com\/?p=68","title":{"rendered":"Tinkering with Resilient Servers"},"content":{"rendered":"\n<p>This past week there were a few conversations about disaster recovery that I was involved in, and one of the interesting things I see all over is a lack of resilient system design.  One server running a critical business service, and the only protection it has from any sort of disaster is a backup.  We hope.  That type of system design has near-zero resilience, and would look something like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"440\" height=\"437\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-12.png\" alt=\"\" class=\"wp-image-69\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-12.png 440w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-12-300x298.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-12-150x150.png 150w\" sizes=\"auto, (max-width: 440px) 100vw, 440px\" \/><figcaption class=\"wp-element-caption\">No resilience<\/figcaption><\/figure>\n\n\n\n<p>This would be fine if the server is running a special version of Notepad++ that your team needs, but really not ideal for a Tier 1, mission-critical piece of software.<\/p>\n\n\n\n<p>There are different ways to slice and dice the issue and solve the problem, but one of the quickest ways is to stand up a second server and toss a load balancer in front of it, like this:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-13.png\" alt=\"\" class=\"wp-image-70\" width=\"437\" height=\"424\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-13.png 667w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-13-300x291.png 300w\" sizes=\"auto, (max-width: 437px) 100vw, 437px\" \/><figcaption class=\"wp-element-caption\">Regional Resilience<\/figcaption><\/figure>\n\n\n\n<p>In this model, you can at least patch and reboot each of these servers one at a time and not incur any real outage.  In Azure, building out a <a rel=\"noreferrer noopener\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/load-balancer\/\" target=\"_blank\">Public Load Balancer<\/a> is really quite easy and can be done in just a few clicks.<\/p>\n\n\n\n<p>One nice thing about this approach is that rather than each of your servers having its own public IP, the public IP is attached to the front side of the load balancer.  So now your servers are at least not directly exposed to the internet.  But&#8230;what happens if there&#8217;s a regional outage?  
It definitely can, and does, happen (just over a year ago a major <a rel=\"noreferrer noopener\" href=\"https:\/\/www.rcrwireless.com\/20211208\/telco-cloud\/aws-us-east-1-region-outage-cripples-amazon-and-hosted-services\" target=\"_blank\">AWS region was offline for a couple of hours<\/a>).<\/p>\n\n\n\n<p>The next step up is a multi-region or globally load-balanced option, and in this case I went with a Traffic Manager profile to sit in front of a pair of load balancers:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"870\" height=\"672\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-14.png\" alt=\"\" class=\"wp-image-71\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-14.png 870w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-14-300x232.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-14-768x593.png 768w\" sizes=\"auto, (max-width: 870px) 100vw, 870px\" \/><figcaption class=\"wp-element-caption\">Global Load Balancing<\/figcaption><\/figure>\n\n\n\n<p>Traffic Manager profiles balance across the Public IP Addresses of the load balancers, and do so via DNS, so you need to make sure that the Public IP Addresses have DNS names attached to them.<\/p>\n\n\n\n<p>Building out a <a rel=\"noreferrer noopener\" href=\"https:\/\/learn.microsoft.com\/en-us\/azure\/traffic-manager\/traffic-manager-manage-profiles\" target=\"_blank\">Traffic Manager Profile<\/a> was also pretty simple &#8211; you essentially just give it a name, attach the endpoints you want (Public IP Addresses in my case), and it starts working in seconds.  Below are some tests I did with a lab environment modeled after the above diagram.  
I had two servers in Azure West (WESTSRV01 and WESTSRV02) and two servers in Azure East (EASTSRV01 and EASTSRV02).<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1012\" height=\"404\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-16.png\" alt=\"\" class=\"wp-image-73\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-16.png 1012w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-16-300x120.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-16-768x307.png 768w\" sizes=\"auto, (max-width: 1012px) 100vw, 1012px\" \/><\/figure>\n\n\n\n<p>I started with a PSPING test to the public DNS name on the Traffic Manager Profile.  My client resolved it to the load balancer IP for the Azure West servers, and you can see that we&#8217;ve first connected to WESTSRV01.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"894\" height=\"545\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-17.png\" alt=\"\" class=\"wp-image-74\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-17.png 894w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-17-300x183.png 300w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-17-768x468.png 768w\" sizes=\"auto, (max-width: 894px) 100vw, 894px\" \/><\/figure>\n\n\n\n<p>After powering off WESTSRV01, you can see we dropped two pings but the service quickly switched over to WESTSRV02.  
You can do some tuning on both the Traffic Manager Profile and the Load Balancer to tighten up the failover time &#8211; in my case I used mostly defaults, and only two dropped pings isn&#8217;t bad.<\/p>\n\n\n\n<p>The next test was to shut off WESTSRV02 as well and watch it fail over to the pair in Azure East.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"634\" height=\"400\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-18.png\" alt=\"\" class=\"wp-image-75\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-18.png 634w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-18-300x189.png 300w\" sizes=\"auto, (max-width: 634px) 100vw, 634px\" \/><\/figure>\n\n\n\n<p>In this case, it took a few seconds for the page to refresh and show EASTSRV01, which can almost certainly be improved by tuning the health probe timeouts and such.  But notice all the dropped pings?  Because Traffic Manager is a DNS-based global load balancing solution, my client machine needs to resolve the DNS name to an IP address again.  
After stopping the PSPING test and restarting it, we&#8217;re back to happy ping tests:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"508\" height=\"259\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-19.png\" alt=\"\" class=\"wp-image-76\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-19.png 508w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-19-300x153.png 300w\" sizes=\"auto, (max-width: 508px) 100vw, 508px\" \/><\/figure>\n\n\n\n<p>The final test, of course, is to shut off EASTSRV01 and watch it roll to EASTSRV02:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"514\" height=\"362\" src=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-21.png\" alt=\"\" class=\"wp-image-78\" srcset=\"https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-21.png 514w, https:\/\/www.dumpsterfirecomputing.com\/wp-content\/uploads\/2023\/03\/image-21-300x211.png 300w\" sizes=\"auto, (max-width: 514px) 100vw, 514px\" \/><\/figure>\n\n\n\n<p>By this point in my test, three of the four servers in my lab environment are powered off, and yet service is still available.<\/p>\n\n\n\n<p>This was a fun lab to set up and watch.  I hope to work with teams in the future as they consider these types of solutions for adding resiliency to their critical servers.  In recent DR exercises, I&#8217;ve seen failover times extend longer than they should.  If you can build your environment to fail over automatically, wouldn&#8217;t that be better?  
Sure, there are costs associated with building it out, but I know I don&#8217;t want to be called at 2am on Christmas Day for a mission-critical outage&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This past week there were a few conversations about disaster recovery that I was involved in, and one of the interesting things I see all over is a lack of [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[4,18,6,19],"tags":[10,21,20],"class_list":["post-68","post","type-post","status-publish","format-standard","hentry","category-azure","category-load-balancer-azure","category-management","category-traffic-manager","tag-azure","tag-disaster-recovery","tag-resiliency"],"_links":{"self":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/68","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=68"}],"version-history":[{"count":1,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/68\/revisions"}],"predecessor-version":[{"id":79,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=\/wp\/v2\/posts\/68\/revisions\/79"}],"wp:attachment":[{"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=68"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=68"},{"
taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dumpsterfirecomputing.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=68"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}