Build Contingencies and Switches, when building scalable software

Everything can go wrong – Build Contingencies and Switches

    “Anything that can go wrong will go wrong”

    - Our good old friend Murphy

Please believe him, especially when you are building systems at this scale.

Deliberately spend time identifying where your system can fail. For example:

  • Identify the working set of your memory at average and peak loads, and set thresholds for monitoring.
  • Identify performance bottlenecks and devise ways to relieve them; in a concurrent environment the usual suspects are queues, caches, and the database. Devise elastic solutions so these areas can scale up when needed.
  • Identify the critical areas of your solution: harden that code, add instrumentation to profile and measure it, and build redundancy around it.
  • Simulate these failures and test whether your contingencies really work when needed.
  • For failures that affect business continuity, make sure you have a fallback plan B (possibly an existing system).
  • Account for external failures: hardware crashes, network glitches, router malfunctions. Build backup systems: data backups and disaster recovery (hot or cold).
  • Account for security attacks and weaknesses: malware, denial of service, SQL injection, open ports, weak passwords, un-obfuscated application signatures, etc. Plan and execute mitigation strategies such as security code reviews, security testing, and external security software.
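
As a rough illustration of the "simulate failures" and "plan B" points in the list above, the sketch below wraps a primary call with a fallback and injects failures on purpose, so you can check that the contingency really runs. All names here (`FaultInjector`, `call_with_fallback`, `primary_lookup`, `legacy_lookup`) are hypothetical stand-ins for your own components, not part of any particular library.

```python
import random


class FaultInjector:
    """Raises an error for a configurable fraction of calls (for testing only)."""

    def __init__(self, failure_rate, seed=None):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)

    def maybe_fail(self):
        # random() returns a value in [0, 1), so 1.0 always fails and 0.0 never does.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")


def call_with_fallback(primary, fallback, *args):
    """Try the primary path; on any failure use plan B and report which path ran."""
    try:
        return primary(*args), "primary"
    except Exception:
        return fallback(*args), "fallback"


# Hypothetical primary system, with fault injection switched fully on.
injector = FaultInjector(failure_rate=1.0)


def primary_lookup(key):
    injector.maybe_fail()
    return f"new:{key}"


def legacy_lookup(key):
    # Hypothetical plan B, for example the existing system.
    return f"legacy:{key}"


result, path = call_with_fallback(primary_lookup, legacy_lookup, "user-42")
```

Reporting which path served the request is a deliberate choice: it lets a test (or a monitor) confirm that the fallback was exercised rather than assuming it was.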

You don’t get a second chance to reproduce errors in live environments and capture logs & diagnostics.

You should have a hot switch to toggle any setting while the system is running. Changing a value in the settings file and restarting is never an option: systems at this scale take a long time to restart and warm up again, and you can forget about reproducing the same component state.
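
One possible shape for such a hot switch is a small thread-safe registry that code consults on every use, so a change takes effect without a restart. This is a minimal sketch under assumptions: the `HotSwitches` class and the setting names `cache_enabled` and `max_batch_size` are made up for illustration.

```python
import threading


class HotSwitches:
    """Thread-safe settings that can be changed while the process keeps running."""

    def __init__(self, defaults):
        self._lock = threading.Lock()
        self._values = dict(defaults)

    def get(self, name):
        with self._lock:
            return self._values[name]

    def set(self, name, value):
        with self._lock:
            # Reject unknown names so a typo cannot silently create a new switch.
            if name not in self._values:
                raise KeyError(f"unknown switch: {name}")
            self._values[name] = value


# Hypothetical settings for illustration.
switches = HotSwitches({"cache_enabled": True, "max_batch_size": 500})


def handle_request(items):
    # Read the switch on every call so a change applies to the next request.
    batch = items[: switches.get("max_batch_size")]
    return len(batch)
```

In a real system `switches.set(...)` would be called from whatever control channel you trust (an admin endpoint, a signal handler, a management console); that part is left out here.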

It helps to have interfaces that allow settings to be toggled while the system is up and running. This is a huge help in diagnosis, because you can get selective verbose dumps of components, caches, etc. exactly when needed.
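
For selective verbosity, Python's standard `logging` module already allows changing one logger's level at runtime. The sketch below assumes each component logs under its own name; the component names `app.cache` and `app.queue` and the helper `set_verbosity` are hypothetical.

```python
import logging

logging.basicConfig(level=logging.WARNING)


def set_verbosity(component, verbose):
    """Switch one component's logger between DEBUG and WARNING at runtime."""
    level = logging.DEBUG if verbose else logging.WARNING
    logging.getLogger(component).setLevel(level)


# Hypothetical component loggers.
cache_log = logging.getLogger("app.cache")
queue_log = logging.getLogger("app.queue")

set_verbosity("app.queue", False)  # the queue stays quiet
set_verbosity("app.cache", True)   # verbose output for the cache only

cache_log.debug("cache state dump requested")   # emitted
queue_log.debug("queue state dump requested")   # suppressed
```

Turning verbosity back off with `set_verbosity("app.cache", False)` once the diagnosis is done keeps log volume under control.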
