God is in the details
The system crashed after 7 days of full load testing!
Experts scrambled on-site to hunt down the problem, which turned out to be a memory issue. It was fixed, and all the test engineers were called in to try to reproduce the crash. 24 hours passed without a crash.
Then it crashed again under full load, this time after 10 days. Frustrating, yet intriguing!
Our decision to test the system at full load for a significant amount of time before putting it into production was paying off.
After a meticulous diagnosis, we found a memory leak of less than 100 bytes. A leak that small would never manifest as a crash during lab testing.
This shows how much the finer details matter, especially when you are building at a large scale.
If the scalability target is a few billion requests/day, every minuscule problem is amplified!
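To see how quickly a tiny problem grows, here is a rough back-of-the-envelope sketch in Python. The numbers are assumptions chosen for illustration, not measurements from the system described above: 100 bytes leaked on every request, 2 billion requests/day, and 16 GiB of free memory on the server.

```python
GIB = 1024 ** 3


def hours_until_exhaustion(leak_bytes_per_request: float,
                           requests_per_day: float,
                           free_memory_bytes: float) -> float:
    """Hours until a steady per-request leak consumes the free memory."""
    if leak_bytes_per_request <= 0 or requests_per_day <= 0:
        raise ValueError("leak size and request rate must be positive")
    leaked_per_hour = leak_bytes_per_request * requests_per_day / 24
    return free_memory_bytes / leaked_per_hour


# Illustrative assumptions only (not figures from the incident above).
hours = hours_until_exhaustion(
    leak_bytes_per_request=100,
    requests_per_day=2_000_000_000,
    free_memory_bytes=16 * GIB,
)
print(f"Memory exhausted after roughly {hours:.1f} hours")
```

Under those assumptions the server runs out of memory in about two hours, even though each individual leak is too small to notice.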
Here are some of the things to watch.
– Eliminating unwanted data bytes/requests could significantly reduce bandwidth utilization.
– A small memory leak from a seemingly insignificant resource/request could bring the server down in a few hours, yet it is difficult to detect during lab testing.
– Small optimizations, like eliminating small queries/requests, can save billions of cycles of your DB CPU. Freeing up more capacity allows you to handle more requests.
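One way to catch a small per-request leak before it reaches production is to measure memory growth over many repeated calls rather than waiting for a crash. The sketch below uses Python's standard tracemalloc module. It is only an illustration: the handlers and the `_sessions` list are hypothetical stand-ins, and the system described above may be written in a different language with different tooling.

```python
import gc
import tracemalloc
from typing import Callable


def leaked_bytes_per_call(handler: Callable[[], None],
                          calls: int = 10_000,
                          warmup: int = 100) -> float:
    """Average growth in traced memory per call of `handler`."""
    for _ in range(warmup):  # let one-time allocations settle first
        handler()
    gc.collect()
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(calls):
            handler()
        gc.collect()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return (after - before) / calls


_sessions = []  # hypothetical global that is never cleaned up


def leaky_handler() -> None:
    _sessions.append(bytes(64))  # keeps a small object alive per request


def clean_handler() -> None:
    bytes(64)  # allocated and released immediately


print(f"leaky: {leaked_bytes_per_call(leaky_handler):.1f} bytes/call")
print(f"clean: {leaked_bytes_per_call(clean_handler):.1f} bytes/call")
```

A growth figure that stays well above zero as the number of calls increases is a hint that something is being retained on every request.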
How do you test for scalability?