Best tips for building massively scalable web API platforms

By Sachin Kalaskar

“Will you be able to build a web platform that can deliver 60,000 requests per second, with a sub-second response time?”

This was the requirement given to us in a customer meeting.

This was our opportunity to build a challenging project from scratch, and I can say confidently that very few companies and people get such opportunities. We were jumping with joy, but then we realized the scale of the challenge.

I want to highlight the practices that helped us deliver a solution to this challenge.

I believe these can be termed ‘Best Practices’. They emerged from the entire journey of Designing, Building, Testing, and Deploying the solution.

1) Theme of your Project

The Theme of a project is difficult to decide; it needs to be carefully derived from the project definition and requirements.

The Theme of the project is a summary of what the project is expected to achieve. It is the foundation on which the project is built, and it drives the project.

Project Theme is a central idea or conceptual approach that determines how individual work components (processes, tasks, activities) best connect to each other to produce a cumulative effect that is far greater than the total effect generated by the individual components.

Blazing Speed, Massive Scalability, Solid Reliability, and Extreme Flexibility are some sample themes.

2) Know your Performance targets

You won’t get a performant system by asking someone to build a “really fast system”.

Performance demands vary across problems and domains, and accordingly they need very different solutions and technologies.

Software Performance is absolute; you get what you demand.

For example:

Hotel Management System – tens of thousands of requests per day

Ticket Reservation Systems – 100s of thousands of requests per day

Security Platforms – 10s of millions of requests per day

Google Search – billions of requests per day

You can choose the right solution if you know the performance targets. If performance is your objective, always Design, Build and Verify a system towards a performance target.

Sample performance and scale targets –

8 billion requests of type X, Y, and Z to be served reliably within a day

75K requests of type X to be served by the platform with response times < 1 sec

Effective Pass Rates: > 98% (measured based on HTTP response codes)

99.99% uptime
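To make such targets concrete, it can help to translate them into per-second rates and error budgets. The sketch below is only an illustration: the function names are hypothetical, and it assumes load is spread evenly over the day, which real traffic rarely is, so peak rates will be higher than the average it computes.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rps(requests_per_day: float) -> float:
    """Average requests per second, assuming perfectly even load (an assumption)."""
    return requests_per_day / SECONDS_PER_DAY


def allowed_downtime_seconds(uptime_percent: float, period_seconds: float) -> float:
    """Downtime budget implied by an uptime target over a given period."""
    return period_seconds * (1 - uptime_percent / 100)


def failure_budget(total_requests: int, pass_rate_percent: float) -> int:
    """Number of requests that may fail while still meeting the pass rate."""
    return round(total_requests * (1 - pass_rate_percent / 100))


if __name__ == "__main__":
    print(f"average load: {average_rps(8e9):,.0f} requests/sec")
    print(f"downtime budget per day: {allowed_downtime_seconds(99.99, SECONDS_PER_DAY):.2f} sec")
    print(f"failure budget per day: {failure_budget(8_000_000_000, 98.0):,} requests")
```

Under these assumptions, 8 billion requests per day works out to roughly 92,600 requests per second on average, and 99.99% uptime leaves about 8.6 seconds of downtime per day. Numbers like these are what you design, build, and verify against.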

3) Have an Eye for Details

Details are very important for all kinds of projects. Especially at a large scale, seemingly small problems can spiral out of control when minute details are neglected.

If a billion requests per day is your target scale, every minuscule problem is amplified! You need to analyze memory usage, track network bandwidth usage, single-step your code, conduct strict code reviews, etc.

Examples-

Eliminating unwanted data bytes per request could mean a reduction in the utilization of bandwidth. Bandwidth saved is bandwidth earned for some other operation.

A small memory leak of an insignificant resource per request could bring down a production server within a few hours, even if it goes undetected in your test labs.

Small optimizations that eliminate one small query per request can save billions of DB CPU cycles and many MBs of bandwidth, which automatically frees up capacity to handle more requests.

The impact of such (seemingly) small inefficiencies is only visible in production.
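A quick back-of-the-envelope calculation shows how these small per-request costs add up. The figures below (100 bytes saved, a 64-byte leak, 10,000 requests per second, 8 GiB of free memory) are made-up numbers for illustration, not measurements from our platform, and the function names are hypothetical.

```python
def daily_bandwidth_saved_gb(bytes_saved_per_request: int, requests_per_day: int) -> float:
    """Bandwidth saved per day in gigabytes (1 GB = 10**9 bytes)."""
    return bytes_saved_per_request * requests_per_day / 1e9


def hours_until_exhausted(leak_bytes_per_request: int,
                          requests_per_second: float,
                          free_memory_bytes: int) -> float:
    """Hours until a steady per-request leak consumes the free memory."""
    leaked_per_second = leak_bytes_per_request * requests_per_second
    return free_memory_bytes / leaked_per_second / 3600


if __name__ == "__main__":
    # Trimming 100 bytes from each of a billion daily requests.
    print(f"{daily_bandwidth_saved_gb(100, 1_000_000_000):.0f} GB saved per day")
    # A 64-byte leak per request at 10,000 requests/sec with 8 GiB free.
    print(f"{hours_until_exhausted(64, 10_000, 8 * 1024**3):.1f} hours to exhaustion")
```

With these illustrative inputs, 100 bytes per request becomes 100 GB per day, and a 64-byte leak exhausts 8 GiB in under four hours. A test lab that pushes only a fraction of that load may never run long enough to see it.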

4) Measuring your System Health

Let’s visualize a situation:

After the best possible development and testing of the proposed application, the system is now in a production / BETA environment.

It is subjected to the real Variety-Velocity-Veracity of live data.

It starts creaking and showing a slew of performance issues: slow responses, dropped requests, DB performance drops, choked queues, etc.

And, guess what??

These symptoms were never visible in the test labs.

The system is live and being used by customers and/or end-users.

Introducing debug builds, debuggers, or any other intrusive mechanisms is not an option at this point, since they would further aggravate the already deteriorating end-user experience.

How do we diagnose and fix them??

Sensors give you a starting point for diagnosis and analysis. They capture application-specific events, errors, and important artifacts like queue lengths, queue wait sizes, request stats (size, latencies, distribution), cache sizes, cache effectiveness, DB hits, etc.

Application Sensors, used for application-specific counters, have to be identified and planted in the code at the right locations. If done right, they can give you invaluable information about what is happening inside the system.

Do not mistake Application Sensors for your verbose diagnostic logging; they are needed in addition to it.
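As a rough idea of what planting sensors in code can look like, here is a minimal in-process sketch. The `Sensors` class, the `handle_request` function, and the counter names are all hypothetical, and the dictionary stands in for a real cache and DB; a production system would more likely export these values to its monitoring stack.

```python
import threading
import time
from contextlib import contextmanager


class Sensors:
    """Cheap, thread-safe counters, gauges, and timers (hypothetical sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}
        self._gauges = {}
        self._timings = {}  # name -> [count, total_seconds, max_seconds]

    def incr(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    @contextmanager
    def timed(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            with self._lock:
                stats = self._timings.setdefault(name, [0, 0.0, 0.0])
                stats[0] += 1
                stats[1] += elapsed
                stats[2] = max(stats[2], elapsed)

    def snapshot(self):
        """Return a point-in-time copy that is safe to log or export."""
        with self._lock:
            return {
                "counters": dict(self._counters),
                "gauges": dict(self._gauges),
                "timings": {
                    name: {"count": c, "avg_s": total / c, "max_s": worst}
                    for name, (c, total, worst) in self._timings.items()
                },
            }


def handle_request(key, sensors, cache):
    """Toy request handler with sensors planted at the interesting points."""
    with sensors.timed("request_latency"):
        sensors.incr("requests_total")
        if key in cache:
            sensors.incr("cache_hits")
        else:
            sensors.incr("db_hits")
            cache[key] = key.upper()  # stand-in for a real DB lookup
        sensors.gauge("cache_size", len(cache))
        return cache[key]
```

Calling `sensors.snapshot()` periodically gives you cache effectiveness, DB hits, and latency figures from the live system without a debug build. The sensors only update a few numbers under a lock, which keeps them far less intrusive than verbose logging.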

5) Shift-Left Testing

“Shift-Left Testing is a transformation from defect detection to defect prevention.”

Traditionally, Testing has sat at the extreme right of the software development cycle of Requirements, Design, Development, and Testing. The Shift-Left concept says: move your testing to the left, so that testing and testers are involved in every phase.

This approach helps detect and fix defects very early in the game, rather than detecting them in your traditional test cycles at the extreme right. It gives you time to understand and design the test scenarios and test cases, and to identify automation hotspots.

This is a great article on Shift-Left testing for a deeper understanding of the approach.

But here are 3 important points you must keep in mind for large-scale projects:

The impact and cost of defects found in later stages is amplified by the scale.
Shift-Left gives you valuable time and understanding to develop automation, which is of paramount importance for testing at this scale.
It ensures that the theme of the project (Performance and Scalability) is ingrained in all stages of the project.

6) Extreme Automation

Automation plays a very important role in product and project development. Delivering platforms and systems at this scale is not possible without extreme automation.

How do you get the confidence to host your system in production and expect it to process millions and billions of requests?

You need to have dedicated Automation architects working in a Shift-Left mode, right from the requirements stage, identifying, designing, developing, and verifying the automation in time for it to be useful from day one.

Examples of ‘Extreme Automation’-

Generate huge amounts of test data for different load and data profiles.
Generate huge amounts of requests in minimal time to simulate massive concurrency.
- This could easily mean setting up farms of JMeter or Tsung Nodes, which need additional management and automation.
Automated BVTs in your CI/CD pipelines for functional sanity and regressions.
Nightly automated performance and soak tests to detect daily progress and variations.
Daily automated performance tests and reports to analyze the effect of every change going into the builds. This helps in the early detection of big deviations.
Automated reports that present results in an easy-to-read and easy-to-understand fashion. On a daily basis, this saves a huge amount of time in analysis!
Monitoring, capturing, and analyzing application and system counters to verify resource utilization (CPU, memory, IO, caches, handles, etc.).

The list goes on…
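As one small example of the daily automated reports mentioned above, the sketch below compares today's performance numbers against a baseline and flags large deviations. The metric names, the 10% tolerance, and the sample values are assumptions chosen for illustration; in practice the inputs would come from your nightly test runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    metric: str
    baseline: float
    current: float
    change_percent: float


def find_deviations(baseline: dict, current: dict, tolerance_percent: float = 10.0) -> list:
    """Return metrics that moved more than tolerance_percent from the baseline."""
    deviations = []
    for metric, base_value in sorted(baseline.items()):
        if metric not in current or base_value == 0:
            continue  # a relative change cannot be computed
        change = (current[metric] - base_value) / base_value * 100
        if abs(change) > tolerance_percent:
            deviations.append(
                Deviation(metric, base_value, current[metric], round(change, 1))
            )
    return deviations


def render_report(deviations: list) -> str:
    """Produce a short, human-readable summary for the daily report."""
    if not deviations:
        return "No significant deviations."
    return "\n".join(
        f"{d.metric}: {d.baseline} -> {d.current} ({d.change_percent:+.1f}%)"
        for d in deviations
    )


if __name__ == "__main__":
    # Hypothetical numbers from two consecutive nightly runs.
    baseline = {"p95_latency_ms": 400.0, "throughput_rps": 60000.0}
    tonight = {"p95_latency_ms": 520.0, "throughput_rps": 59000.0}
    print(render_report(find_deviations(baseline, tonight)))
```

A check like this can run at the end of every nightly performance job, so a regression introduced by yesterday's build shows up in the next morning's report instead of weeks later.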

7) Controlled Roll-Outs

How do you seamlessly roll out your platform for a 25+ Million user base (hopefully) without causing service disruption?

There is no easy, canned solution for this. Depending on the type of application and its criticality, you have to do what works best for you.

Once your platform is ready to be rolled out, based on our experience we suggest following the guidelines below:

Design multiple Betas to scale up to the final numbers, features, and geographies.
Even after making the platform robust, secure, and scalable, identify the things that could go wrong: servers, DB, overload, network congestion, never-seen-before load profiles, network attacks, etc. Start building contingencies (in the platform and the clients), fallbacks for such situations, a robust update mechanism, monitors, and switches to fall back to the existing solution.
In a very controlled environment, perform drills by simulating these failure situations.
Now, with all the contingencies, monitors, tools, etc. ready, start rolling out the designed Betas carefully and monitor them diligently. It will take some time to stabilize, tweak, scale, secure, and occasionally fix the components and environments when they are subjected to live data loads. This is normal.

This entire exercise will give you confidence and help you roll out to the entire user base. It will also give you the following invaluable information and experience in the earlier stages of the rollouts:

Different load profiles, vis-à-vis the time, geographies
Live Benchmarking & Capacity planning of individual components and the system
Identifying average-peak loads, avg-peak performance
Tweak & strengthen the environment, system, and application
Arrive at acceptable thresholds
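One common way to implement staged Betas like the ones described in this section is to assign each user a stable bucket and open the new platform to a growing percentage of buckets, region by region. This is a sketch of that general technique, not a description of our actual rollout mechanism; the function names, parameters, and region codes are hypothetical.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_platform(user_id: str,
                     region: str,
                     rollout_percent: int,
                     enabled_regions: set,
                     kill_switch: bool = False) -> bool:
    """Decide whether this user is served by the new platform.

    Returning False means the client stays on the existing solution.
    """
    if kill_switch:
        return False  # emergency fallback for everyone
    if region not in enabled_regions:
        return False  # geography not part of this Beta yet
    return rollout_bucket(user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    for percent in (1, 5, 25, 100):
        served = sum(use_new_platform(u, "IN", percent, {"IN"}) for u in users)
        print(f"{percent:>3}% rollout -> {served} of {len(users)} users")
```

Because the bucket depends only on the user ID, a user admitted at 5% stays admitted at 25%, which keeps their experience stable as each Beta widens. The kill switch gives you a single lever to send everyone back to the existing solution if a Beta misbehaves.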

8) Everything can go wrong – Build Contingencies and Switches

“Anything that can go wrong will go wrong”

-Our good old friend Murphy

Please believe him, especially when you are building systems of such scales.

Spend time purposefully identifying areas of failure in your systems, for example:

Identify the working set of your memory at average loads, and peak loads, and set thresholds for monitoring.

Identify performance bottlenecks, and devise methods to free up those bottlenecks; usual suspects are Queues, Caches, and DB in a concurrent environment. Devise elastic solutions to scale up these areas if needed.

Identify critical areas of your solution, strengthen your code, add code to profile, measure, and build redundancies around it.
Simulate these failures and test if your contingencies really work when needed.

In the event of a failure affecting business continuity, make sure you have a fallback plan B (maybe an existing system).

Account for external failures: HW crashes, network glitches, and router malfunctions. Build backup systems: data backup and disaster recovery (hot or cold recoveries).

Account for security attacks and weaknesses: malware, Denial of Service, SQL injections, open ports, weak passwords, un-obfuscated application signatures, etc. Plan and execute mitigation strategies, security code reviews, security testing, external security software, etc.
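To illustrate the "plan B" idea from the list above, here is a minimal sketch of a switch that routes calls to a fallback (for example, an existing system) after the primary fails repeatedly, then retries the primary after a cool-down. It is a simplified variant of the well-known circuit breaker pattern, not our production code: the class name, thresholds, and injectable `clock` are assumptions, and it is not thread-safe as written.

```python
import time


class FallbackSwitch:
    """Route calls to a fallback after repeated primary failures (sketch)."""

    def __init__(self, primary, fallback, failure_threshold=3,
                 retry_after_s=30.0, clock=time.monotonic):
        self.primary = primary
        self.fallback = fallback
        self.failure_threshold = failure_threshold
        self.retry_after_s = retry_after_s
        self._clock = clock          # injectable so drills can simulate time
        self._failures = 0
        self._tripped_at = None      # None means the primary is in use

    def call(self, *args, **kwargs):
        if self._tripped_at is not None:
            if self._clock() - self._tripped_at < self.retry_after_s:
                return self.fallback(*args, **kwargs)
            # Cool-down elapsed: allow one trial call to the primary.
            self._tripped_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = self.primary(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._tripped_at = self._clock()
            return self.fallback(*args, **kwargs)
        self._failures = 0
        return result
```

Injecting the clock makes it easy to simulate the failure in a drill and verify that the contingency really kicks in, rather than discovering in production that it does not.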

You don’t get a second chance to reproduce errors in live environments and capture logs & diagnostics.

You should have a hot switch to toggle any setting while the system is running. Changing a setting in the settings file and restarting is never an option, since systems at this scale take a huge amount of time to restart and warm up again; forget about reproducing the same state of components.

It helps to have interfaces that allow toggling of settings while the system is up and running. This is a huge help in diagnosis (by getting selective verbose dumps of components, caches, etc), exactly when needed.
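A hot switch can be as simple as a thread-safe settings object that components consult on every request. The sketch below assumes some admin interface (not shown) calls `set()` at runtime; the class, the `verbose_cache_dump` setting, and the `process` function are hypothetical names used for illustration.

```python
import threading


class RuntimeSettings:
    """Settings that can be toggled while the system keeps running (sketch)."""

    def __init__(self, defaults: dict):
        self._lock = threading.Lock()
        self._values = dict(defaults)

    def get(self, name):
        with self._lock:
            return self._values[name]

    def set(self, name, value):
        """Change a known setting and return its previous value."""
        with self._lock:
            if name not in self._values:
                raise KeyError(f"unknown setting: {name}")
            previous = self._values[name]
            self._values[name] = value
            return previous


def process(request_id, settings, cache):
    """Toy request handler that attaches a cache dump only when asked to."""
    result = {"request_id": request_id, "status": "ok"}
    if settings.get("verbose_cache_dump"):
        result["cache_dump"] = dict(cache)  # selective diagnostics on demand
    return result


if __name__ == "__main__":
    settings = RuntimeSettings({"verbose_cache_dump": False})
    cache = {"session:42": "warm"}
    print(process(1, settings, cache))
    settings.set("verbose_cache_dump", True)   # flipped live, no restart
    print(process(2, settings, cache))
```

Because the setting is read on each request, the verbose dump can be switched on exactly when the problem is occurring and switched off again afterwards, with the caches and other warmed-up state left untouched.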

Final Words

Building a high-performance and massively scalable system needs more than just a good design and architecture. The entire team needs to eat-drink-sleep performance and scalability.

The above practices have helped our team phenomenally to not only meet but also beat the performance and scalability targets, not once but multiple times.

These are some key learnings from our journey toward achieving the stiff performance SLAs that we had set out to meet.

Would love to see any additions to this list, based on your first-hand experiences.
