Fault tolerance in microservices
One fine day (or rather, a not-so-fine one) our brand-new microservice-based app stopped working during off-hours. We heard the shocking news when we started work, and the investigation, aka debugging, began. After hours of impatient slogging, the root cause turned out to be a microservice that wrote to a telemetry service. It had gone down for an unforeseen reason and brought the whole web of microservices down with it!
We, as human beings, prioritize naturally: when there isn't enough time or resources for every nice-to-do task, we drop the non-essential or less crucial ones as a trade-off. Computers have to be programmed to do this, and microservices are no exception. In the case above, it would have helped if the other services had simply stopped using the telemetry service. Telemetry data is used for cold-path analysis, where only trends are needed rather than precise values, so losing a few data points is an acceptable price for keeping the app running.
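To make the idea concrete, here is a minimal sketch of a service treating telemetry as best-effort. The names `place_order`, `send_telemetry_best_effort`, and `TelemetryUnavailable` are made up for illustration, and `send` stands in for whatever client actually talks to the telemetry service.

```python
import logging

logger = logging.getLogger("orders")


class TelemetryUnavailable(Exception):
    """Hypothetical error raised by a telemetry client when the service is down."""


def send_telemetry_best_effort(send, event):
    """Try to record an event; report failure instead of raising."""
    try:
        send(event)
        return True
    except Exception as exc:  # deliberately broad: telemetry must never break callers
        logger.warning("telemetry dropped: %s", exc)
        return False


def place_order(order, send):
    # Core function: must succeed even when telemetry is down.
    result = {"order_id": order["id"], "status": "accepted"}
    # Nice-to-have: a lost data point is acceptable for trend analysis.
    send_telemetry_best_effort(send, {"type": "order_placed", "id": order["id"]})
    return result
```

The core path never sees the telemetry error; it is only logged. In a real service you would probably also put a short timeout on the telemetry call so that a slow telemetry service cannot stall the caller.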
Fault tolerance is built into microservices with this idea in mind: do everything, but when errors occur, keep performing the core functions and set aside the nice-to-have ones. In particular, the following techniques help build fault-tolerant microservices.
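- Retries and timeouts: These are well-known measures typically taken by the callers of a service. The caller stops waiting for a response after a set time (timeout) and tries the call again (retry).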
- Exponential back-off: This measure is also taken by the caller of a service. If a service is not functioning well, call it less frequently: increase the wait before the next retry each time you receive an error, typically by doubling it. This spares the struggling service from spending resources on the same failing requests again and again.
- Circuit breaker: This helps when a service is failing and its callers go into retry mode, which increases the load further. The circuit breaker is a small service that sits alongside the main service and tracks the proportion of failing calls. When failures exceed a threshold, it returns a failure for a timeout period without calling the main service. After the timeout period, it lets a small portion of calls through to the main service and, if they succeed, resumes all traffic.
- Rate limiters: A service limits the number of requests it will serve per given time period or per client. Rate limiting keeps a service from being overloaded at any time, so it is a preventive measure. The limits can be static, written in a config file, or dynamic, calculated by another small service based on the load on the system, the number of requests currently in the queue, etc.
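The first two measures, retries and exponential back-off, usually live in the same helper on the caller's side. Below is a rough sketch, not a production client. The default values, the random jitter, and the names `backoff_delays` and `call_with_retries` are assumptions of this example. Timeouts are assumed to be enforced inside `func` itself, for instance through the timeout option of whatever HTTP client you use.

```python
import random
import time


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5):
    """Return the wait before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def call_with_retries(func, attempts=5, base=0.5, factor=2.0, cap=30.0,
                      jitter=True, sleep=time.sleep):
    """Call func(); on failure wait longer each time, re-raise after the last try."""
    delays = backoff_delays(base, factor, cap, attempts - 1)
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = delays[attempt]
            if jitter:
                # Randomize the wait so many callers do not retry in lockstep.
                delay = random.uniform(0, delay)
            sleep(delay)
```

With the defaults, the waits grow as 0.5, 1, 2, and 4 seconds before jitter is applied. The `sleep` parameter is injectable only so the helper can be exercised without actually waiting.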
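The circuit breaker described above can be sketched in a few lines. This version makes some simplifying assumptions: it is an in-process wrapper rather than a separate small service, it lets a single trial call through after the timeout rather than a portion of traffic, and it is not thread-safe. The class name, parameters, and defaults are all illustrative.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    def __init__(self, failure_ratio=0.5, window=10, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_ratio = failure_ratio   # open when this share of recent calls failed
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.reset_timeout = reset_timeout   # seconds to fail fast before probing
        self.clock = clock
        self.opened_at = None                # None means the circuit is closed

    def call(self, func):
        half_open = False
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast; service not called")
            half_open = True  # timeout elapsed: let this call through as a probe
        try:
            result = func()
        except Exception:
            self._record_failure(half_open)
            raise
        if half_open:
            # A successful probe closes the circuit and clears the history.
            self.opened_at = None
            self.results.clear()
        self.results.append(True)
        return result

    def _record_failure(self, half_open):
        self.results.append(False)
        failures = self.results.count(False)
        window_full = len(self.results) == self.results.maxlen
        if half_open or (window_full and failures / len(self.results) >= self.failure_ratio):
            self.opened_at = self.clock()  # (re)open and restart the timeout
```

Callers wrap each remote call as `breaker.call(lambda: client.get(...))`, where `client` is whatever you already use, and treat `CircuitOpenError` like any other failure of a non-essential dependency.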
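For rate limiting, one common way to enforce a static limit is a token bucket. That algorithm is this example's choice, not something the list above prescribes. The sketch keeps one bucket per client; `rate`, `capacity`, and the class names are assumptions, and a dynamic limiter would adjust `rate` at runtime instead of fixing it up front.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject, e.g. with HTTP 429 Too Many Requests


class PerClientLimiter:
    """Keep an independent bucket for every client id."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.buckets = {}

    def allow(self, client_id):
        bucket = self.buckets.get(client_id)
        if bucket is None:
            bucket = self.buckets[client_id] = TokenBucket(self.rate, self.capacity, self.clock)
        return bucket.allow()
```

A request handler would call `limiter.allow(client_id)` first and reject the request when it returns `False`. Because each client has its own bucket, one noisy client cannot use up another's allowance.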
All these measures help automate failure handling in microservices and keep most of the system up and running most of the time.
Which of these measures have you implemented before? Do let us know.