For DevOps, this is where the nightmare begins, unless there is a specific policy for falling back or degrading the service gradually. We call it Software Systems Triage, a term borrowed from the ER, which should give you an idea of what we do in that case. As in any ER department, you prioritize the tasks at hand and try to get the best results with the resources available. Normally, the whole application is served by many services or micro-services, and it can happen that some of them are down for unforeseen reasons. I am not talking about the normal design where services sit behind a load balancer and overload is handled by spinning up more instances on different machines, but about the case where the “contamination” has spread across all instances for some strange reason.
Think of an upgrade that was made in haste and not properly tested, a firmware upgrade of your networking equipment that went wrong, or even a DDoS attack on your services.
Most of the time you may want to hide the problem, but it is quite obvious to your customers that something is not functioning properly. That is the moment when the triage plan comes into play, if you have one.
Our plan in that situation would be:
- LET YOUR CUSTOMERS KNOW THERE IS A PROBLEM. BE OPEN ABOUT IT!
- on your message board, let your customers know about the problem and give them an ETA (if you have one)
- based on the metrics from your monitoring system, work out which services are really down or not performing at capacity
- at this point you should have a matrix showing which services can still run when a given component is shut down; it is the picture of your application running on crutches
- decide which services can be interrupted without bringing your whole system down (it might be email, notifications, reporting, real-time maps, etc.). Shut them down and let your customers know which services are affected.
- normally, if there is a dependency between services, the dependent services should detect when a component is not present (ping, heartbeat, traffic measurement, etc.)
- in your high-level app, some features should be grayed out or parts of the screen hidden, but everything else should work as expected
- call an all-hands-on-deck meeting (DevOps, network admins, developers) and start analyzing your logs and monitoring systems
- once you have found the problem and tested the fix in the staging environment, you may release it to production
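To make the matrix and the heartbeat steps more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the feature and service names, the `HeartbeatMonitor` class, the `triage` function and the 30-second timeout are hypothetical, not taken from a real system. It only shows the idea of mapping features to the services they need and graying out whatever depends on a silent service.

```python
import time

# Hypothetical matrix: feature -> services it needs (all names are illustrative).
FEATURE_DEPENDENCIES = {
    "login": {"auth", "database"},
    "orders": {"auth", "database", "queue"},
    "email_notifications": {"email", "queue"},
    "reports": {"database", "reporting"},
    "live_map": {"maps", "database"},
}


class HeartbeatMonitor:
    """Tracks the last heartbeat per service; a silent service counts as down."""

    def __init__(self, services, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the monitor can be tested without waiting
        now = clock()
        self.last_seen = {name: now for name in services}

    def beat(self, service):
        """Record that a heartbeat was just received from this service."""
        self.last_seen[service] = self.clock()

    def down_services(self):
        """Return the services whose last heartbeat is older than the timeout."""
        now = self.clock()
        return {
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        }


def triage(down):
    """Split features into those that still work and those to gray out."""
    working, degraded = [], []
    for feature, needs in FEATURE_DEPENDENCIES.items():
        # A feature is degraded as soon as one of its services is down.
        (degraded if needs & down else working).append(feature)
    return sorted(working), sorted(degraded)
```

With this made-up matrix, calling `triage({"email", "maps"})` leaves login, orders and reports working and marks email_notifications and live_map as the features to gray out. In a real system the set passed to `triage` would come from `down_services()` or from your monitoring metrics.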
This whole discussion assumes that the software is split into stand-alone services that can keep functioning when other components are shut down. The software design should start from the minimal running configuration and include “what-if” scenarios for when components may be down: email servers, message queues, third-party APIs, etc.
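One way to run these “what-if” scenarios at design time is a small check like the sketch below. It is only an illustration under assumed data: the core features, the dependency table and the component names are hypothetical, and the functions `what_if` and `design_violations` are made up for this example. The idea is that losing any optional component should never break the minimal running configuration.

```python
# Hypothetical design data: every name here is illustrative.
CORE_FEATURES = {"login", "orders"}  # the minimal running configuration
FEATURE_DEPENDENCIES = {
    "login": {"auth", "database"},
    "orders": {"auth", "database"},
    "email_notifications": {"email_server", "message_queue"},
    "shipping_quotes": {"third_party_api"},
}
# Components the system is expected to survive without.
OPTIONAL_COMPONENTS = {"email_server", "message_queue", "third_party_api"}


def what_if(down_component):
    """Return the core features that would break if this component went down."""
    return {
        feature
        for feature in CORE_FEATURES
        if down_component in FEATURE_DEPENDENCIES[feature]
    }


def design_violations():
    """Map each optional component to the core features it would take down."""
    report = {}
    for component in sorted(OPTIONAL_COMPONENTS):
        broken = what_if(component)
        if broken:
            report[component] = sorted(broken)
    return report
```

An empty result from `design_violations()` means the minimal configuration survives every single-component scenario listed. If a core feature later picks up a dependency on, say, the message queue, the check reports it, which is a cue to either decouple the feature or stop treating that component as optional.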
Summary: Include in your software design the cases where services are down, and prepare a matrix of what is acceptable for your system to run “on crutches”.