At a basic level, there are three main causes of cloud service failure:
1. Device and infrastructure failures
2. Software vulnerabilities
3. Human errors
If we anticipate these failures will invariably happen – that indeed they are a constant threat – we need to design cloud services so that when something does go wrong, the impact on customers is avoided or minimized.
Note that it doesn’t say this won’t happen, that this or that component won’t break, or that every process will be perfect and no one will ever make a mistake. Instead, anticipate that the failures WILL INVARIABLY happen.
Plan accordingly. Just like you do for everything in IT.
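To make "plan for failure" a bit more concrete, here is a minimal Python sketch of one common tactic: retry a failing dependency a few times, then degrade to a fallback instead of passing the error on to the customer. The name `call_with_fallback` and its parameters are hypothetical, made up for this illustration; a real service would likely add timeouts, backoff, and logging.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(
    primary: Callable[[], T],
    fallback: Callable[[], T],
    retries: int = 2,
    delay_s: float = 0.0,
) -> T:
    """Try `primary` up to retries + 1 times, then degrade to `fallback`.

    Hypothetical helper for illustration only.
    """
    for attempt in range(retries + 1):
        try:
            return primary()
        except Exception:
            # Assume the failure is transient; pause before the next attempt.
            if attempt < retries and delay_s > 0:
                time.sleep(delay_s)
    # Every attempt failed: serve a degraded answer rather than an error.
    return fallback()
```

A caller might wrap a live lookup with a cached or default answer, for example `call_with_fallback(fetch_live_price, lambda: last_known_price)`, where both names are placeholders.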
joe
The Netflix guys have a pretty good grasp on designing for failure, and their tech blog talks about some of the ways they do it. They also designed the Chaos Monkey to help test their designs with respect to handling component failures.
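The idea of testing a design against component failure can be sketched as a toy simulation in Python. This is not Netflix's Chaos Monkey or its API; `Replica`, `Service`, and `kill_random_replica` are hypothetical names invented here, and the "service" is just an in-memory list of replicas with simple failover.

```python
import random


class Replica:
    """One instance of a component; it can be switched off to simulate failure."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Service:
    """Routes each request to the first replica that responds."""

    def __init__(self, replicas: list) -> None:
        self.replicas = replicas

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("all replicas are down")


def kill_random_replica(service: Service, rng: random.Random) -> Replica:
    """Switch off one randomly chosen live replica and return it."""
    victim = rng.choice([r for r in service.replicas if r.alive])
    victim.alive = False
    return victim
```

A test would build a `Service` with a few replicas, call `kill_random_replica`, and assert that `service.handle(...)` still returns an answer. Once every replica has been killed, the call raises `RuntimeError`, which is the point where the design runs out of redundancy.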