Fault tolerance, graceful degradation - and failing in a not-so-painful way.
That’s a pair of Fetco CBS2131XTS-3Ls - they are our new coffee machines. They are massive and shiny - and a huge step up from our old Newco Machine. Why are we talking about coffee machines? Well, take a look at the screen that popped up on the one on the left a few weeks back…
“Internal error in file: qa_fifo in line: 81. For Service call: +1 (800) 338-2699”
One otherwise normal morning, the machine came up with this error - which seems not great. What does this mean for my morning coffee? DOES THIS MEAN NO COFFEE?!
After filling our cups from the second machine, we calmed down and did some research. Apparently we weren’t the first people to encounter this error message. The bit of code in question can be found here.
Turns out, this was a problem that popped up during the initialization process. The fix was - as it is in so very many cases - turning the machine off and on. As an aside, don’t you wish life was like that? Crappy day? Turn it off and on and start over! But I digress…
This, like our dear coffee maker, is not a graceful failure. (Image courtesy of mhm.hud.ac.uk)
The purpose behind this post is to start the conversation about failure – and how we should design technology to fail gracefully. Because, quite frankly, you need to be prepared for your design to fail. I know, I know — your code is perfect, your hardware choices impeccable and you are thorough in your assembly and review. But it happens to even the most well-designed and well-built projects.
This particular coffee machine failure is not what we would call graceful. The machine went wonky and it popped up with a fairly unhelpful error message - and then it didn’t work at all. No coffee brewing here, folks. If I want a cup of joe, I should call that number, I guess? Do they deliver it by drone?
Before we dive any further, an open letter to Fetco:
Dear Fetco, Seriously? Like...seriously? Sincerely, SparkFun
This isn’t just a normal coffee maker. It has an init file, for heaven’s sake. It’s also a very expensive coffee maker. We are sort of particular about our coffee and we drink a lot of the stuff. We think Fetco could have done a lot better. For example, an error like this one could have kicked the brewer into a “You don’t need the touchscreen - you just need caffeine” mode. If this is the promised land of the “Internet of Things,” I’m not sure I want to go there.
The “Red Ring of Death” (image courtesy of Wikipedia)
Unfortunately, clumsy failures are all over the place. Take, for example, the “Red Ring of Death” failure on the original Xbox 360 game console. While Microsoft never really admitted to what was causing the issue, third-party tests basically narrowed it down to an overheating problem centered on the graphics chip. The chip reportedly became so hot that its solder connections to the PCB failed. Whether it was really a problem with the chip or with the Xbox’s cooling system is debatable, but when it failed, it failed hard. There were all kinds of crazy fixes - including wrapping your Xbox in towels so it got so hot it re-soldered itself - but the issue effectively bricked your system. Maybe when the system started to overheat, it could’ve shut down - thus preserving your ability to “live to fight another day”?
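To make the “shut down before you cook yourself” idea concrete, here is a minimal sketch of a thermal guard. To be clear, this is not how the Xbox 360 firmware works; the function name, the action names and both temperature limits are made up for illustration. The one design choice worth noting is that a missing or garbage sensor reading is treated as the worst case, because a failsafe that trusts a broken sensor isn’t much of a failsafe.

```python
import math
from enum import Enum
from typing import Optional


class ThermalAction(Enum):
    RUN = "run"
    THROTTLE = "throttle"
    SHUT_DOWN = "shut_down"


# Hypothetical limits in degrees Celsius; real values depend on the hardware.
THROTTLE_AT_C = 85.0
SHUTDOWN_AT_C = 100.0


def thermal_action(temp_c: Optional[float]) -> ThermalAction:
    """Pick the least drastic action that still keeps the hardware safe."""
    # No reading (or a NaN) means we cannot prove the chip is cool,
    # so assume the worst and power down.
    if temp_c is None or math.isnan(temp_c):
        return ThermalAction.SHUT_DOWN
    if temp_c >= SHUTDOWN_AT_C:
        return ThermalAction.SHUT_DOWN
    if temp_c >= THROTTLE_AT_C:
        # Degrade first: slow down and shed heat before giving up entirely.
        return ThermalAction.THROTTLE
    return ThermalAction.RUN
```

The throttle step in the middle is the “graceful” part: the system gets slower before it gets dead.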
Designing something to fail gracefully - sometimes called “graceful degradation” or simply “fault tolerance” - means that, as the design fails, its core functionality remains usable or the error at least creates as little collateral damage as possible. Wouldn’t it be nice if, when the hard drive in your PC went kaput, it didn’t take all your data with it?
On the other hand, an example of something that fails gracefully is the “Transmission Control Protocol” (TCP), which allows reliable two-way communication in a packet-switched network, even when the communication links are jammed up. When packets go missing, TCP resends them and slows its sending rate instead of giving up on the connection.
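The general flavor of that behavior (try again, but back off a little more each time) is easy to sketch. The snippet below is a toy and is not TCP’s actual retransmission algorithm; send_with_retries, send_once and the default numbers are all hypothetical names and values chosen for illustration.

```python
import time
from typing import Callable


def send_with_retries(
    send_once: Callable[[], bool],
    max_attempts: int = 5,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send_once until it reports success, doubling the wait after each failure."""
    delay = base_delay_s
    for attempt in range(1, max_attempts + 1):
        if send_once():
            return True
        if attempt < max_attempts:
            # Back off so a congested link gets some breathing room.
            sleep(delay)
            delay *= 2
    # Out of attempts: report the failure to the caller instead of hanging forever.
    return False
```

Passing sleep in as a parameter is just a convenience so the function can be exercised without actually waiting around.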
Designing something to fail gracefully is often a case of what I call “design introspection.” Despite your prodigious skills as a coder of code and hacker of hacks, if something were to go wrong, what would it be? And let’s just put it this way - something is going to go wrong. Something will break. When it does, how do you want it to look and sound? What should it do? How should you “design for failure?”
It can be hard to look at a project objectively – especially when you’ve invested a whole bunch of time, effort and money in the build. But if you can identify the most likely culprits for a would-be failure, you can add redundancies, self-stabilization procedures (ooh, what’s that?) or failsafes that would make a potential failure – even a big/catastrophic one – not as bad as it otherwise would or could be.
What could possibly go wrong?
It’s also helpful (and, frankly, imperative) to identify your design’s “core function.” That is exactly what it sounds like – at its heart, what is this thing supposed to do? In the case of the coffee maker, that’s pretty easy - it’s supposed to brew coffee. Protect the core function! If your fancy device has a bunch of extra features (say, a touchscreen) that aren’t necessary to do its job (say, brew coffee - AHEM, Fetco!), then when those extras stop working for whatever reason, try to preserve the core function.
In the case of the coffee brewer gone rogue, what would this have looked like? The error itself was pretty minor – pretty much just a hiccup during boot. No one stuffed an old sock in the grinder or punched a hole through the screen in a caffeine-deficient rage. But it rendered the coffee pot unusable. In this specific case, it would’ve been ideal if there was a manual override or safe setting. Fetco failed to protect its product’s core function - and that stinks.
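Here is roughly what “protect the core function” could look like at boot time. This is a sketch under assumptions, not Fetco’s firmware: boot, BootResult and the subsystem names are invented for the example. The idea is simply that optional subsystems get to fail on their own without taking the brewer down with them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class BootResult:
    mode: str  # "full" when everything came up, "safe" when extras failed
    failed: List[str] = field(default_factory=list)


def boot(
    core_init: Callable[[], None],
    optional_inits: Dict[str, Callable[[], None]],
) -> BootResult:
    """Bring up the core function first, then the extras, tolerating failures in the extras."""
    # If the core itself cannot start, there is nothing left to protect,
    # so let that error propagate to the caller.
    core_init()

    failed: List[str] = []
    for name, init in optional_inits.items():
        try:
            init()
        except Exception:
            # An optional feature died: note it and keep going.
            failed.append(name)

    return BootResult(mode="safe" if failed else "full", failed=failed)
```

With something like this, a touchscreen that throws during init would land the machine in “safe” mode, with the failed subsystem listed so a brew button (or a service tech) can still get somewhere.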
No LCD screen – but also pretty darn reliable. Decisions, decisions… (Image courtesy of peak62.com)
With all of this in mind, we’re wondering - what hardware failures have you encountered that were decidedly ungraceful? What processes do you use to design your projects to fail with all the elegance of a ballerina?
We’d love to hear in the comments below! Now go forth - and fail well!