Designing for Graceful Failure

Fault tolerance, graceful degradation - and failing in a not-so-painful way.

Favorited Favorite 0

alt text

That’s a pair of Fetco CBS2131XTS-3Ls - they are our new coffee machines. They are massive and shiny - and a huge step up from our old Newco Machine. Why are we talking about coffee machines? Well, take a look at the screen that popped up on the one on the left a few weeks back…

alt text

“Internal error in file: qa_fifo in line: 81. For Service call: +1 (800) 338-2699”

One otherwise normal morning, the machine came up with this error - which seems not great. What does this mean for my morning coffee? DOES THIS MEAN NO COFFEE?!

After filling our cups from the second machine, we calmed down and did some research. Apparently we weren’t the first people to encounter this error message. The bit of code in question can be found here.

Turns out, this was a problem that popped up during the initialization process. The fix was - as it is in so very many cases - turning the machine off and on. As an aside, don’t you wish life was like that? Crappy day? Turn it off and on and start over! But I digress…

alt text

This, like our dear coffee maker, is not a graceful failure. (Image courtesy of mhm.hud.ac.uk)

The purpose behind this post is to start the conversation about failure – and how we should design technology to fail gracefully. Because, quite frankly, you need to be prepared for your design to fail. I know, I know — your code is perfect, your hardware choices impeccable and you are thorough in your assembly and review. But it happens to even the most well-designed and well-built projects.

This particular coffee machine failure is not what we would call graceful. The machine went wonky and it popped up with a fairly unhelpful error message - and then it didn’t work at all. No coffee brewing here, folks. If I want a cup of joe, I should call that number, I guess? Do they deliver it by drone?

Before we dive any further, an open letter to Fetco:

Dear Fetco,

Seriously? Like...seriously? 

Sincerely,
SparkFun

This isn’t just a normal coffee maker. It has an init file, for heaven’s sake. It’s also a very expensive coffee maker. We are sort of particular about our coffee and we drink a lot of the stuff. We think Fetco could have done a lot better. For example, if an error like this one kicked the brewer into a “You don’t need the touchscreen - you just need caffeine” mode. If this is the promised land of the “Internet of Things, I’m not sure I want to go there.

alt text

The “Red Ring of Death” (image courtesy of Wikipedia)

Unfortunately, clumsy failures are all over the place. For example, the “Red Ring of Death” failure on the original Xbox 360 game console. While Microsoft never really admitted to what was causing the issue, third-party tests basically narrowed it down to an overheating problem caused by the graphics chip. The unit became so hot that it de-soldered itself from the PCB. Whether it was really a problem with the chip or with the Xbox’s cooling system is debatable, but when it failed, it failed hard. There were all kinds of crazy fixes - including wrapping your Xbox in towels so it got so hot it re-soldered itself - but the issue effectively bricked your system. Maybe when the system started to overheat, it could’ve shut down - thus preserving your ability to “live to fight another day?”

Designing something to fail gracefully - sometimes called “graceful degradation” or simply “fault tolerance” - means that, as the design fails, its core functionality remains useable or the error at least creates as little collateral damage as possible. Wouldn’t it be nice if when the hard drive in your PC went kaput, it didn’t take all your data with it?

On the other hand, an example of something that fails gracefully relates to the “Transmission Control Protocol,” which allows reliable two-way communication in a packet-switched network, even when the communication links are jammed up.

alt text

Jammed!

Designing something to fail gracefully is often a case of what I call “design introspection.” Despite your prodigious skills as a coder of code and hacker of hacks, if something were to go wrong, what would it be? And let’s just put it this way - something is going to go wrong. Something will break. When it does, how do you want it to look and sound? What should it do? How should you “design for failure?”

It can be hard to look at a project objectively – especially when you’ve invested a whole bunch of time, effort and money in the build. But if you can identify the most likely culprits for a would-be failure, you can add redundancies, self-stabilization procedures (ooh, what’s that?) or failsafes that would make a potential failure – even a big/catastrophic one – not as bad as it otherwise would or could be.

alt text

What could possibly go wrong?

It’s also helpful (and, frankly, imperative) to identify your design’s “core function.” Which is exactly what it sounds like – at its heart, what is this thing supposed to do? In the case of the coffee maker, that’s pretty easy - it’s supposed to brew coffee. Protect the core function! If your fancy device has a bunch of extra features (say, a touchscreen), that aren’t necessary to do its job (say, brew coffee - AHEM, Fetco!) then if these extraneous things stop working for whatever reason, try to preserve the core function.

In the case of the coffee brewer gone rogue, what would this have looked like? The error itself was pretty minor – pretty much just a hiccup during boot. No one stuffed an old sock in the grinder or punched a hole through the screen in a caffeine-deficient rage. But it rendered the coffee pot unusable. In this specific case, it would’ve been ideal if there was a manual override or safe-setting. Fetco failed to protect its products core function - and that stinks.

alt text

No LCD screen – but also pretty darn reliable. Decisions, decisions… (Image courtesy of peak62.com)

With all of this in mind, we’re wondering - what hardware failures have you encountered that were decidedly ungraceful? What processes do you use to design your projects to fail with all the elegance of a ballerina?

We’d love to hear in the comments below! Now go forth - and fail well!


Comments 45 comments

  • Sooooooo no one called the number? It gave you one job, and no one called the number?

    • Why call the number, wait three days for service, and pay $500 when a power cycle fixes the issue? The whole point is the unit failed clumsily. It could have rebuilt the file. Or, if it was an unrecoverable error, wait until the number of failures crosses a certain threshold before waving the white flag. Increment a counter stored in non-volatile memory and power itself off. Once counter surpasses five, show the error. Since a power cycle fixed the issue, the counter is cleared, and no message would have been shown. It’s called UX (user experience) design.

      We’re engineers here. We don’t take clumsy failures very well. If this had happened to me, I would have found it very difficult to not attribute it to a deliberate attempt by the manufacturer to drum up unnecessary service work.

  • Try a kitchen range hood. Some idiot designed what could be done using 6 transistors, an LCD clock and a handful of other discretes (3 latches, basically) with a little MCU instead. For a two-speed range hood fan and light with a clock. Go to turn it on while cooking dinner; nothing. Clock frozen, steam condensing on the walls and ceiling. I had to go toggle the kitchen breaker because they left out a reset button.

    That’s rather unacceptable. Ask yourself while designing two additional questions: Can I X? and: Should I X?

    Yes, you CAN use an MCU for a range hood. No, you probably SHOULDN’T.

    Design perfection is not when there is nothing left to add, it’s instead when there is nothing left to take away.

    • That’s exactly what the watchdog timer in the MCU would take care of, but the person programming the unit has to use that feature in order for it to work. Silly programmer!

  • Side note: I wrote a guide to fix the Xbox 360’s RRoD:

    http://www.instructables.com/id/Fix-the-Red-Ring-of-Deathwithout-towels/

    It’s the top result on Google Search and about to break 3.6M views. My biggest achievement on the Internet.

    • Oh that’s awesome! When I was looking for the image for this post, I came across your tutorial - should have known it was a SparkFunner! :) Nice work!

  • I have seen a microwave oven that runs the fan for a little while after use, presumably to cool down the magnetron and transformer. However, said fan is on the same circuit as the light, so the microwave appears to still be running to someone not used to that model. (Albeit with the timer displaying “0”)

    One failure mode I have seen on microwaves in the past is that the relay, or it’s driving signal, became stuck on, resulting in the microwave “cooking” constantly whenever the door was shut. With a normal microwave, this would be quickly noticed because the fan and light would also go on when you shut the door.

    But if you became accustomed to the “fan cooldown mode” the microwave would APPEAR TO BE WORKING NORMALLY EVEN THOUGH IT IS SUFFERING A CATASTROPHIC FAILURE! In other words, you could be standing right in the kitchen, look directly at the runaway microwave, and then think it has stopped heating and walk away!

  • Abort, Retry, Fail?

  • Wow, that’s kind of sad. This is also why I got the cheap coffee maker with a switch and neon lamp instead of one with a bunch of buttons, not only because the extra $10 was good for buying coffee.

    I was just going through this kind of rant in my head this morning, due to an overpriced screensaver that likes to “crash” to desktop with an unhelpful error message about something wrong with the 3D engine and then sit there… not restarting because as far as Windows knew, it was still running. When I was asked to set it up to run constantly for public display, the first thing I did was say “NOOOooo it’s going to throw error messages on your screen and make your place part of an ongoing bad joke” but they insisted. The next thing I did was to insist on a laptop for power saving considerations, and I found an i3 on eBay that wasn’t entirely presentable, but cheap… it turned out to be an i5 because someone goofed, so that’s neat. Then I installed Tiny7 to avoid all the helpful things that Windows would otherwise try to do or send up on the screen, to have a nice bare minimum of background tasks, and to make reboots predictably and consistently fast. When it was all set up, I disabled the wifi and left the ethernet unplugged, just in case someone was clever and malicious enough to get in, they still wouldn’t be able to deface the “sign”. I hoped, I really expected that I prevented all the Error’d sort of failures. I made a little happy bubble where stuff would Just Work and it wasn’t enough, because the main event, the centerpiece, is a piece of junk.

    After it started “crashing” I installed AutoIt to run in the background, waiting for that message box in order to close it. Eventually the AutoIt scripts grew to include several other messages including the error reporting form which, again, was actually part of the screensaver program and so would block Windows from the core function, preventing screen burn. I also had to schedule the OS to reboot itself every night because occasionally the screensaver would stop animating– it actually freezes without crashing or even throwing an error and the background sound it makes just keeps on cheerfully playing. Clearly, this was not designed or tested to run for more than hours at a time.

    Anyway, I am happy about AutoIt and I had practically gotten over the whole mess when it came up with a new message, just recently after a power failure:

    Your settings file has been corrupted, starting with defaults [ OK ]

    No, it’s not OK. IMO, any screensaver that waits for user input, while not animating and with no timeout, is failing hard. So, in the infinite loop of the primary AutoIt script, I added a line that would overwrite the settings.txt file from a ‘master’ before launching the program. We’ll see how long that lasts.

    <tinfoilhat>Side note, IoT is a plot to make every mundane aspect of daily life susceptible to EMP</tinfoilhat>

  • Ungraceful hardware failures? Okay…

    1) Honda Accord 2001 remote control, when battery is low, would send open/close car signals repeatedly instead of just not working, so you can’t really keep your car closed, even with the actual key. This just kept closing and opening car doors while you are near it, until we replaced the battery (the actual PCB inside looked just fine).

    2) VW Jetta A4 2002 automatic transmission would just stick in 1st gear or fail to switch without any warning or check engine light. How to fix it? Turn off the car and then turn it on again (a common transmission glitch for the model, apparently), which happens to reset AT chip and solve the issue.

  • Ford Sync. Try and update that thing. JUST TRY.

    • I broke something similar. Family road trip, tried to do voice commands while an iPod was plugged in playing music, and it would crash the entire windows OS on the car. (this included the displays and gauges for the driver).

    • I’ve done it twice now. Nothing like letting a 6.2L V8 idle in the driveway for an hour+ waiting for updates to install. The last update was worth it though—it solved most of the major issues with the system.

  • Chemex. ‘Nuff said.

  • I have to admit, AMD does graceful failure unexpectedly well. I had the misfortune of using a computer where the CPU fan was disconnected from the header on the motherboard. Either the wire wandered it’s way too close to the fan and it got yanked -or- a certain person forgot to plug it in.

    I used the computer for months that way - even playing games on it - and it worked fine. The only thing that indicated something was wrong is the system would unexpectedly slow down when the processor got near it’s thermal threshold. Once I corrected the problem, it was fine. Apparently someone at AMD knows all too well how good those ‘ball bearing’ cpu fans work.

  • For a consumer device, I would say a “good way to fail” would be: 1. In large type, put a very simplistic description of the problem. For example, in this case, it could be akin to “Sorry, the internal computer froze”. Or, if the computer itself thinks there’s an issue with the temperature sensor or some other part, it could say “Uh oh! The temperature sensor is busted!” 2. Don’t just say “Call (customer service)” for every error, but instead, give some “user maintenance” instructions, if applicable, (“Please turn the machine off and then back on” for a kernel panic) and then something like “If the problem occurs frequently, please contact (customer service)” 3. Put the nitty-gritty details below that message in smaller print.

    A good example of varying amounts of “fault tolerance” would be vending machines and arcade games. Protecting the core function would mean greater profits for the owner. However, if the core function is impacted directly by the failure, a better (but likely more expensive) design would detect such a fault and indicate it to users somehow (display “OUT OF ORDER” or “Err” on the display, turn off the machine’s lights, etc) Although the technology to implement such things has existed for several years now, vending machines are usually built to last and/or be repaired, so many “dumb” machines are still in active duty.

  • Here at Netflix thinking about and handling failure is a big part of what my team works on. I work on the API server (the front door for all devices) and our servers have to talk with 1000s of other servers. The probability that something is going wrong is pretty close to 1. One of the trickiest failure scenarios is when something doesn’t fail but just gets slow. It will quickly eat up all the CPU resources computing something that might not be necessary to just getting what the user wants done.

    We’ve written an OSS library to wrap code that might fail called Hystrix https://github.com/Netflix/Hystrix.

  • Super interesting thread!

    I would say the machine failed in a safe way. It did not start pouring hot water onto the floor, or at a person, it did not cause a fire in the coffee grinds, and it did not injure anyone. To me, fault tolerance is primarily about failing safely.

    • Very good point. Failing safely is the most important thing. After that is making my coffee ;).

      • I know nuclear weapons have many fail safes especially when it comes to component failures they are designed so that other parts will fail far before the others.

        How I manage to remember such things from long ago is beyond me, yet I can’t even remember someone’s birthday.

  • In my experience as a software developer, it’s not just about expecting failures and failing gracefully but also making sure the appropriate details of the failure is provided to the consumer. When I say “consumer”, I don’t necessarily mean the end-user of a device, either. A consumer can be a developer who utilizes a third-party API that may in turn use another API internally. Handling all of the possible error messages and encapsulating them appropriately for the consumer of the interface is important.

    In this case, that error message is appropriate for a technician, not the end-user of the device. The appropriate message to present to the end-user would be “Failure during initializing; please try again. If problem persists, call XXX-XXX-XXXX for service.” The details of the error can be made available via a secondary screen/button or a serial console.

    Now, consider how confounded you might be if the error returned was not formatted for a technician, either. What if the message displayed was the raw error returned by the kernel’s write() system call? “Error 32” Imagine how maddening that would be, both for the end-user as well as the technician and/or engineer whom have to debug the issue. The unfortunate truth is that sort of thing happens far more often than you think.

  • Here’s a plug for ‘Robust-First Computing’ (15 min video), which offers a big-picture view of such issues, though slanted more towards software.

  • Unfortunately, with the complexity of today’s systems one can’t always plan for failing gracefully. Even when a designer or coder plans for what they think is all possibilities, some gremlin always manages to pop up. Even a graceful failure in one part of a system may result in some ungraceful crash-and-burn in another part, particularly when the parts are being designed by different teams for later integration. Even with design standards, wrinkles occur. (Standards are great things! There are so many to choose from!)

    One thing I was taught (repeatedly!) in programming classes years ago (think FORTRAN and COBOL if you want to know how long ago) was to always make meaningful error messages (no “ERROR IJT225I in line 1234”). Even if the meaning wasn’t 100% clear to the user, at least it was more descriptive and made troubleshooting easier. Your coffeemaker is a good example. The error could have included “Try cycling the power” as part of its text. But engineers and programmers are usually looking for ways to make their jobs easier, and it’s an easy trap to fall into.

    And by the way… “YOU’RE” is a contraction for “you are” and “YOUR” is a possessive. Spell-check’s great but it misses things like “…preserving you’re ability to live…”.

  • OH I understand, it’s a generated code by quantum leap software. After looking at the product page and a video, I understand why they use a finite machine. I wonder if it is Fetco or quantum leap fault? Fetco not able to program and test the product OR quantum leap not generating the proper code?

    At least there is a SD card to update the firmware!!

  • Sparkfun, Seriously…like seriously. Reboot, get your coffee and move on. If the megadollar Fetco fails, have a percolator coffee pot on standby.

    Rant Complete.

    Sorry, haven’t had coffee yet today :)

  • …also missing an end quote above the RROD pic.

    So, I know quadcopters are old news, but I was thinking about placing IR reflection sensors under my propellers to quickly detect the failure of prop saver rubber bands. Although now that I think about it, keeping an eye on the current would be more reliable. It’s interesting the conversations about why or why not to build in failure detection. I guess if you’re sending people into space or building medical devices you think about that kind of thing. If you’re making… whatever that stuff we buy over and over again is called… hmmm. I guess it makes more money if it breaks?

  • In terms of fault tolerance, error recovery, and just about every other aspect of a design, I ask myself: “Does it matter to the customer?”

    Does the customer care how fancy my algorithm is? Does the customer care about some obscure error code, or should I just do something reasonable and keep going? Does the customer really care about a bunch of detailed status messages? Did I leave my fascinating debug code in place that stops a device working and pisses my customer off?

  • Sound like a meme to me" I don’t always fail but when I do, I do it in style" Does anyone remember the 404 frog ?

  • One thing that I’ve learned from the AVC is that the more complicated you make your bot and the more special cases you put in to make it handle different problems, the more likely it is to catastrophically fail. Robots and coffee makers need to do the basics and do them very well before they try to do anything at all complex.

  • Awesome article and spot on for my current project. I am using one of your sensors to count the number of folks using one of our local parks here in Raleigh. Everything works perfectly in my apartment but, install it in the park where the AT&T signal strength is a little weaker and…failure. I have spent the last month finding the latest way the device could fail to send its data wirelessly - fail to get GSM, fail to get GRPS, fail to get an IP address, fail to connect to Ubidots (my IOT provider)…. By definition there is only one way to succeed, but there seems to be endless ways to fail. And yet, there is a great deal of satisfaction each time I find an answer to the latest failure. And I am hopeful that there is a happy ending to this story - someday….

  • Remember that joke about a car with brake trouble & Bill Gates' suggestion was to push it back to the top and see if it fails the same way again?!

  • I’m at the SANS Cyber Defense Initiative in DC right now taking a class on Industrial Control System cyber security and it’s filled with these kinds of stories. Anyone see the wifi enabled crock pot with no (ok…limited) physical user interface? http://www.crock-pot.com/slow-cookers/wemo-enabled-smart-slow-cooker/SCCPWM600-V1.html

  • I know it sounds simple, but a well thought out power-up routine that checks all sensors and returns actuators to a safe condition is critical. I often see electro-mech devices that require manual manipulation after a failure, better to let the machine fix itself then us clumsy humans. BTW, I recommend reading any books by Henry Petroski, for insights on engineering failures and their root causes.

  • For more than 20 years, I wrote highly specialized compilers. I found that I spent about 30% of my time making sure that the compiler would give correct output when given correct input, and about 70% of my time making sure that the compiler would act gracefully when incorrect input was encountered. The ultimate test of this was, back in the days when mailtools used “flat” files, to feed someone’s saved e-mails into it and make sure it didn’t crash. (Without command line options to the contrary, it would give up after 100 errors were encountered.) It could be amusing to see it try to make sense of something like “Hey, George, do you want do do lunch today?” as a programming language statement.

    When I was in college, we used something called “Pascal 8000”, which the University had obtained for the cost of a tape from the Australian Atomic Energy Commission. Although we never found a case when it would generate bad code given a correct program, if it encountered a coding error in the input code, it would usually crash and give a traceback of the code. I did find one instance where the best I could figure was that it caused the big mainframe computer (an IBM 360) to crash due to a missing semicolon in my code. (Can you say “not graceful response”?)

    All of this leads me to my saying: “I don’t trust computers. I design them. I know better than to trust them!”

  • Another thing to keep in mind for graceful failure is what feedback your potential customers might appreciate. Being SparkFun, you guys can make sense of something like an error on a specific code line but other people probably can’t. Just how much info do you want to share with people about the inner workings of your widget?

  • That bridge that collapsed in the moth man prophecies was very inelegant.

  • One common example to think about is what your system should do when a memory leak causes it to run out of memory. Pride says that the system should not crash. But frankly, if it crashes and reboots (in <1min) every hour, that’s PROBABLY a lot better than just sitting there indicating a “no memory” error.

  • Sadly, I’d say that was a “fairly” graceful failure. At least it provided some information (as opposed to the display being there as a way to display the current time). At least you had some information to use in order to resolve the situation. IMHO (as a consumer and a programmer) graceful failure is an option that far too many companies quickly remove due to profitability issues (it takes $$$ to anticipate and code for a graceful failure, which reduces profit). In addition, there is also the issue (that you’ve covered before) of “where do we stop and call it a sellable product”. Also, if enough people complain, there is always the “new and improved” next version. Now, with that said, that same corporate attitude does keep me employed ;).

  • Imho, that could be a serius internal bug and something on which many parts of the system rely. In that case you can’t just go to a safe-mode as the system is “unstable”, like in a kernel panic. In that case the only thing you can do is a) reboot automatically b) show some debug info to help identify the issue

    I prefer (b) because i don’t want to hide issues (and if the thing reboots in the middle of something it’s not good at all). What i would do is probably write something like “Sorry, something went very wrong: please restart the machine. If you could call our support team and provide this data to help fix the issue it would be much appreciated”. Well, okay, probably not exactly like this, but you get the idea :)

    The only problem i see is this happend for a while without getting solved (according to the post).

  • I hope you guys provided steps to reproduce and sent it to the manufacturer so they could investigate and fix it (if they haven’t already) in the next version.

    As for a hardware failure of mine… I have had too many to count. The two significant ones is:

    1) A GTX 280 OC edition video card. It kept crashing my computer after anything video intensive started. After a few of these crashes, I brought it back to factory settings on clocking and things went well for a while, and then I started getting GDDR4 failures. Now, for those that have never dealt with any type of major DDR memory failures before, it is hard to pinpoint due to the randomness that is DRAM. So the only way to pinpoint the problem was to get a video game that required low CPU and high GPU functionality. Luckily, one of the games I tried was Command & Conquer: Generals, where it would display the problem but not crash. My fix? Recycle the card and get a new one.

    2) For my senior design project I used a Freescale S12XUF MCU. This chip focused mostly on storage systems and some GPIO (which I needed to make writing to an SDcard easier). After 2 months of trying to talk to it and program it, I contacted Freescale engineers to see what is going on. After about 3 days of discussion, they came back and said that the locking byte (to protect the chip from reverse engineering code off of it) was basically set automatically anytime you tried to communicate with it. What does this mean? It bricked the chip if you lose power/connection to it and try to establish comms again. They offered me a 6 replacements for free (with expedited shipping), but by then it would be too late.

  • I have a bit of insight on the RROD xbox failures. It was most likely due to poor solder quality on the BGA device, a phenomenon known as head-in-pillow, where the solder balls do not wet properly into the pads and you get two separate snowman-like solder balls that are electrically connected until heated or shocked where they break apart and will intermittently fail. Credibility: I’m a test engineer, and I had to deal with this manufacturing defect in a product last year.

    • you could potentially reflow the devices and try to get the solder balls to re-wet, but since they use lead-free solder, it’s unlikely that they will form one ball. At manufacturing, oxide layers can build up on the solder balls. Out solution was to change the heating profile and use Nitrogen in the oven to reduce oxidation. or something like that - I’m not a chemist.

    • The difficulty with such a manufacturing failure is you don’t exhaust resources to try and design fail-safes for unforeseen failure modes, you do the best you can at low volumes to make your yield as high as possible. Head-in-pillow is an incredibly difficult problem to find, and an incredibly difficult problem to completely fix. The best we can do is AT LEAST run a system check to make sure all your peripherals respond, and if something doesn’t, throw an error or blink an LED code. It’s much more helpful than just having the thing not turn on at all.

Related Posts

Recent Posts

Tags


All Tags