In Which We Are Not Having Fun

Arduino Day was pretty successful. Unfortunately, everything is broken.

Favorited Favorite 0

Hey everyone.

As you might know, we had a pretty good sale on a bunch Arduino products on Saturday.

Well, it turns out it may have been too good. We smashed our previous record for orders in a day, set on last year’s Cyber Monday. Back then the high water mark was around 4,000 orders, and on Saturday we saw almost 8,000 orders flood into the system.

What the previous record for orders in a day looked like
What the previous record for orders in a day looked like

It’s also worth noting that we’ve made some pretty big changes to our database in the past few months - most notably moving from MySQL to PostgreSQL.

The issue we’re seeing today has to do with how we know how much of a given product is available. Availability of a product is a loose term and has to take into account how many physical units of that product we have but also how much of those units are spoken for on active orders. Active is also a pretty loose term for an order that hasn’t shipped. All of these terms are necessarily loose to accommodate all of the edge cases common to volumes we regularly see.

We were pumped about the move to PostgreSQL for many features afforded, but primarily materialized views. Building such things to keep track of available stock values really sped things up!

Until this weekend. Apparently having an order of magnitude more active orders in the system makes refreshes on our materialized view for stock take a long time, and this has led to timeouts with heavily diversified orders. So far today it’s been a long haul of optimization attempts to make things hum along normally again. We’re still hammering away. It’s a technical problem in a big system, so there’s no such thing as a quick fix.

As we continue to work on this sparkfun.com will continue to have spotty down-time. We’re trying our best to minimize this while fixing the problem at hand, so thanks for being patient.

Update: 3:10PM

A missing index and some other optimizations have sped things up some. We’re back to everything functioning again, but we’re watching things very closely.

Also, to be more precise about the issue that plagued us: Like most of our back-end systems our warehouse system (called The Flow) was where the problem started. With so many new orders in the system the most important thing was to be able to ship them, and it’s a complex thing to have thousands of orders with intersecting items that can be meted out to pickers and packers roughly in the order they were placed but only if they are paid (unless they’re paying on credit terms) and only if their items are in stock enough that other orders aren’t claiming that same stock. It’s a fun problem that begets a lot of run-on sentences. There’s a massive query in that system to get orders based on even more special picking criteria and that query was locking up, causing refreshes on the materialized view to stall, causing further timeouts down the chain.

Were we just the users and not the builders of this system the problem might have never happened. Or it might have happened and been impossible to fix without a paid support contract. Impossible to say. Either way, spending the day fighting this has not earned ire from the rest of the SparkFun crew that was left waiting for the breakage to subside. Patience was what we received, along with coffee, liquor, and Easter Candy (in that order). For that we are grateful. =)

Update 4:10PM

And now Tim brought us a keg of Easy Street for our efforts. Maybe we should unintentionally break things more often…


Comments 41 comments

  • Big round of applause, thanks and encouragement to Chris, Mike, Steven, Nick, Casey, and the amazing men and women who run our site. It’s not always easy, but they are awesome.

  • Nick / last year * / 7

    To get more specific about the issue, we learned that in postgres, an “Access Exclusive Lock” on a table will block all future “Access Share Lock” requests, even when the exclusive lock has not yet been granted and there’s an existing Share Lock with long-running queries executing. We’re looking forward to PostgreSQL 9.4, where we’ll be able to utilize a new feature known as concurrent matview refreshes.

    We use a trigger to update our materialized view whenever a product quantity needs to change. Unfortunately, this could lead to the following scenario:

    Someone places / picks an order, which wants to update the quantity on 10 different products

    • We have a long query of some kind running, so when product #1 triggers a matview refresh, it requests an Access Exclusive Lock on the materialized view, but it’s not yet granted

    • Some time passes, and another random long-running query is enqueued after the matview refresh, along with hundreds of other normal queries (product views, logins, etc). These are all blocked from executing due to the Access Exclusive Lock being present, even though it is not granted

    • Long query #1 finishes, releases its share lock (~5 sec)

    • Matview refresh is granted the exclusive lock, is executed, and releases it in a timely manner (~150ms)

    • Long query #2 starts, joins/creates a share lock

    • In the meantime, our php backend fires off the stock change for product #2

    Rinse, repeat, timeout =(

    One update could cause future queries to hang for seconds, if not more. Stack multiple product quantity updates together from different sources, and we’d have periods of mass timeouts.

    Casey was able to optimize one of the long running queries to prevent the above from causing timeouts, however we’re reworking things and optimizing our postgres locks to prevent this from ever happening in the future.

    • You have given me a timely example to give to one of my classes today. This is an example of a synchronization problem we call the Readers-Writers problem ( for a quick explanation, see: http://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem ).

      In this case, the Exclusive Lock is giving priority to the writers, so the quantities in the system can be changed.

      Thanks!

  • It turns out that it helps to have indexes on things.

    We should be up and running at this point. Maybe. Probably.

    • Do you have some SQL tools that will allow you to analyze your queries? If so, ensure that your commonly executed queries are indeed making use of your indices. Sometimes a database’s query optimizer has its own ideas, and sometimes we humans know better!

      • PostgreSQL has a query analyzer that’s quite explicit in its output. In our case the locking query’s analysis is about 400 lines long, so it took some time to parse through it.

        • Database optimization was the most convoluted, complex course I had in college. It is more mathematical than you could imagine it to be. There are people who make a living optimizing databases for this very reason- lost productivity. I would like to suggest a simpler query on the update. 5 seconds of lock is very very very high. Sounds like a Database Manager position is warranted.

          And just to help you guys out, take your time on my order, no pressure :)

  • Thanks so much Sparkfun. I am one of those that enjoy the challenge that you guys created with this kind of a sale. And I think you guys did a great job responding to the self induced problems. Thanks for the sale, and I can’t wait to get my new arduino’s.

  • You guys continue to be rock stars! An amazing day, great work!

  • Best wishes for your Sysadmins and DBAs… (it’s most of my dayjob too)

  • What a great position to be in… overwhelmed with orders. Good luck catching up on things. Thanks for the awesome sale.

  • Any updates?

    • We’re still getting through back orders, but we are getting things processed as quickly as possible! If you have questions about your particular order, please contact customerservice@sparkfun and they’ll be able to give you the most up-to-date information.

  • All I can think to say is Thanks!

  • How do I send a case of beer or other beverage to you folks? Can I do it over PayPal? Or should I just buy more stuff?

  • Obviously you are scrambling now to meet all of the orders. Do you have an estimate of how long it will be before the backorder queue will be eliminated? I ordered some 5v Arduino Pro Minis (in my case, order id 933786), and my order is flagged as an exception. I was just wondering if the wait time will be days, weeks, or months?

  • When you can, an idea on when you expect to be caught up would be lovely, but I suspect all of us can understand it being “It’ll ship when it ships” today :) Good to help you debug things, I’ve been there too, glad you have good co-workers :)

  • You see sparkfun, now you must post new pics of the new sale shipments. :) good job! Can’t wait for my 4 pro minis, 2 redboards, and arduino uno for my schools club!!!

  • Can’t wait for my new arduinos and start teaching my daughter all about the fine arts of creating / inventing!

  • Another un-intended side effect of the sale. Because I chose to ship my order all at once when avaliable, You system placed a charge of $0.00 on my credit card which caused my bank to lock down my card until I called them (I found out after I went to the grocery store to pick up a few things Luckily I had cash!)

    I was surprised. Obviously they didn’t look at my card history or they would have seen dozens of other orders from sparkfun in the past. It was actually funny, they guy who answered the phone said he uses Sparkfun as well! Still worth it for all the arduinos I ordered lol

    • I had a similar problem… Only without the cardholder locking my account (whew… Or maybe hmmmm I may need to switch banks…) Luckily I was able to pay the balance due on my credit card today and get them backordered!

  • Wow, how many back orders did you get on the arduinos?

  • I placed an order Saturday evening around 10pm. Aside from my general disappointment of non awareness of the impending sale, and having paid full price for a red board (life Goes on, they say) will these technical issues be delaying the shipment of my order? Expected to see an email yesterday but didn’t.

    You guys are awesome, btw. Much sympathy in your time of need

    • will these technical issues be delaying the shipment of my order?

      In your case, yes, probably. You got an order in while we still had stock of the RedBoard un-allocated to orders, and our shipping department was unable to work for a good chunk of yesterday. This means at least a little delay for every shippable order in the system, but they’re churning through ‘em pretty fast now that things are up and running.

      For folks who have backorders of Arduino day stuff, this shouldn’t make much difference. The blocker to those orders shipping is mostly just buying, building, and receiving all that stuff.

      Did I mention we’re also about to do fulfillment for another super-successful KickStarter? It’s been an interesting year so far.

  • I hope you’re using stored procedures (or are very militant about execution order in SQL logic) or you’re bound to have some nasty deadlocks.

  • What is the average number of orders made each day? Anything near the previous record of 4,000?

    • A good day is around 800 orders or so. Thus an 8,000 order day is literally an order of magnitude greater.

  • Which box was my order?

  • That picture does not look like 4000 orders. More like 1000 or less. Ah ha! Research reveals there are no USPS packages in the picture. I want to see the picture of everything from Saturday that is going out (even though my stuff is on backorder). That would be cool.

  • Go Go Gadget Database!! [no idea why I thought that, but because I did I wrote it]

  • Thank you for giving me the opportunity to help “stress test” the new system! :-)

    I’m looking forward to seeing what you do to get the one-day order record up to 16,000. ;-)

    • I too await a chance to stress test your system to OVER 16,000 items. Wait, how many items are in a one-day sale that will still fit inside you new diggs?
      I remember one company I worked for who’s first month sales was 3 times the forecast sales and we used our pickup trucks to pickup supplies until the distributors bought a new truck! Glad to hear it was a hit, even thought something hit the fan. signed a big COM-11718. :-)

  • Performance optimization is a great way to get to know your system in new ways. Glad we were able to help ;)

  • Just Think if the website was done in FrontPage :)

    • Absolutely true story: When I started working here, product descriptions were still frequently written in FrontPage. Just before I started working here, I’m pretty sure they were written in FrontPage and pasted into an Access database.

      • at least it was not done in PowerPoint for the Product descriptions, I kid you not I was asked to look at a Local Company’s website back in the day, and they had their Son design their website. All done in PowerPoint saved as a HTML files.

  • Reminds me of the day we were doing installer testing at work, and we brought the local network down (nearly caused hundreds of thousands of automated regressions to fail prematurely).