On Breaking Things


tl;dr: SparkFun was down for a while last night, while we worked on some database stuff. We should get better at uptime, but right now we're mostly trying to get better at databases. This morning we realized we'd broken currency conversion, and thus the display of prices, on the site. Last night's downtime message / web nerd musings and some post-mortem on what we messed up follow the llama, for those interested.

[Image: llama]

So tonight, sparkfun.com's going to be down for a while. We're hoping for no more than an hour or two, although it could very well run longer.

This is where I concede that we could really be better at keeping the site up all the time than we are right now. We're working on that bit, but right now what I want to ramble about instead is databases.

Since approximately the dawn of time (which is to say since some time last decade, before I worked here), SparkFun has run the core of its business on MySQL (and recently MariaDB). When I got hired in the fall of 2007, a lot of the important code was still split between a hacked-up osCommerce installation and a pile of MS Access clients. That stuff has been gone for years now, but the database itself persists. In places it's practically a fossil record of offhand design decisions made by other people writing a shopping cart some time in 2003.

This probably happens everywhere. Application code may change drastically, but in a working shop, the data store has a way of becoming the scaffolding around which (for better or worse) everything is built and must be maintained.

As it happens, when we first realized that we needed a better toolset than not-very-object-oriented old-school PHP, we made some decisions with long-term consequences:

  • We decided to keep using PHP instead of rewriting everything in Ruby.
  • We decided to write our own lightweight web application framework instead of using existing PHP projects.
  • We decided to write our own ORM layer instead of using existing libraries.

A lot of things informed those decisions: We didn't want to have to replace all the working software at once just to use better tooling for new code. Everyone on our team already knew PHP. The frameworks we looked at seemed lacking in comparison to the stuff we'd used in languages like Ruby and Perl. The available PHP ORM libraries were either heavier than we wanted or built on assumptions that didn't fit our existing (frequently terrible) schema. Finally, rolling our own sounded kind of interesting, and we had no idea what we were getting ourselves into.

(Ok, so in the interest of full disclosure, what happened is that I talked louder than people who knew what we were getting ourselves into.)

Fast-forward a couple of years during which we hire more programmers and SparkFun gets another half million orders. We're sitting around in the mountains drinking beer and talking about what we really should get better at and the subject of PostgreSQL comes up.

See, MySQL and its forks, for all of the really huge and amazing stuff built on them in the last decade, have got some drawbacks. On this point, I risk triggering a lot of strong feelings from the sort of people who have strong feelings about databases, so for the time being, let's just say that it started to seem like we'd have a better time writing business software on a database that embodied a more rigorous approach to formal types and relationships. And because of years of our nerd friends telling us we should probably just use PostgreSQL, it kind of seemed like PostgreSQL might be that database, especially since we already knew we were going to use something with open code.

So we had some more beers and we talked about it and we had some more beers and we talked about it some more, and maybe six months later we finally decided to go for it.

So that's what we're doing.

Sort of.

It turns out that, when you write your own ad hoc ORM layer against a piece of software like MySQL without the faintest idea of how far you're about to be in over your head, it's pretty easy to make a bunch of assumptions without even noticing them. Rumsfeldian unknown-unknowns, if you will, creep in at every turn. Like fish and the concept of water, people who work with a given data store for a long time begin to forget that certain abstractions are even objects of possible consideration. These bits of data are represented as strings in PHP? Well, that's natural, right? What else would they be? String comparisons in your database of choice are all case-insensitive? Well, heck, that's almost a feature, right?

So then you decide you're going to point things at a different database and you get a crash course in all of the stuff you were wrong about, and all of the things you made impossible to test, and all of the ways your implicit assumptions about the two or three or five different type systems you rely on across languages and representations of your model are actually pretty fractured and subtly wrong.

So what we're doing tonight is rearranging a bunch of MySQL schema and deploying a terrifyingly long list of code changes that get us about three quarters of the way to being able to actually migrate to PostgreSQL without all of our software exploding continuously. We sincerely hope. (As regards exploding software, I also have a lot of recently developed thoughts about how I'll never again write a large project in a dynamic language without just tons and tons of unit tests.)

I should really write something coherent about all of this, but I just realized we actually need to deploy the code and such now.

Anyway, thanks for bearing with us while the site's down. The colored wires and blinkenlights and such will be for sale again soon.


Postmortem: Ok, so deploying all that stuff went way more smoothly than expected. Then this morning I woke up to discover that we’d been showing customers who logged in after midnight MDT a price of $0.00 (or its equivalent in your favorite currency) for most products.

This is sort of an infuriating bug, because it comes down to a single line of code I personally missed changing while making a few thousand really mechanical, straightforward tweaks. Here’s what that actually looks like:

diff --git a/services/update_currencies.php b/services/update_currencies.php
index 8c59226..a504378 100644
--- a/services/update_currencies.php
+++ b/services/update_currencies.php
@@ -135,7 +135,7 @@ class ExchangeRateParser {
     }

     if ($row=CurrencySaurus::findByCurrencyCode($this->code)){
-      $row->value = self::$dbi->escape($this->multiplier);
+      $row->value = $this->multiplier;
       $row->last_updated = $date;
       print "updating record for {$this->code}: {$this->multiplier}...";
       if ($row->update()) {

That call to $dbi->escape() was always superfluous, but until now it hadn’t broken anything. After changes we made to other libraries, it added a couple of quotation marks to the value, and we wound up stashing 0s when we updated all of the rates at midnight.
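For anyone wondering how two quotation marks turn into a $0.00 price, here is a minimal sketch of the failure mode. It is not the real code: `quoteLikeEscape()` is a hypothetical stand-in for an escape routine that wraps its argument in quotes, the rate is made up, and it assumes the quoted string eventually gets coerced to a number the way PHP's own float cast does it (the actual coercion may well have happened somewhere else, such as in the database).

```php
<?php
// Hypothetical stand-in for an escape routine that also wraps the value in quotes.
// This is NOT the real $dbi->escape(); it only mimics the "adds quotation marks" part.
function quoteLikeEscape($value) {
    return "'" . addslashes((string) $value) . "'";
}

$multiplier = 0.7652;                        // made-up exchange rate
$escaped    = quoteLikeEscape($multiplier);  // the string '0.7652', quotes included

// A string that begins with a quote character has no leading numeric part,
// so casting it to float gives 0.
$stored = (float) $escaped;

printf("escaped: %s, stored: %.4f\n", $escaped, $stored);
```

The point of the sketch is only that nothing in this path complains: the value quietly becomes zero instead of raising an error.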

Of course, it’s not really as simple as that, because it turns out that:

  • Obviously out-of-range values don’t cause the currency update to blow up.
  • We cache currency conversion rates in not one but two places. One of which happens to be a user’s session.
  • Despite the fact that all of our business is done in USD, we convert every price we display, including the subtotals shown in the shopping cart and at checkout.
  • All of the code that touches this is, after 4 years of cumulative hacks, mindblowingly tangled and completely untested.
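The first item on that list is the cheapest to address. A rough sketch of the kind of guard that would have stopped the update, assuming a hypothetical `assertPlausibleRate()` helper and purely illustrative bounds (neither exists in the code shown above):

```php
<?php
// Hypothetical guard: refuse exchange rates that can't plausibly be right
// before they get written anywhere. The default bounds are illustrative assumptions.
function assertPlausibleRate($code, $rate, $min = 0.0001, $max = 100000.0) {
    if (!is_numeric($rate)) {
        throw new InvalidArgumentException("rate for {$code} is not numeric");
    }
    $rate = (float) $rate;
    if ($rate < $min || $rate > $max) {
        throw new RangeException("rate for {$code} out of range: {$rate}");
    }
    return $rate;
}

// Usage: a sane rate passes through; a zeroed or quoted one stops the update.
foreach (array('0.7652', 0, "'0.7652'") as $candidate) {
    try {
        $ok = assertPlausibleRate('GBP', $candidate);
        print "accepted {$ok}\n";
    } catch (Exception $e) {
        print "rejected: " . $e->getMessage() . "\n";
    }
}
```

Throwing instead of silently storing the value means a bad feed or a bad code change leaves yesterday's rates in place, which is a much better failure than giving everything away.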

In short, this is like a textbook example of how not to do half a dozen things on the web.

Maybe we’ll start doing a regular “learn from SparkFun’s web development mistakes” series.

If you happen to notice anything else broken around here, do me a favor and leave a comment below. And thanks to everybody who has been super chill about the breakage. You all are approximately the nicest and most forgiving customers a nerd could hope for.


55 comments

  • I, for one, really appreciate the transparency and true open source mindset you guys have toward literally everything you do. Two thumbs up and a pink feather boa from me.

  • I love getting to read this stuff. It reminds me that we’re all human. As my favorite office sign says “Let’s make better mistakes tomorrow”

  • Brennen for pres. I’ve seen Brennen pull some startlingly quick updates and implementations out of his hat so quickly that mere mortals drop their jaws. I figure, mistakes and all, you’re still top notch.

  • Your perception of your customers as “the nicest and most forgiving customers a nerd could hope for” I believe is because we’re the type of tinkerers who have been in similar situations ourselves. Maybe not with a database - Maybe with a broken network infrastructure, a complex firewall configuration, a PLD design, or maybe something not even IT related, and in our own experience, there was nobody else around who could even understand the situation, let alone help solve it. We enjoy having a “virtual slice of pizza and jolt cola” with you during your all-nighter, and like to hear the steps you go through to fix it, and it’s good to hear that there are other people working in environments which when looked at with perfect hindsight, you can honestly say “That is a dumb design, I’d NEVER do it that way” when in fact over the years, you did.. you designed and built it that way, because without that hindsight, it was the right thing to do at the time, and now you’re trying to deal with it. It is reassuring to know that we are not alone in that!

    It is this kind of post that keeps many of us coming to your site, day after day, week after week, even if we’re not in the middle of a project, and don’t need any parts. Keep up the transparency!!

    • Well said 172pilot. It’s easy to be critical of the way people have done things, I’m terribly guilty of it myself, but when I read stories like this and comments like yours, I realise that I too make those mistakes all too frequently and shouldn’t be too critical of others that make those mistakes. It’s especially the case when, as you say “there was nobody else around who could even understand the situation, let alone help solve it” - a situation I frequently find myself in. When you are developing new things in software, you are often doing it just to achieve one particular feature, and doing it a particular way often seems OK because it ‘got the job done’ at the time. But years down the track, that might cause all sorts of headaches and issues for you. It’s all too easy to be wise in hindsight.

    • I haven’t had a Jolt Cola since the 90’s. I’ve since quit drinking soda, but I’m pretty sure if I saw one in a store, I wouldn’t be able to resist. Do they still sell them?

      Personally, I find screwups are important, because they keep you humble. Not only that, but you’re far more likely to retain the lessons learned as well. That said, I would much have preferred that we didn’t have the 0 pricing hiccup.

      • You know.. I’m not sure if I ever actually DRANK one of those things myself either, but it definitely does represent the traditional idea of what the geek all-nighter was back then.. :-)

        I was curious enough though, and did find that yes… they still make it, and it even has Jolt COLA - “yes, they still are making it” in the description on Amazon :-)

        I agree on the “screw-ups are important” - My boss used to say “How can I expect you to fix it, if you don’t even know how to break it”, and that really is true..

  • As an IRC user pointed out, pricing has too many Significant Figures. I believe Josh is working on it.

    • I think there is actually a fix sitting in the pipeline waiting to be approved already, but either way it’ll get fixed :)

  • Great write up, and for that you deserve a t-shirt :o)

  • DIY always has its issues! But that’s why it’s fun!! You guys are awesome, and reading all this reminds me that there are still companies out there with a heart. Keep doing what you do!

  • And here I thought I was able to buy 1,000,000 Arduinos and Rasp. Pis for free and no shipping costs! Dang it. Off to look for a second, third, fourth job to pay for the purchase! j/k :)

    • Sorry, I had already click-and-dragged “Sparkfun Electronics” into my basket, so there was nothing left.

  • MySQL with master-master replication works very well for achieving five nines. Database schema changes can be done with zero downtime by performing them on one half of the pair at a time while juggling the application as appropriate. I know because I deployed such a system during my six years working for an online retailer with $10M in annual revenue. I wouldn’t be nearly as worried about switching to PostgreSQL as I would be about getting off of a PHP platform using homebrewed ORMs and web frameworks.

    PHP: A Fractal of Bad Design

    • PHP: A Fractal of Bad Design

      PHP is, all around, a pretty terrible language. (There’s also PHP Sadness, which catalogs the complaints nicely by category.) I think people just starting out in web stuff would do well to learn pretty much anything else first, and considered in a vacuum, it’d be hard to justify using PHP for a new project. I could go on for a while.

      It’s also pretty much the least of our problems on any given day, isn’t going away any time this decade, and contains a bunch of working stuff we’re in no great hurry to write all over again.

      It turns out you sell stuff on the internet with the software you can make today. Often this is not the software that you’d like it to be, let alone the software that will satisfy the paradoxical constraints of avoiding being told by other people on the internet that you’re doing it wrong.

  • Brennen - The errors that you are having with currency amounts showing significant numbers for too many decimal places suggest that you are using floating point numbers to represent currency. This is generally frowned upon because most floating point representations trade off accuracy for the ability to represent a large range of values. If your system is using floating point currency, it could really mess with your financial transactions so thoroughly that you’d never be able to fully sort things out.

    • The use of floating point numbers for the calculation of the on-screen values does not imply that they are used for ALL calculations.

      It’s a good thing that multi-million dollar operations like sparkfun have someone like you who is able to point out the basic flaws in their accounting system without even looking at it.

  • All prices showing USD with 5 decimal places and trailing 002 eg: $10.00002 , mine’s set at GBP

    Browsers tested Google Chrome, Firefox, Opera, Safari and some old thing called IE8.

  • My Wishlist is dropping trailing zeros in both IE and Firefox.

    +1 for Llama

  • Aww… I was kinda hoping it was a surprise free day. There was some mysterious tweet from the Sparkfun twitter account saying something like “go check sparkfun.com right now if you want to see why we’re down.” And there was no information there, except all the prices were $0.00. It was either a subtle free day promoted over Twitter or a bug. ;)

  • One word - Django.

  • I see $CAD prices as exactly the same as $USD prices (with 4 decimal places), which suggests that something’s still a little broken. :)

  • I’m seeing 002 on the end of all the prices ala $4.95002. To reproduce, go to “New Products.” I’m running Chrome on Win 7 64.

    • Thanks for the report! We’ve fixed this in our dev environment, however it won’t roll out until tomorrow morning.

  • Bug: My wish list total is missing the trailing 0 (shows $44.9 instead of $44.90).

  • What, no development server and database cloned from the current live server+database just for testing? It’s not like you are using someone else’s bleeding-edge software product. Think of it this way: SparkFun Inc. is currently using SparkFun Inc’s Developers AG’s software product. Does this make sense? You’re using an internal product with what appears to be minimal testing. Things under development should never go live as the core of your business model without testing it until you know it’s right and reliable. By having an internal-only beta server that is being actively hacked on, you can do off the wall things like try a different DB, simulate purchases and so on, and do everything that the production server does currently without actually doing it for real. You test it by shadowing the current activities by the main server on the development server. That’s the best test, shadowing live usage and seeing if the results agree. Smaller changes and verification of correctness are important. Smaller unit tests of one-shot changes are important. Not testing is nearly unforgivable for a developer. That’s like soldering up a PCB and wiring up a circuit that you know will be plugged into a wall and draw a couple dozen amps without first checking for shorts.

    • What’d I say to give you that idea?

      Everybody who writes web code at SFE runs at least one VM that mirrors our production servers to a pretty good approximation, and is expected to be able to install the application from scratch. We mostly work against snapshots of our production db a day or two out of date. I think we’re currently running half a dozen dedicated beta installations so in-house users can test branches in various stages of development.

      We could be better at everything, particularly automated test coverage, but it ain’t exactly 1998 over here.

      • Let me ask you a question – do those production db snapshots contain cardholder data? Just curious how you manage PCI DSS compliance.

        • Nope.

          • Different database or do you use a black box solution like StrongAuth?

            • Not even the live production database from which we generate snapshots contains cardholder data. Whenever cardholder data is stored it is done so (encrypted) by our credit card gateway and processor - FirstData. We just store a processor-generated token for subsequent charges. It’s by far the most secure way of handling cardholder data because we just pass it through encrypted and never have to look at it, let alone store it. Full details of the program can be found here.

              • That’s good to hear. I wholeheartedly support black box solutions like this. StrongAuth works the exact same way (tokens).

    • Based on what [brennen] wrote it sounds like they do have testing/development environments. With regard to what you are saying it looks like some errors crept in during the development to production migration execution. I don’t know if that could have been caught by testing. For example:

      A long time ago in a galaxy far far away I wrote some error check code to test if a hard drive had enough space to install our product. These were the days of 10 - 20M hard drives on MS-DOS. Pre 1998 brennen :) and yes the module was tested under a variety of hardware configuration assumptions. Except a customer called in and said “hey your product won’t install because it says I don’t have enough free space!” I ask the obvious “how much do you have?” and he replies “700M”.

      Thus [Bob] I learned that the real world is always in motion and what looked like a good size for an integer today, may be totally inadequate tomorrow! The other moral is don’t assume you know what the other guy has/or is doing…

  • Damn! I missed “Free night” at sparkfun!

    I used to be a huge fan of UI tests (I was writing Selenium tests for several jobs), except for the fact that they can be highly unstable, with delays and such you encounter when accessing websites either internally or on the internet. However, I think UI tests for basic functionality of a system, or “What do we NOT want to go wrong” scenarios (giving everything away for free) are good candidates for a UI test suite.

    Do you guys not have integration tests that would test your underlying services at a higher level?

    • We have a still-small but useful set of UI tests for navigating the site, placing orders, etc. Right now we’re using CasperJS for that. After today, we’ll probably be adding some sanity checks to price display.

      The rest of what we’ve got is mostly a set of unit-ish tests written in TAP. Some of them are expansive and cross-cutting enough to catch issues with third party APIs and our interpretation of that data, but there are clearly much bigger gaps than there really should be.

      Edit: And hey, if anybody’s looking for a gig… We’re maybe kind of rethinking how we want that job description to look, but at any rate automated test skills are a significant plus.

      • I’m guessing you folks don’t have remote workers. I just bought a house in WA. Of course half the fun of working at Sparkfun would be all the gadgets I could use in the off hours, and a bunch of people who totally know DIY.

        • “I’m guessing you folks don’t have remote workers.”

          Yeah, for better or worse, we’re really focused on having people in the office most of the time, and WA would be quite a commute…

  • Middle-click and control-click don’t work on links. I start to get edgy if I can’t have my tabs expand at a nearly geometric rate without a right-click. Sounds like there are bigger fish to fry, hopefully before the big fish fry, Friday.

    • Like Erik says, we’ve been trying to track that one down, although it seems fixed in recent Firefox builds. I just want to express my sympathy. I think I must have had about 25 tabs open at least once today. Live and die by that middle-click, myself.

      • Only 25?

        I just opened SparkFun.com for my lunch break. It’s tab number 121.

        I should really clean some of those up…

        • I tend to have about 50, the problem is once they scroll off the screen I tend to forget about them so anything more than that isn’t useful.

          • So I haven’t used this much yet, but I just came across OneTab for Chrome. Interesting approach to the problem - basically just condenses all (non-pinned) tabs down to a list on a single page when you click the button.

    • Can I just ask what browser version you’re using? I had the same problem with Firefox 21.0, but when Ubuntu upgraded it a few weeks back, the problem went away. We are still looking into it, I’m just curious if it’s a fairly isolated problem.

      • The sudoer has worked their magic and Firefox 22.0 now has middle-clickability at SparkFun!

      • Firefox 21.0 on 3.8.13-100.fc17.i686.PAE. I shall add this to the laundry list of things I want my sysadmin to fix.

        • Very much appreciated! Thank you. Maybe this is actually limited to just that one version of Firefox. Maybe.

          There’s obviously still something that makes it happen on our side, though, since the functionality doesn’t appear to be affected on other websites in that particular browser version.

  • Please clear cache and restart service. We got work to do and products to order, no time for troll numbers hehe

  • One thing I’ve been learning recently (and applying for the very first time!) is to test everything. The extreme is to write the test first and code to satisfy it second but I don’t always do that. I’ve been re-writing everything in PHP to be object-oriented just so I can easily test it. I discovered PHPUnit and its database plugin for testing out my code. I can use that to automatically create a mock database with fixed data for easy testing of my code. I then just regularly run the phpunit command in a script before I commit and as part of a Jenkins job that polls my git repository for code changes when a hook script fires.

    http://phpunit.de/manual/current/en/index.html

    • The design pattern is formally called Test Driven Development. The only problem with TDD is that if you don’t have a formal QA team to write the tests that are separate from the people implementing the code, then often you wind up with developers hacking code so that it fits the eccentricities of the test (not always a bad thing) or altering tests so that they pass the code (a very, very bad thing). You also can tend to wind up with a lot of truism tests rather than ones that actually explore edge-cases because you’re so familiar with the code base that you subconsciously start thinking “well no one would ever use this like this…” As a general rule of thumb programmers should always hit themselves if they catch themselves doing that. Someone is always going to use your code in a way you did not intend them to.

      • you subconsciously start thinking “well no one would ever use this like this…” As a general rule of thumb programmers should always hit themselves if they catch themselves doing that.

        Amen to that. I’ve developed the habit of never trusting anything that I receive from a client/browser/evil gremlin. And if the check for a should-be-obvious-and-always-false condition returns true I either spit them back to the landing page or destroy all the things and make them start over. (That’s what they’ll get for trying to pass an ID that isn’t under their account. :-) )

    • We’re working on having something similar to that set up, and we already have a fair number of scripts that check both code and functionality for correctness. That said, we still have a long way to go on this, and what screwed up the pricing was something based on a service that doesn’t run on our test environments (but probably should). Everything looked great when we deployed and the first orders came in, so we went home… and then the service ran. And everything went wrong.

  • One more bug, prices on this page (in US) show up with 5 digit cents: https://www.sparkfun.com/categories/242/

    • And four on the home page. But I noticed that it would always load like that, before my browser dropped the trailing zeros. But they are staying this time.

    • Reminds me of Office Space!

      MICHAEL It’s pretty brilliant. What it does is where there’s a bank transaction, and the interests are computed in the thousands a day in fractions of a cent, which it usually rounds off. What this does is it takes those remainders and puts it into your account.

