In Which We Are Not Having Fun

Arduino Day was pretty successful. Unfortunately, everything is broken.

Hey everyone.

As you might know, we had a pretty good sale on a bunch of Arduino products on Saturday.

Well, it turns out it may have been too good. We smashed our previous record for orders in a day, set on last year's Cyber Monday. Back then the high-water mark was around 4,000 orders, and on Saturday we saw almost 8,000 orders flood into the system.

-> What the previous record for orders in a day looked like <-

It's also worth noting that we've made some pretty big changes to our database in the past few months - most notably moving from MySQL to PostgreSQL.

The issue we're seeing today has to do with how we know how much of a given product is available. Availability of a product is a loose term: it has to take into account not only how many physical units of that product we have, but also how many of those units are spoken for on active orders. Active is also a pretty loose term, meaning an order that hasn't shipped. All of these terms are necessarily loose to accommodate the edge cases that come with the volumes we regularly see.
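To make that definition concrete, here's a minimal sketch in Python. It is not our actual code: the `OrderLine` type, the `available` function, and the `WIDGET-1` SKU are all made up for illustration, and it ignores every one of the edge cases mentioned above.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    """One line item on an order (hypothetical shape, for illustration only)."""
    sku: str
    quantity: int
    shipped: bool  # an "active" order is one that hasn't shipped yet


def available(sku, physical_stock, order_lines):
    """Physical units on hand minus units spoken for on active orders."""
    claimed = sum(
        line.quantity
        for line in order_lines
        if line.sku == sku and not line.shipped
    )
    return physical_stock.get(sku, 0) - claimed


# Ten units on the shelf, three claimed by an active order.
# The shipped line no longer claims anything.
stock = {"WIDGET-1": 10}
lines = [
    OrderLine("WIDGET-1", 3, shipped=False),
    OrderLine("WIDGET-1", 2, shipped=True),
]
units_left = available("WIDGET-1", stock, lines)
```

Doing this sum from scratch for every product on every page load is exactly the kind of work you'd want to precompute, which is where the next bit comes in.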

We were pumped about the move to PostgreSQL for the many features it affords, but primarily for materialized views. Building a materialized view to keep track of available stock values really sped things up!

Until this weekend. Apparently having an order of magnitude more active orders in the system makes refreshes on our materialized view for stock take a long time, and this has led to timeouts with heavily diversified orders. So far today it's been a long haul of optimization attempts to make things hum along normally again. We're still hammering away. It's a technical problem in a big system, so there's no such thing as a quick fix.

As we continue to work on this, sparkfun.com will continue to have spotty downtime. We're trying our best to minimize this while fixing the problem at hand, so thanks for being patient.

Update: 3:10PM

A missing index and some other optimizations have sped things up some. We're back to everything functioning again, but we're watching things very closely.

Also, to be more precise about the issue that plagued us: Like most of our back-end systems, our warehouse system (called The Flow) was where the problem started. With so many new orders in the system, the most important thing was to be able to ship them, and it's a complex thing to have thousands of orders with intersecting items that can be meted out to pickers and packers roughly in the order they were placed but only if they are paid (unless they're paying on credit terms) and only if their items are in stock enough that other orders aren't claiming that same stock. It's a fun problem that begets a lot of run-on sentences. There's a massive query in that system to get orders based on even more special picking criteria, and that query was locking up, causing refreshes on the materialized view to stall, causing further timeouts down the chain.
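For the curious, the picking rules in that run-on sentence boil down to something like the sketch below. This is a toy in Python, not the real query in The Flow: the `Order` fields and the `pickable_orders` function are hypothetical, and it assumes that only orders which pass every check get to claim stock (unpaid or partially fillable orders claim nothing here, which is a simplification).

```python
from dataclasses import dataclass


@dataclass
class Order:
    """Hypothetical order record, for illustration only."""
    order_id: int
    placed_at: int      # any sortable "when was it placed" value
    paid: bool
    credit_terms: bool  # customers on credit terms may ship before paying
    items: dict         # sku -> quantity


def pickable_orders(orders, stock):
    """Return order ids ready for picking, oldest first."""
    remaining = dict(stock)  # copy so the caller's stock isn't mutated
    ready = []
    for order in sorted(orders, key=lambda o: o.placed_at):
        # Only paid orders qualify, unless the customer pays on credit terms.
        if not (order.paid or order.credit_terms):
            continue
        # Every item must still be in stock after earlier orders took theirs.
        if all(remaining.get(sku, 0) >= qty for sku, qty in order.items.items()):
            for sku, qty in order.items.items():
                remaining[sku] -= qty
            ready.append(order.order_id)
    return ready


orders = [
    Order(1, placed_at=1, paid=True, credit_terms=False, items={"A": 2}),
    Order(2, placed_at=2, paid=False, credit_terms=False, items={"A": 1}),
    Order(3, placed_at=3, paid=False, credit_terms=True, items={"A": 2}),
    Order(4, placed_at=4, paid=True, credit_terms=False, items={"A": 1}),
]
ready = pickable_orders(orders, {"A": 4})
```

Order 2 is skipped because it's unpaid, and order 4 loses out because orders 1 and 3 already claimed all four units. Now imagine that with thousands of orders, intersecting items, and the extra criteria we left out, all in one query.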

Were we just the users and not the builders of this system, the problem might never have happened. Or it might have happened and been impossible to fix without a paid support contract. Impossible to say. Either way, spending the day fighting this has not earned ire from the rest of the SparkFun crew that was left waiting for the breakage to subside. Patience was what we received, along with coffee, liquor, and Easter candy (in that order). For that we are grateful. =)

Update: 4:10PM

And now Tim brought us a keg of Easy Street for our efforts. Maybe we should unintentionally break things more often...