A look back at what went wrong, why and how we fixed our issues with data.sparkfun.com
Some of you may have noticed that we had trouble with data.sparkfun.com two weekends ago and into Tuesday the 14th. We are currently up and running. Basically, our hard drives filled up. We now have larger disks supporting the system, and we are building out a new monitoring plan that should give us better visibility into the health and status of data.sparkfun.com.
The longer story: we weren’t monitoring disk usage, at least not in a meaningful way, and not in a way that would prevent the system from locking up. We use a multi-server architecture, but we use complete Mongo replication, so the same data lives on all servers. While this helps fault tolerance if a web server or database goes down, it does nothing to prevent failures caused by full disks. In fact, it ensures that if one server goes down with a disk-full error they all will, because their disks are currently the same size.
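To make "meaningful" disk monitoring concrete, here is a minimal sketch in Python of the kind of check that can raise a warning before a disk fills. It is an illustration only, not the monitoring we are actually building: the 80% and 90% thresholds, the function names, and the path being checked are all assumptions made for the example.

```python
import shutil

# Hypothetical thresholds, chosen for illustration only.
WARN_FRACTION = 0.80
CRIT_FRACTION = 0.90


def classify_usage(used_bytes, total_bytes,
                   warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Return 'ok', 'warning', or 'critical' for a used/total byte count."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= crit:
        return "critical"
    if fraction >= warn:
        return "warning"
    return "ok"


def check_path(path="/"):
    """Check the filesystem holding `path`; return (status, fraction used)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_usage(usage.used, usage.total), usage.used / usage.total


if __name__ == "__main__":
    # "/" is an assumption; point this at the volume holding the database.
    status, fraction = check_path("/")
    print(f"disk usage {fraction:.1%}: {status}")
```

Run on a schedule on every server (from cron, for example), a check like this would flag a filling disk while there is still time to act. Because full replication means every disk fills at roughly the same rate, a warning from one server is effectively a warning for all of them.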
As we tried to restart and provision one of our VMs, it got into a bad state. We probably let it cycle in that state too long. We eventually killed it, created a new VM, and threw it into the cluster. This was a fine solution, but Mongo wanted to fully replicate before it would allow new connections to establish. This data replication took longer than expected.
Data.sparkfun.com is a mouthful to say as well as type, so from here on out I’m just going to call it "Data," and hopefully the good people at Universal Studios don’t mind a loving tribute.
SparkFun has been reminded that we need to be good caretakers of Data. We need to keep an eye on it to ensure it keeps working for our users. Over the next few days and weeks we are going to plan and execute a series of small under-the-hood changes to ensure reliability and robustness for Data. We want you to have the same great user experience; we just need to be better at managing the streams that come in and fill those disks up. As we make these changes we plan to inform the user community about what we are doing.
Data is powered by phant, an open source IoT database that is built and maintained by SparkFun. It’s important to remember that phant was, and still is, operating just fine. This was an infrastructure problem surrounding our implementation and monitoring of core phant. Once we were able to stabilize our VMs, the system started right back up running at full strength. The team was able to get Data back up around 11:00 AM MST Tuesday the 14th of July. Seven hours later we had 381 streams of data updated with over 294,000 data pushes in that timeframe. That is a sign of a stable system being used by a lot of interested people.
This is very exciting news for us at SparkFun, as we like seeing an engaged user community, and the continued and expanded adoption of Data definitely qualifies. I’m very impressed by the Data user community. I’m glad you like our system, I’m glad you are using it, and I hope it continues to be helpful.
Please keep using Data and we’ll do our best to keep it up for you in the future.