A Note On This Week’s Server Maintenance
Over the last 48 hours, we encountered site performance issues across the Pocket platform. The issues were sporadic, and some of you may not have noticed them, but during that time we had the site in maintenance mode for about 2 hours and experienced an additional 10 hours of degraded performance. I wanted to take a moment to provide more insight into what happened and what we did to fix the issues.
On behalf of the entire Pocket team, I want to apologize. We pride ourselves on Pocket being a service that “just works” – providing a trustworthy way to save interesting content for later and always making sure it is available for you to view whenever and wherever you want. Underlying this promise is a platform that we’ve spent a lot of time tuning to ensure that it is fast and responsive for all of our users. In fact, outside of some planned maintenance windows, we haven’t had more than a few minutes of downtime in over a year. However, as the issues above highlight, we weren’t able to keep our promise over the past couple of days.
On Tuesday night we started routine maintenance to swap out our database servers for fresh new ones. This is a process we’ve performed in the past without incident, and it typically requires about 15 minutes of downtime while we swap things out and connect everything back up. We host our services on Amazon Web Services, which has been an excellent partner over the past few years, empowering us to grow with our users and support the myriad ways you have chosen to use Pocket. For this particular upgrade, we decided to switch the instance type to take advantage of some of Amazon’s newer offerings.
Unfortunately, after the upgrade we started noticing some problems. Specifically, the latency of requests to the new database from our web servers increased significantly. We made some adjustments Tuesday night and things seemed to be working again as we went to sleep. Early Wednesday morning, though, our server monitoring alerted us that things were still awry, and we started to dig in further.
By 11 AM PT on Wednesday we had the site stabilized. As we continued to monitor things over the course of the day, we still were not happy with the performance of the new database servers. Although their specs suggested the new databases should be an improvement, in practice they were not. Since our traffic follows a daily cycle (with mornings being the heaviest), we decided to track how things held up into Thursday morning.
On Thursday morning we started seeing elevated error rates again with the additional morning rush, and decided to take corrective action by swapping the database servers back to the original specs and instance type. We completed this second swap at 11:45 AM PT, and since then everything has been back to normal and performing at the level we (and you, our users) expect.
With regard to this specific issue, there is no additional maintenance work to complete. Everything is fixed.
Updates like this are a necessary part of a growing company like Pocket. We’ve learned a few things about how to improve our process going forward. We will also be working with Amazon to understand exactly why the new instance types didn’t perform as we expected they would.
In closing, thank you for your patience and support while we worked through this issue – we love working on Pocket and promise to keep making it better and more reliable for all of you!
If you have any further questions, please feel free to reach out to us via email at firstname.lastname@example.org or Twitter at @pocketsupport.
– Matt Koidin, CTO