July 7, 2014 § 1 Comment
My last blog post covered building a newspaper website with Clojure. I thought it’d be useful to go deeper and write about the problems we suffered when the newly built Clojure cluster occasionally went down. We referred to the issues causing these outages as Cluster Killers.
Cluster Killer #1 – Avout on Zookeeper
Avout is a distributed STM; giving you Atoms and Refs whereby the state is managed in a centralised store such as ZooKeeper. I’m fan of the ideas behind Avout and we used it to store all our config. Config changes in ZooKeeper would be delivered out to cluster nodes on the fly and the relevant STM watchers fired accordingly.
Our bug stemmed from our underlying network having some sporadic temporal issues. To make Avout work, the refs and atoms need to be backed by data in ZooKeeper which in turn meant that each web-app node in our cluster needed a persistent ZooKeeper connection, to keep a watch on the data. This is OK, but occasionally the persistent connection would be dropped and so each node would then need to re-establish the connection. This is still OK, and reconnection is handled automatically by the Java ZooKeeper client. So what was the problem?
The root-cause issue is tricky to explain and it took us a while to diagnose and to successfully reproduce. ZooKeeper has the concept of watchers that watch for state changes. When disconnection occurs between client and server the client-side watchers are fired but not removed. When this happens Avout re-adds the watchers. Hence if you have 5 refs backed by 5 watchers, then after a disconnect this becomes 10 watchers, then 20 etc. Eventually we had over 100K watchers on our nodes and so occasionally they self-destructed. In the middle of the night and at Christmas.
I’m aware that I should probably have fixed the problem in Avout rather than write this post. We tried switching to Atoms in Avout rather than Refs and ran into a completely different bug, and so we ended up dropping the technology altogether and rolled out our own less ambitious ZK/watcher wrapper code.
Personally I wouldn’t advise using Avout again until all the GitHub issues have been fixed and there are reports of others successfully using it in production. It looks like this will happen now as the library has gained a contributor, and so I’m hopeful.
Cluster Killer #2 – Using Riemann out the box
Riemann has achieved cult status in the Clojuresphere and deservedly so. In the past we’ve been tempted to faff around with tangental event handling logic directly in application code, but Riemann gives us a sensible place where we can put this fluidic configuration. I.e. I want to email out an alert only when X happens Y amount of times, but only at a constrained rate of Z per hour.
Riemann is a handy tool to have, but on our project it became a victim of it’s own success. We initially used it for dispatching metrics to graphite and for alerting, yet after a time other teams wanted in on the action. We took our eye of the ball as we welcomed Clojure in through the back door onto other projects. Soon our Riemann config became enriched with rules keeping so much state that the Riemann server suffered an out-of-memory.
OK, surely the Riemann server going down is not a Cluster Killer? Well, by default the Riemann Clojure client uses TCP connections and communicates synchronously with the server waiting for acks. It’s up to the code-users of the client to ensure that they handle the back-pressure building up inside the application when the Riemann client/server pipe becomes blocked.
We hadn’t deeply considered this in our rush to exploit Riemann funkage, and so when our Riemann server went down all our Clojure apps talking to Riemann went down too.
By all means use the Riemann Clojure client, but wrap it with fault tolerant code.
Cluster Killer #3 and #4 – Caching
We were building a newspaper website. As is common we didn’t serve all realtime traffic directly to the users as that would be challenging in the extreme. Instead we made use of cloud caching providers. Occasionally we tweaked the caching timeout rules so that we could serve more traffic dynamically (as was the ultimate goal), but often the wrong tweak in the wrong place could put massively more load on the server pool and the cluster would struggle. I figured that caching configuration was like walking a tightrope; even parts art and science.
On a related note we were using Mustache for our template rendering. Mustache templates need to be compiled ahead of being used for actual HTML rendering. Once a config change went live that set the Mustache compiled template cache time to zero. The CPUs went mental.
What Cluster Killers have taught me
Firstly, is that shit will aways happen. It was embarrassing when our new shit did happen as we built the system to stop the old shit from happening. Worse is that our new shit was shit that people weren’t used to seeing.
I’ve now got more respect for all kinds of monitoring. We had our own diagnostic framework that fired tracer bullets into the live environment and used Riemann to alert on errors. We were quite happy with this, as it meant that many problems were seen only by us, and we could act. We also had fantastic Ops personnel who delivered a Nagios/Graphite solution that gave us a whole manner of metrics on CPUs, heap-sizes etc for which I was grateful. We got skilled up on using AWK and friends to slice up logs files into something meaningful, and profiling the JVM was essential.
Performance testing is key as is A/B testing. I hear about companies (i.e. USwitch) doing great things in this space that I would hope to emulate at some stage.
Lastly I’m now a touch more conservative about using bleeding edge technology choices. Best to spend time browsing the project pages on github before adding to your lein deps.
Any other Cluster Killer stories people want to share?