July 7, 2014 § 1 Comment
My last blog post covered building a newspaper website with Clojure. I thought it’d be useful to go deeper and write about the problems we suffered when the newly built Clojure cluster occasionally went down. We referred to the issues causing these outages as Cluster Killers.
Cluster Killer #1 – Avout on Zookeeper
Avout is a distributed STM, giving you Atoms and Refs whose state is managed in a centralised store such as ZooKeeper. I’m a fan of the ideas behind Avout and we used it to store all our config. Config changes in ZooKeeper would be delivered out to cluster nodes on the fly and the relevant STM watchers fired accordingly.
Our bug stemmed from our underlying network suffering sporadic, transient issues. To make Avout work, the refs and atoms need to be backed by data in ZooKeeper, which in turn meant that each web-app node in our cluster needed a persistent ZooKeeper connection to keep a watch on the data. This is OK, but occasionally the persistent connection would be dropped and so each node would then need to re-establish it. This is still OK, as reconnection is handled automatically by the Java ZooKeeper client. So what was the problem?
The root cause is tricky to explain and it took us a while to diagnose and successfully reproduce. ZooKeeper has the concept of watchers that watch for state changes. When disconnection occurs between client and server, the client-side watchers are fired but not removed. When this happens Avout re-adds the watchers. Hence if you have 5 refs backed by 5 watchers, then after a disconnect this becomes 10 watchers, then 20, and so on. Eventually we had over 100K watchers on our nodes and so occasionally they self-destructed. In the middle of the night, and at Christmas.
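The arithmetic of the leak is worth spelling out. A pure sketch (numbers illustrative, not taken from our logs):

```clojure
;; Each disconnect leaves the old client-side watchers registered while
;; Avout re-adds a fresh copy of every one -- the total doubles each time.
(defn watchers-after [initial-watchers disconnects]
  (nth (iterate #(* 2 %) initial-watchers) disconnects))

(watchers-after 5 1)  ;; => 10
(watchers-after 5 15) ;; => 163840, comfortably past the 100K mark
```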
I’m aware that I should probably have fixed the problem in Avout rather than write this post. We tried switching from Refs to Atoms in Avout and ran into a completely different bug, and so we ended up dropping the technology altogether and rolled our own less ambitious ZK/watcher wrapper code.
Personally I wouldn’t advise using Avout again until all the GitHub issues have been fixed and there are reports of others successfully using it in production. It looks like this will happen now as the library has gained a contributor, and so I’m hopeful.
Cluster Killer #2 – Using Riemann out of the box
Riemann has achieved cult status in the Clojuresphere, and deservedly so. In the past we’ve been tempted to faff around with tangential event-handling logic directly in application code, but Riemann gives us a sensible place to put this fluid configuration. E.g. I want to email out an alert only when X happens Y amount of times, but only at a constrained rate of Z per hour.
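For a flavour of what such a rule looks like as a Riemann stream, here is a config sketch — the service name and addresses are invented for the example:

```clojure
;; riemann.config sketch: email an alert when the service goes critical,
;; but batch and send at most two mails per hour.
(let [email (mailer {:from "riemann@example.com"})]
  (streams
    (where (and (service "article-render") (state "critical"))
      (rollup 2 3600
        (email "oncall@example.com")))))
```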
Riemann is a handy tool to have, but on our project it became a victim of its own success. We initially used it for dispatching metrics to Graphite and for alerting, yet after a time other teams wanted in on the action. We took our eye off the ball as we welcomed Clojure in through the back door onto other projects. Soon our Riemann config became enriched with rules keeping so much state that the Riemann server suffered an out-of-memory crash.
OK, surely the Riemann server going down is not a Cluster Killer? Well, by default the Riemann Clojure client uses TCP connections and communicates synchronously with the server, waiting for acks. It’s up to users of the client to ensure that they handle the back-pressure building up inside the application when the Riemann client/server pipe becomes blocked.
We hadn’t deeply considered this in our rush to exploit Riemann funkage, and so when our Riemann server went down all our Clojure apps talking to Riemann went down too.
By all means use the Riemann Clojure client, but wrap it with fault tolerant code.
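A sketch of what “wrap it with fault tolerant code” can mean in practice. This is not the Riemann client API itself — `send-fn` stands in for the real client call — but it shows the shape: a bounded queue the application hands events to without ever blocking, and a daemon thread that absorbs failures:

```clojure
(import '[java.util.concurrent LinkedBlockingQueue TimeUnit])

;; Callers hand events to a bounded queue and never block; a daemon
;; thread drains the queue, swallowing send failures. When the queue is
;; full, offer drops the event -- losing a metric beats taking the app
;; down with the Riemann server.
(defn fault-tolerant-sender [send-fn capacity]
  (let [q (LinkedBlockingQueue. (int capacity))]
    (doto (Thread.
            (fn []
              (while true
                (when-let [ev (.poll q 100 TimeUnit/MILLISECONDS)]
                  (try (send-fn ev) (catch Exception _))))))
      (.setDaemon true)
      (.start))
    (fn [event] (.offer q event))))
```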
Cluster Killer #3 and #4 – Caching
We were building a newspaper website. As is common, we didn’t serve all traffic to users in realtime, as that would be challenging in the extreme. Instead we made use of cloud caching providers. Occasionally we tweaked the caching timeout rules so that we could serve more traffic dynamically (as was the ultimate goal), but often the wrong tweak in the wrong place could put massively more load on the server pool and the cluster would struggle. I figured that caching configuration was like walking a tightrope; equal parts art and science.
On a related note, we were using Mustache for our template rendering. Mustache templates need to be compiled ahead of being used for actual HTML rendering. One day a config change went live that set the compiled-template cache time to zero. The CPUs went mental.
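To make that failure mode concrete, here is a sketch of a TTL-based compile cache (`compile-fn` stands in for Mustache template compilation; names invented). With a positive TTL each template compiles once per period; with a TTL of zero every render recompiles:

```clojure
;; Cache compiled templates for ttl-ms. A TTL of zero means every lookup
;; misses, so every render pays the compile cost -- the CPU spike above.
(defn ttl-cached [compile-fn ttl-ms]
  (let [cache (atom {})]
    (fn [template-name]
      (let [now (System/currentTimeMillis)
            {:keys [at v]} (get @cache template-name)]
        (if (and at (< (- now at) ttl-ms))
          v
          (let [v (compile-fn template-name)]
            (swap! cache assoc template-name {:at now :v v})
            v))))))
```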
What Cluster Killers have taught me
Firstly, shit will always happen. It was embarrassing when our new shit happened, as we built the system to stop the old shit from happening. Worse, our new shit was shit that people weren’t used to seeing.
I’ve now got more respect for all kinds of monitoring. We had our own diagnostic framework that fired tracer bullets into the live environment and used Riemann to alert on errors. We were quite happy with this, as it meant that many problems were seen only by us, and we could act. We also had fantastic Ops personnel who delivered a Nagios/Graphite solution that gave us all manner of metrics on CPUs, heap sizes etc., for which I was grateful. We got skilled up on using AWK and friends to slice log files into something meaningful, and profiling the JVM was essential.
Performance testing is key, as is A/B testing. I hear about companies (e.g. uSwitch) doing great things in this space that I hope to emulate at some stage.
Lastly, I’m now a touch more conservative about bleeding-edge technology choices. Best to spend time browsing the project pages on GitHub before adding to your lein deps.
Any other Cluster Killer stories people want to share?
March 16, 2014 § Leave a Comment
This screencast demonstrates how you can keep track of the hosts/IPs that you frequently remote-REPL into using CIDER.
February 27, 2014 § Leave a Comment
This screencast demonstrates some recent updates to Emacs Cider around working with multiple REPLs. This covers what’s in the relevant section in the Cider README.
Hope it’s useful.
February 24, 2014 § 8 Comments
Let’s be frank about it; the MailOnline isn’t to everyone’s taste. As the world’s biggest newspaper website it is a guilty pastime for many. It has some decent editorial content but it can also be distressingly shallow.
I worked there for a year. I was lured by the opportunity to rebuild the old website system in Clojure. Whilst some in my circle have been furthering mankind over at the Guardian, I’ve been working for alternative forces.
There is a thread of arrogance in my desire to join a big media company and petition the building of a new system. The developers I found there humbled me in being some of the most personable and most talented I’ve had the pleasure of working with. Not to mention diverse.
When I first joined this project Fred George had been extolling the values of Programmer Anarchy. I don’t want to get bogged down in discussing the particulars of this software methodology, suffice to say that it is not a software methodology. Programmer Anarchy was Fred’s way of putting developers more in charge where a command and control regime had previously existed. It was a means to shake up an establishment which has since successfully led to durable self organising teams leveraging modern technologies.
I arrived as “anarchy” was rife and the continents had yet to form. The rewriting of the public facing newspaper website had yet to begin and so there was a clear opportunity to put a stake in the ground.
The reality was that I was fortunate. The incumbent CTO had already put in place a data pipeline where all the data needed for the website went from the old system straight into an ElasticSearch instance. He knew that the only way to begin slaying the old publishing/editing system was to free up the data. Future developers could now build a new Eden by exploiting this easily accessible NoSQL store.
Before Clojure (BC), the CTO had the idea that competing teams would trial Node vs Ruby vs X solutions to see how best to exploit the data for building the next-generation website. I was joined on my first day by Antonio Terreno and we soon started hacking on Clojure to see what we could do.
After a couple of weeks we presented a prototype of the website back to the wider team. During the presentation we demonstrated how easy it is to hit ElasticSearch from the REPL and to mash Clojure data structures against HTML templates to get the desired output. We showed how you could easily arrange sequences of articles using functions such as partition-by to get table layouts. With website content broken up into simple data structures, and Clojure’s stupendously powerful sequence library, it all becomes very easy.
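A minimal flavour of it, with made-up article data:

```clojure
;; Articles carry a row hint; partition-by splits the flat sequence into
;; table rows wherever the hint changes. (Data invented for illustration.)
(def articles
  [{:title "PM resigns"      :row 1}
   {:title "Markets tumble"  :row 1}
   {:title "Cat saves child" :row 2}
   {:title "Weather warning" :row 2}
   {:title "Sport roundup"   :row 2}])

(map #(map :title %) (partition-by :row articles))
;; => (("PM resigns" "Markets tumble")
;;     ("Cat saves child" "Weather warning" "Sport roundup"))
```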
After the presentation there wasn’t much appetite to build similar efforts in Ruby or Node*, and so Clojure was declared the winner.
There was nothing onerous about what we had achieved. We used some functional programming to get the data into a form that it could be easily rendered, and we were organised about how we pulled back data from ES as to get good performance.
Yet fundamentally we hadn’t written much code. I raised this sheepishly with the CTO and his response was: “that’s how I know it’s the right solution”. Clojure is the winner here.
A year and 20k LOC later, the website is predominantly Clojure-based. The home pages, main channel pages, and articles are served through Clojure, along with a mobile version of each. The old platform will linger on serving legacy parts of the website that need to be updated, but its days are numbered. Killing off the various dusty corners of the website requires business stakeholder input as well as bullish developers.
The bulk of the work was (and still is) around having a production-ready platform: monitoring, diagnostics, regression testing, metrics gathering, performance tuning and configuration management**. We have a variety of back-end services which interact with systems such as Facebook and Twitter, and we’ve significantly upgraded the data-pipe that carries data from the old system into the new. We also have a significant web application that proxies the main web application, manipulating the HTML using Enlive to feed it into a CMS for editors to use.
All this is really only phase one. There is a pent up demand from the business for new functionality and many separate sub-systems that need to be rewritten.
For me, the real win has been how the wider development team has embraced Clojure. By the end of my year there were devs hacking on elisp to make their development environments more personal. We have vim warriors doing their vim warrior thing. There have been numerous Clojure open-source contributions and involvement in the London Clojure community scene. We’ve been able to spawn Clojurians in various countries outside the UK. We had a couple of presentations at the Berlin EuroClojure and I had a world-beating hangover afterwards for the plane ride home.
I’m sad to move on but I leave behind a strong team who have fully embraced Clojure and love the technology. My immediate future is working for JUXT full time (juxt.pro).
* Node.js is a dominant choice for various self-contained public facing web applications.
** I plan to do a couple of future posts about the technical particulars.
June 5, 2013 § 10 Comments
I got some comments from Renzo on a previous post I wrote about Clojure and testing, and some feedback on a recent talk I gave, via email and chat down the pub afterwards. I ponder some of these points here.
I’ve made the point that using Clojure really does force a reappraisal of the TDD situation, especially if you’re coming from an OO background. I still think this is the case, but at the same time I acknowledge that there’s nothing uniquely special about Clojure that makes this so.
Renzo suggests that some people are fine without tests because ‘they can easily grasp a problem in their head and move on’. I agree with this but I also think that the language of choice has a direct bearing. Put bluntly, I think that most people using Clojure will write fewer unit-tests compared to traditional OO programming.
It’s all about feedback.
If you were to create a shopping list of things you really want for your development experience, then what would you put at the top? I think most people would put ‘rapid development’ in first place. I.e. you want to make a change somewhere in your codebase, hit F5 in your browser or wherever, and see the changes appear immediately.
And to be honest I think this aspect of development – rapid feedback – is absolutely key. If you can just change code and see the results immediately, then it really does lessen the value of writing actual test classes to gauge the results of intended changes. This may not be completely healthy, and one could imagine more cases of sloppy hacking as a result, but I’d be surprised if there wasn’t a correlation between fast feedback of changes and the number of unit-test classes written. For example, I’d be interested to know if Java devs using the Play web framework end up writing fewer unit-tests as a result of moving from Spring/Struts etc. I’d expect so.
Clojure is nothing special for giving rapid feedback on changes; just glance across at the dynamic scripting languages such as Ruby, Python and PHP. But if you’re coming from the enterprise Java world, where I’ve spent some years, then this is one of the first things that really knocks you on the head.
I think the next item on the list of ‘top development wants’ will often be an interactive command prompt; the REPL, for reasons that are well documented.
I gave a recent talk and one person said that it feels strange for him as a Python developer that I and others are talking up a tool that so many have taken for granted for decades, like we’re raving about this thing called the wheel.
This is a fair point, but it doesn’t take away the amazing difference for devs coming from the traditional statically typed OO landscape, where the REPL is not so much employed.
And I think this does have a direct effect on the TDD process. Primarily, you have a place to explore and to play with your code. Put the other way round: if you don’t have a REPL and you don’t have ultra-fast feedback on changes, then what do you do? You’d probably want to write lots of persistent unit-tests, or main methods, just so that you can pop into the codebase and run small parts of it.
FP and Immutability
Immutability makes for a saner code-base. I make the point that if you work with a language that’s opinionated in favour of propagating state changes, then it would be perfectly reasonable to want more tests in place, to pin desired behaviour down.
FP and dynamic languages lead to a lot less code. There’s less ceremony, less modelling. Because you’re managing less code you do fewer large-scale refactorings. Stuff doesn’t change as much once it’s been written. The part of traditional TDD that leaves behind test classes for regression purposes is less needed.
It’s my current opinion that what you’re left with from TDD, once you have amazingly fast feedback and a REPL, is regression testing.
This is still needed of course, but there’s huge value in treating it as a concern separate from other forms of testing. I’ve been guilty of seeing it all wrapped up in TDD, rather than as something distinct warranting its own attention.
In some respects deciding how many regression tests to apply becomes a bit like deciding how much performance tuning you need. It depends on the business context, and it’s probably best not done prematurely, but with consideration and measurement.
There is an argument that unit-tests serve as live documentation. I would question this. Although some FP code can be difficult at first glance, it can still be written expressively, using vars with names longer than one character. And there’s always comments with examples.
This is my view on why devs using Clojure seem to write fewer unit-tests. I wouldn’t go to the next level of dogma and dare say that devs shouldn’t ever write unit-tests. In fact I’d expect any unit-tests on a Clojure project to be of high quality and to really have a purpose. I’m also a bit reluctant to judge TDD as a whole. I think TDD has taught masses of devs, myself included, how to effectively refactor, and by some definition of TDD we’re probably still doing it, whether it’s through the REPL or just by hitting F5.
May 21, 2013 § 6 Comments
This post is a continuation of my earlier ‘Clojure at a Bank’ posts. I’ve since left the bank and am working for a large newspaper company, fortunately for me still writing Clojure.
It’s an obvious point to make, that different projects can have very different testing demands. At the bank we managed a throughput of financial products so it was critical that we got no surprises. Prod deployments were often like moon-landings, staged well in advance with lots of people in mission control.
At the newspaper it’s a bit different. Whilst bugs are still not to be warmly welcomed with a red carpet, the direction of investment can be shifted away from pre-emptive regression testing to active monitoring, along with some A/B testing.
Though the contexts of projects can differ wildly, I can still mull over some very familiar debates and practices.
TDD can get a bad rap because it’s so often thoughtlessly applied. People used to hold up 90-100% coverage metrics like they meant something, their eyes glazing over from staring at the IDE bar going red/green all day. I know this because I’ve experienced that dogmatic mentality myself, churning out little unit-test classes because that’s how I thought software should be written. A more senior colleague once espoused to me, when surveying our plateau of tests, that ‘writing production code is hard’. He was right.
I think that most people would concur that having lots of unit-tests can be very expensive. Deciphering mock scaffolds is little fun, as is encountering pointless tests around wiring. Refactoring prod code can be gleeful, but refactoring other people’s tests is just a pain.
But none of this is an argument against TDD, rather a recoil against how it’s dogmatically applied. The same goes for people holding up massive opinions about Scrum and puritanical XP, and frankly OO at times. Nearly everything sucks if it’s done by the book with no human contemplation and is not continuously improved.
Still, I think that adopting Clojure really does force a reappraisal of the testing situation.
Firstly, I think that everyone still does TDD; it’s just that you use the REPL instead of writing actual test files. Some scaffolding you might keep around for future use, but it has to earn its keep.
Secondly, immutability is another serious game changer, as you’d expect with Clojure. Testing functions at a command-line REPL that have simple data-in, data-out contracts is trivial. Compare this to OO land, where you’re passing mutable state around, and the contrast becomes apparent. Code written in a style that is more likely to have side effects does need a higher rigour of testing, where you can empathise more with the compulsion to break everything down into very small units and leave tests behind around them, just so that you can be sure their behaviour is pinned down.
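For instance, with a toy pure function (invented for illustration), the REPL session is the test — values in, values out, nothing to mock:

```clojure
(require '[clojure.string :as str])

;; A pure data-in, data-out function: slugify a headline for a URL.
(defn headline->slug [title]
  (-> title
      str/lower-case
      (str/replace #"[^a-z0-9]+" "-")
      (str/replace #"(^-)|(-$)" "")))

(headline->slug "Clojure at a Newspaper!") ;; => "clojure-at-a-newspaper"
```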
There may be a problem that devs don’t write enough tests in Clojure-land, and strangely the reverse can also be true, as devs over-compensate by adding tests of questionable value. I’ve been guilty of this by adding tests around the nitty-gritty kernel of a codebase that have never failed the build and that no-one’s probably ever read.
This is an area where I feel I’ve learnt some. My starting point is that I’ve seen projects basically killed by people going mad with ‘specification by example’ styled tests. Fit, Concordion, JBehave… DSL frameworks that encourage non-techies to write a boatload of tests that are later unmaintainable. I’ve seen occasions where sometimes it appears to work out, like when the DSL is written in Java all the way down, but most of the time it’s a large fail.
At the bank we had thousands of FIT tests and I still fantasise about going back and removing every last one of them. They mashed together the ideas of outside-in TDD, talking to the business, talking to the QAs, regression testing and system documentation. Of these concerns, regression testing was the only persistent benefit. The cost of them was huge, including slow run-times and lots of duplicated HTML and Java to manage.
Our Clojure efforts led us into different territory. Working on a sub-system and with regression as our prime concern, we stored up a large catalogue of test input and output expectation datasets – ‘fixtures’. We then made the fixtures queryable and stored them in ElasticSearch. From the REPL you could easily pull back some choice fixtures and then send them through the codebase. Because the fixtures contained expected outputs, we had some nice HTML pass/fail reports produced for results. We got some of the benefits of FIT for a fraction of the ongoing cost.
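Stripped of the ElasticSearch querying, the heart of the approach looked something like this (`process` stands in for the code under test; fixture data invented):

```clojure
;; Each fixture pairs an input document with its expected output. In
;; reality these were JSON documents queried back from ElasticSearch;
;; here they are inline.
(defn run-fixtures [process fixtures]
  (for [{:keys [id input expected]} fixtures]
    {:id id :pass? (= expected (process input))}))

(def fixtures
  [{:id "fx-1" :input {:price 100} :expected {:price 100 :vat 20}}
   {:id "fx-2" :input {:price 50}  :expected {:price 50 :vat 10}}])

(defn process [doc] (assoc doc :vat (/ (:price doc) 5)))

(run-fixtures process fixtures)
;; => ({:id "fx-1", :pass? true} {:id "fx-2", :pass? true})
```

From results like these it is a short step to the HTML pass/fail reports mentioned above.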
This approach is often referred to as data-driven testing. The tests in our case were just frozen JSON documents. As I was leaving we played with extending this approach to cover the whole of the system. Even though our system managed a complicated life-cycle of financial products, it was still possible to watch a product going through it and make observations that you could then store as JSON fixture data. You could theoretically record everything that happened as a result of the FIT tests being run and then use this data to remove them.
I regret that we didn’t tackle the testing mess head-on across the whole system before doing anything else major. But to be fair we couldn’t have done this without the learnings we got from the Clojure rewrite of a sub-system, and what we picked up around testing in particular.
I’m going to write a subsequent post about how we’ve gone about testing at the newspaper.
May 2, 2013 § 4 Comments
On the last couple of Clojure projects I’ve been on, the team has invested a bit of time in pimping the REPL. Here are some of the things we’ve got happening when a custom bootstrap REPL namespace is loaded:
Dev of the day
Because it’s nice to make people feel special, show the dev of the day:
This can be pulled from Git:
This code needs the command line tool figlet installed.
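A sketch of how this can be wired up, with the name-picking kept pure so that only the final banner needs git and figlet on the path:

```clojure
(require '[clojure.java.shell :as sh]
         '[clojure.string :as str])

;; Parse `git shortlog -sn` output and pick a name, seeded by the date so
;; the whole team sees the same dev all day.
(defn pick-dev [shortlog-out date-str]
  (let [names (->> (str/split-lines shortlog-out)
                   (map #(str/trim (last (str/split % #"\t"))))
                   (remove str/blank?)
                   vec)]
    (when (seq names)
      (nth names (mod (hash date-str) (count names))))))

;; At bootstrap (needs git and figlet installed):
;; (println (:out (sh/sh "figlet"
;;                       (pick-dev (:out (sh/sh "git" "shortlog" "-sn" "HEAD"))
;;                                 (str (java.util.Date.))))))
```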
Print the last few Git commits
Make the project feel alive with some recent commits:
May as well show the Git branch whilst you’re at it:
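Both are one-line shell-outs. A hedged sketch (it degrades to printing nothing when run outside a git repository or without git installed):

```clojure
(require '[clojure.java.shell :as sh]
         '[clojure.string :as str])

;; Print the last n commits and the current branch at REPL bootstrap.
(defn print-recent-commits [n]
  (try
    (let [{:keys [exit out]} (sh/sh "git" "log" "--oneline" (str "-" n))]
      (when (zero? exit) (println out)))
    (let [{:keys [exit out]} (sh/sh "git" "rev-parse" "--abbrev-ref" "HEAD")]
      (when (zero? exit) (println "On branch:" (str/trim out))))
    (catch java.io.IOException _ nil)))

(print-recent-commits 5)
```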
A dev on our team wanted to draw a picture. Rather than using something like OmniGraffle he stuck to the principle of always using Emacs for everything, no matter how painful or irrational. So he fired up artist-mode and started sketching. In one of our projects we’ve now got a nice ASCII sequence diagram explaining the flow to the wandering REPL dev. The ASCII itself is stored as a txt file, slurped in and displayed upon bootstrap.
We once had a sequence of project quotes stored in a namespace and the REPL displayed one of them using rand-nth. It’s a nice way to start the day being confronted with messages like ‘The Horror.. The Horror!’, or quotes from Macbeth, Forrest Gump and various gangster movies. If you get sick of the same quotes then add a new one.
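The quotes namespace boils down to very little (quotes here are my stand-ins):

```clojure
;; A vector of quotes and rand-nth is all it takes.
(def quotes
  ["The Horror.. The Horror!"
   "Out, damned spot! Out, I say!"
   "Life was like a box of chocolates."])

(println (rand-nth quotes))
```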
Since my current project is for a newspaper company, we’ve got a random headline showing up. You could in theory sit down and read the news (a flavour of it, anyway) by repeatedly recompiling the REPL namespace.
- If you’ve got a web-app, launch it.
- Display what is running, ports etc.
- Use stuff – (:use [clojure.pprint] [clojure.repl])
- Config – show what’s in ZooKeeper or whatever.
Useful REPL functions: our bootstrap code has a bunch of handy functions for making life easier for the dev. For example, pre-loaded queries against some data-store, or helper functions for data structures you’re always munging. We have a separate repl-helpers NS that contains them and we list the contents using ns-publics.
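Listing the helpers at bootstrap can be as simple as this sketch (clojure.set stands in for the repl-helpers namespace):

```clojure
(require '[clojure.string :as str])

;; Print each public fn of a helper namespace with the first line of its
;; docstring.
(defn show-helpers [ns-sym]
  (require ns-sym)
  (doseq [[fn-name fn-var] (sort (ns-publics ns-sym))]
    (println fn-name "-"
             (first (str/split-lines (or (:doc (meta fn-var)) ""))))))

(show-helpers 'clojure.set)
```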
I’m interested in what other people/teams have done to make life more interesting in the REPL. What’s cool and fun?
March 14, 2013 § Leave a Comment
I’ve written a lein plugin for deploying ring apps to the cloud using Pallet, by first pulling down from a private GitHub repo and then registering the invocation of lein ring server as an upstart service. It’s tested to work with EC2 and an Ubuntu instance.
It was written at first for my own needs, but it took me some time to get working, so it may be helpful to others. It’s written against Pallet 0.8, which is the latest version. It’s up on GitHub primarily as a reference example, so use it as you see fit; rip the code and do whatever.
This is a lein plugin so if your app is simple you could just put the pallet config in your project.clj file and you’re set.
There are a few ways to get your stuff into the cloud that I’m aware of. If you’re using EC2 you could just use lein-beanstalk. Or you could skip the complexities of pulling from GitHub and remote-file your uberjar across and then fire it up – although you’ll still need to consider using upstart/nohup. A couple of advantages of pulling from GitHub are that a) you can more easily separate the deployment code from your app code, as the dependency between the two is just a GitHub URL, and b) it saves having to transfer around a large binary, which can be an issue if you don’t have fast internet.
February 9, 2013 § 1 Comment
I read Paul Ingles’ post “From Callbacks to Sequences”, which led me to Christophe Grand’s handy pipe function. The pipe function gives you two things wrapped in a vector: a sequence to pass on for downstream consumption, and a function for feeding the pipe, which in turn adds more to that sequence. I’ll reproduce the code here, with a slight alteration to limit the underlying queue and so avoid OOM:
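Reconstructed from memory, the altered pipe looks like this (the bounded queue is the only change from Christophe Grand’s original):

```clojure
;; Returns [seq feeder]. The feeder blocks when the bounded queue is full
;; (hence no OOM); calling it with no args closes the pipe. NIL and EOQ
;; are sentinel objects so that nil values and end-of-queue can travel
;; through the LinkedBlockingQueue, which rejects real nils.
(defn pipe [size]
  (let [q   (java.util.concurrent.LinkedBlockingQueue. (int size))
        EOQ (Object.)
        NIL (Object.)
        s   (fn s [] (lazy-seq
                       (let [x (.take q)]
                         (when-not (identical? EOQ x)
                           (cons (when-not (identical? NIL x) x) (s))))))]
    [(s) (fn ([] (.put q EOQ))
           ([x] (.put q (if (nil? x) NIL x))))]))
```

Feed with `(feeder x)`, close with `(feeder)`, and hand the sequence to whoever wants to consume it.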
This is great if you want to have a thread pushing stuff into the pipe using the ‘feeder’ function, and then to pass around a lazy sequence representing whatever is in the pipe. Nice.
I had a situation which was a bit different:
- I have a sequence and I want to do some processing on it to return another lazy sequence, exactly like what the map function gives you.
- But actually I want to do some parallel processing of the sequence, so pmap would seem like a better choice.
- In addition, I want to be able to govern how many threads are used during the pmap-style operation, and I would also like to explicitly set some kind of internal buffering so that downstream consumers will nearly always have stuff to work on.
We used the pipe function to write this.
This basically uses the pipe function, with n threads feeding the pipe and then closing it when all the threads are done. The act of closing is performed by a supervisor thread using a countdown latch.
Here’s an example creating a pipe-seq and consuming it:
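A self-contained sketch of the whole arrangement — pipe, the pipe-seq built on it, and the example run (function shapes are my reconstruction, not the exact production code):

```clojure
;; pipe, per Christophe Grand, with the bounded queue to avoid OOM
(defn pipe [size]
  (let [q   (java.util.concurrent.LinkedBlockingQueue. (int size))
        EOQ (Object.)
        NIL (Object.)
        s   (fn s [] (lazy-seq
                       (let [x (.take q)]
                         (when-not (identical? EOQ x)
                           (cons (when-not (identical? NIL x) x) (s))))))]
    [(s) (fn ([] (.put q EOQ))
           ([x] (.put q (if (nil? x) NIL x))))]))

;; pipe-seq: n-threads workers apply f to items pulled from a shared work
;; queue and feed results into the pipe; a supervisor thread closes the
;; pipe once the countdown latch says every worker is done. Results come
;; back in completion order. Assumes no nil inputs (nil signals an empty
;; work queue).
(defn pipe-seq [f n-threads queue-size coll]
  (let [[s feeder] (pipe queue-size)
        work  (java.util.concurrent.LinkedBlockingQueue.
                ^java.util.Collection (vec coll))
        latch (java.util.concurrent.CountDownLatch. n-threads)]
    (dotimes [_ n-threads]
      (future
        (loop []
          (if-let [x (.poll work)]
            (do (feeder (f x)) (recur))
            (.countDown latch)))))
    (future (.await latch) (feeder)) ; supervisor: close when all done
    s))

;; 100 items, one second of "work" each, ten threads => about ten seconds
(time (dorun (pipe-seq (fn [x] (Thread/sleep 1000) x) 10 20 (range 100))))
```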
The above takes ten seconds – ten threads consuming the seq.
We’ve got some code chaining up a few calls to pipe-seq, with the consumer function fanned out to different virtual thread-pools of different sizes, depending on the type of operation being performed.
One consideration of this approach is that the queues are kept as an internal detail of the pipe-seq function. If you want visibility of the queues, this could be an issue and the approach may not be for you.
Feedback welcome, particularly if there’s a better way.
December 27, 2012 § 12 Comments
There’s often debate about whether newcomers should try to learn both Clojure and Emacs at the same time. My take on it is that yes, they should, and here are five reasons why.
Smart IDEs are like powerful drugs. They treat the sick and the dying wonderfully well but tend to concentrate on the symptoms more than the problem: that of bloated code. They are great for refactoring around masses of static types, but you do so inside a walled garden, in that the features you want are built into the IDE itself, and therefore you have to rely mostly on the IDE developers to make things better. The modern powerhouse IDEs are complected.
Emacs at heart is a basic text editor, one that has been around for decades with extensibility at its core. Because of this, and the fact it’s open source, the plugins for Emacs are essentially limitless in what they can do. There are vast amounts of major and minor modes and colour themes. Org mode awaits you.
It’s this ability to tailor Emacs so much that makes it feel liberating; you can do what you want with it. And if you really don’t enjoy a spot of configuration-masochism then there are many ‘starter kits’ on the web that will quickly get you going – see the ‘Emacs Starter Kit’ for starting out with Clojure. You could postpone engaging with Paredit mode if you wish, and make use of Cua mode to bring the sanity back to copy and pasting. The feeling of liberation can always be throttled.
Moving from a statically typed language to Clojure is not easy. With this in mind, are we really making things that much easier by encouraging devs to stick things out in a familiar IDE? Without a pertinent and dramatic sense of change are we not just making things harder? Why try to disguise the fact that Clojure is a development world away from Java? You can’t refactor or navigate around in the same way and the debugging just isn’t as good. Developing in a sophisticated IDE without 80% of the IDE’s usefulness would likely be incredibly frustrating and a turn-off. You’d be dressed up for the wrong gig.
Change also begets change. Moving to Clojure is a chance to review a lot of things; the VCS used, testing tools, approach to CI, artefact management, anything could be up for grabs. Better to make a clean break from the past.
Following on from 2, there is the issue of support. You could be ambivalent about devs on your team using Clojure with their favourite Java IDE, but then you must consider what the support cost will be. They may stumble across some half-hearted wiki page about getting a certain Clojure plugin working, but then would it work with Lein 2 and not just Lein 1? Is it using the deprecated Swank-Clojure or nREPL? The Clojure landscape is still shifting; newbies should probably be using the best lifeboats.
4) Simple and Easy
As with Clojure, learning Emacs from scratch isn’t exactly easy, but in the long run it’s simpler. Why would we suggest newbies stick with their current IDEs when there’s a danger that they might not move on afterwards? Clojure and Emacs are a package of simplicity; they go well together.
And we know that simplicity is not always free. There has to be some learning-pain before you can reap the rewards. If a developer is truly unwilling to switch to a lighter-weight, more extensible dev environment, then they may also be equally unwilling to make the paradigm shift to functional programming.
Of course the flip side to this is that it’s possible a developer may go to lengths to make coding even more of a pleasurable experience in the smart IDE landscape. If so, then hats off to them.
5) Mass Adoption
Trying to spread Clojure whilst recommending an IDE change sounds very counterintuitive, but then is Clojure itself ripe for mass adoption across the industry? Is this the end goal? These questions are beyond the scope of this post, but suffice to say that I’m not sure it would be a good thing if all people at the large institutions I’ve worked at started coding everything in Clojure. Emacs certainly adds to the learning divide, and makes it more explicit.
Please Vim warriors take note that these arguments could just as easily pertain to Vim.