The Wikimaps map warper – a free georeferencing georectification tool for historical maps and images hosted on Wikimedia Commons had some issues recently. Here’s a short document describing some diagnostics to identify the problems.
The wikimaps warper runs on the Wikimedia Labs infrastructure – essentially an OpenStack cluster of servers. The Labs is run by some great admins and dev ops and they are on the ball when it comes to rolling out needed updates to al the servers and instances. However each instance is generally admin’ed by people within individual projects. It was after they applied a kernel update due to a security patch and restarted all the machines that the warper was seen to be offline. The ultimate issue was that there was something wrong with the NFS – as many other instances and tools were having problems – I’m not going to blame the NFS because it showed up where the warper application was encountering issues.
The problem
The application is a ruby on rails application which runs using Apache and Passenger. It was timing out, it wouldn’t start. I ssh’d into the server. The passenger version was an old one (via a package, rather than a gem) and didn’t support increasing the timeout version. So I removed that, and installed a new version. The library would also be running using a faster version of Ruby to get running. Did all that, but the warper wouldn’t start.
I ssh’ed into the server and couldn’t see any errors anywhere. But running a console command, or a “rake -T” which would load up the Rails environment, took 11 minutes. ELEVEN MINUTES!
Eleven Minutes vs 22 Seconds – Diagnostics
I use a little old atom cpu, 2gig memory netbook, and it takes less than 22 seconds to load up there. Enter Strace and ruby-prof. (Note that the screenshots and reports following are based on me getting the loading time down to something workable, but there still work to do)
strace is a linux command line tool which traces system calls and signals. It’s amazing for debugging things. I’ve hardly ever used it. I ran strace
strace rake -T
and watched for around 5 minutes the Rails loading process. I stopped it as it wasn’t telling me much and I wasn’t prepared to wait 11 minutes! But what it did was show me something like the following – the inefficient way Rails loads up all the libraries, the gems:
strace -e trace=open,stat,read,write bin/rake -T
to trace only certain file related calls: here is an example of what it looks like on my netbook
It shows, firstly that its doing lots of file calls, and secondly that its getting lots of misses. Not very efficient but also, totally normal and not revealing.
strace with summary report is what I wanted.
strace -o strace.txt -c bin/rake -T
gave something better:
You can see in this case that it’s taking around 5 seconds with open calls (the errors are the expected misses, I think). Compare this from my netbook
5 seconds vs 0.04 seconds. But what files was it opening, what was Ruby up to? Enter ruby-prof
Ruby-Prof
ruby-prof -s total -p flat bin/rake -- -T > ruby_prof_flat.txt
ruby prof is more of a profiling tool. I’m using the command line utility, but its most often used in code, for example around a block or class. Check it out.
The ruby-prof can output its data in a few different formats or profiles. I tried the more colourful ones, which didn’t tell me much:
graph:
this is somewhat useful, but the document was massive and hard to compare or see where the bottlenecks were. You can see that “require” -the loading of the libraries was troublesome though.
colourful stack – not sure what this told me more but it looked pretty.
Before finally choosing the default option – flat:
here it is on the server
and compare that with my crappy netbook
So it gave a much better diagnostic. It’s slowing right down reading in the libraries, the gems. Now, on both machines, I am using rvm and both are loading from a home directory. But on the server, the home directory was on NFS, not a more local partition!
I copied the rvm directory to a local partition and symlinked things and now the application can start much quicker.
There’s probably more I can do to optimise this – I believe it’s still hitting the NFS in a few places, but it’s given me a good pointer in the right direction to making the wikimaps warper run better and stronger for the future.