My previous post, “I’ve already tried the ‘waving a dead chicken over our servers’ trick”, attracted a bit of attention and quite a number of suggestions – thanks to all who contributed. The suggestions seemed to fall into four main categories:
- Database tuning.
- This is a good suggestion, and is something we’ve done a fair bit of. In this case it didn’t really help because the problem wasn’t performance but stability.
- Introduce a caching layer
- We’d already done this, twice. We initially used an ehcache caching filter to fix some pretty serious performance problems. We later added some OSCache JSP cache tags in some critical areas in some templates (it was the addition of OSCache which caused the performance boost seen in my post on monitoring performance using the Google Webmaster Tools). As it turned out this combination may have been what caused our problem.
- Rewrite everything
- Thanks. Let me know when you get a job in the real world.
- Debug the problem
- This is what I figured we’d have to do. It’s something I had been trying to avoid because the issue seemed to be threading-related, and we couldn’t reproduce it anywhere except our production environment.
We did have one stroke of good luck. We were able to predict when the site would stop working by monitoring the number of threads Apache was using, and we could use this information to preemptively restart the site. We were also able to modify the restart script to generate stack traces for all the JVM’s threads (kill -SIGQUIT <jvm pid>).
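For the curious, here is a rough Python sketch of that idea, not our actual restart script. It assumes the Apache thread count comes from the machine-readable mod_status page (the `BusyWorkers` line); the threshold, the restart command and the function names are all made up for illustration.

```python
import os
import signal
import subprocess
import time


def parse_busy_workers(status_text):
    """Extract the BusyWorkers count from mod_status '?auto' style output."""
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "BusyWorkers":
            return int(value.strip())
    raise ValueError("BusyWorkers not found in status output")


def dump_and_restart_if_needed(status_text, jvm_pid, threshold, restart_cmd,
                               send_signal=os.kill, run=subprocess.run,
                               settle_seconds=2.0):
    """Dump the JVM's threads and restart once Apache looks saturated.

    threshold and restart_cmd are hypothetical; pick values that match
    your own site. Returns True if a restart was triggered.
    """
    if parse_busy_workers(status_text) < threshold:
        return False
    # SIGQUIT asks the JVM to print every thread's stack trace to its
    # standard output; the JVM keeps running afterwards.
    send_signal(jvm_pid, signal.SIGQUIT)
    # Give the JVM a moment to finish writing the dump before restarting.
    time.sleep(settle_seconds)
    run(restart_cmd, check=True)
    return True
```

The signal and restart calls are passed in as parameters so the logic can be tried out without touching a real JVM.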
Since it looked like I’d actually have to start debugging this problem, I started looking through the stack traces and noticed that lots of the threads were in the ehcache filter. Now this wasn’t necessarily a bad thing, since all HTTP requests would be passed through it. However, it did make debugging harder, was easy to remove (just comment it out in the web.xml) and did have some potential to be a source of problems – in particular the cache-invalidation part.
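Eyeballing a dump works, but counting is quicker. The sketch below is an illustration rather than what I actually ran: it assumes a HotSpot-style dump where each thread section starts with a quoted thread name and frames look like `at package.Class.method(...)`, and the function name is invented.

```python
def count_threads_in_package(dump_text, package_prefix):
    """Count threads with at least one stack frame under package_prefix.

    Returns (matching_threads, total_threads).
    """
    total = 0
    matching = 0
    in_thread = False
    hit = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # A quoted thread name opens a new section; close the old one.
            if in_thread and hit:
                matching += 1
            total += 1
            in_thread = True
            hit = False
        elif in_thread and stripped.startswith("at " + package_prefix):
            hit = True
    # Account for the final thread in the dump.
    if in_thread and hit:
        matching += 1
    return matching, total
```

Calling it with a prefix such as `"net.sf.ehcache."` (assuming that is the package your ehcache version uses) gives a quick ratio of threads sitting in the filter to threads overall.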
So we took a punt and removed the filter and… it fixed the problem. Yay! I’m a genius and all that.
Except…. now the CMS is crashing with a NullPointerException deep in the data persistence layer. There’s also the small problem that I don’t have a clue why that change fixed it. Using the ehcache filter on its own worked fine, and there is no programmatic interaction between the ehcache and oscache code.
There is an alleged fix for the NullPointerException – but we have to take a point release of the CMS, and then patch it with a service pack to get it. Our previous experience with upgrades has been less than confidence-inspiring.
In the meantime we have a script watching the site and restarting it when it crashes. It’s kind of like failover, without the over bit.
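A watcher like that can be very small. This is a minimal Python sketch of the general shape, not our script: the probe URL, the failure count, the polling interval and the restart hook are all assumptions you would swap for your own.

```python
import time
import urllib.request


def site_is_up(url, timeout=10):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def watch(probe, restart, max_failures=3, interval=30,
          sleep=time.sleep, max_checks=None):
    """Call restart() after max_failures consecutive failed probes.

    Runs forever unless max_checks is given. Returns the restart count.
    """
    failures = 0
    restarts = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if probe():
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0
        sleep(interval)
    return restarts
```

Requiring a few consecutive failures before restarting avoids bouncing the site over a single slow response; something like `watch(lambda: site_is_up(url), restart_site)` would wire it together, with `url` and `restart_site` being whatever fits your setup.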