Recently read books
I’ve read a bunch of books over the past couple of months. Here are some short reviews.
My previous post, “I’ve already tried the ‘waving a dead chicken over our servers’ trick”, attracted a bit of attention and quite a number of suggestions - thanks to all who contributed. The suggestions seemed to fall into four main categories:
We did have one stroke of good luck. We were able to predict when the site would stop working by monitoring the number of threads Apache was using, and we could use this information to preemptively restart the site. We were also able to modify the restart script to generate stack traces for all the JVM’s threads (kill -SIGQUIT <jvm pid>) before it restarted anything.
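For anyone wanting to try the same trick, here is a minimal Python sketch of the idea, not our actual restart script. The threshold value, the function names and the restart callback are all made up for illustration; the only part taken from the real setup is sending SIGQUIT to the JVM before restarting.

```python
import os
import signal

# Hypothetical threshold: the real thread count at which the site fell over
# is not given here, so pick your own.
THREAD_LIMIT = 200


def should_restart(apache_threads, limit=THREAD_LIMIT):
    """Return True once Apache's thread count reaches the limit."""
    return apache_threads >= limit


def dump_jvm_threads(jvm_pid, send_signal=os.kill):
    """Ask the JVM for a thread dump, like `kill -SIGQUIT <jvm pid>`.

    The JVM keeps running after SIGQUIT; HotSpot typically writes the
    stack traces to the process's standard output.
    """
    send_signal(jvm_pid, signal.SIGQUIT)


def preemptive_restart(apache_threads, jvm_pid, restart, send_signal=os.kill):
    """Dump stack traces and then restart, but only above the limit."""
    if not should_restart(apache_threads):
        return False
    dump_jvm_threads(jvm_pid, send_signal)  # capture evidence first
    restart()                               # caller-supplied restart action
    return True
```

The order matters: the dump has to be requested before the restart, otherwise the threads you wanted to look at are already gone.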
Since it looked like I’d actually have to start debugging this problem, I started looking through the stack traces and noticed that lots of the threads were in the ehcache filter. Now this wasn’t necessarily a bad thing, since every HTTP request would be passed through it. However, it did make debugging harder, was easy to remove (just comment it out in the web.xml) and did have some potential to be a source of problems - in particular the cache-invalidation part.
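Eyeballing a thread dump gets old quickly, so a rough count helps. The sketch below assumes the usual HotSpot layout, where each thread starts with a quoted name, its frames start with "at", and threads are separated by blank lines. The sample dump is invented for illustration (the class names in it are not real), and the function names are my own.

```python
def split_threads(dump):
    """Split a thread dump into one text block per thread.

    Assumes threads are separated by blank lines and that each thread's
    block starts with its quoted name.
    """
    blocks = [block.strip() for block in dump.split("\n\n")]
    return [block for block in blocks if block.startswith('"')]


def count_threads_in(dump, needle):
    """Count threads with at least one stack frame mentioning `needle`."""
    count = 0
    for block in split_threads(dump):
        frames = [line.strip() for line in block.splitlines()
                  if line.strip().startswith("at ")]
        if any(needle in frame for frame in frames):
            count += 1
    return count


# Invented sample data, only here to show the expected shape of the input.
SAMPLE_DUMP = '''"http-8080-1" daemon prio=1 tid=0x01 nid=0x1 waiting on condition
    at net.sf.ehcache.Example.get(Example.java:10)
    at com.example.cms.Servlet.service(Servlet.java:20)

"http-8080-2" daemon prio=1 tid=0x02 nid=0x2 runnable
    at com.example.cms.Servlet.service(Servlet.java:20)

"http-8080-3" daemon prio=1 tid=0x03 nid=0x3 waiting on condition
    at net.sf.ehcache.Example.get(Example.java:10)
'''

print(count_threads_in(SAMPLE_DUMP, "ehcache"), "of",
      len(split_threads(SAMPLE_DUMP)), "threads are in ehcache frames")
```

A high count on its own proves nothing, for the reason given above: a filter that sees every request will show up in most request threads. It is the comparison between a healthy dump and one taken just before a crash that is interesting.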
So we took a punt and removed the filter and… it fixed the problem. Yay! I’m a genius and all that.
Except… now the CMS is crashing with a NullPointerException deep in the data persistence layer. There’s also the small problem that I don’t have a clue why that change fixed it. Using the ehcache filter on its own worked fine, and there is no programmatic interaction between the ehcache and oscache code.
There is an alleged fix for the NullPointerException - but we have to take a point release of the CMS, and then patch it with a service pack to get it. Our previous experience with upgrades has been less than confidence-inspiring.
In the meantime we have a script watching the site and restarting it when it crashes. It’s kind of like failover, without the over bit.
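A watchdog like that does not need to be clever. Here is a minimal Python sketch of the general shape, assuming the site counts as "up" if a URL answers without an HTTP error. It is not the script we run; the function names, the timeout and the check-and-restart callbacks are all assumptions for illustration.

```python
import http.client
import time
import urllib.request


def site_is_up(url, timeout=10.0):
    """Return True if the URL answers with a non-error HTTP status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and socket timeouts are OSErrors.
        return False


def watch(check, restart, rounds, interval=0.0, sleep=time.sleep):
    """Run `check` `rounds` times and call `restart` after each failure.

    Returns the number of restarts triggered.
    """
    restarts = 0
    for _ in range(rounds):
        if not check():
            restart()
            restarts += 1
        sleep(interval)
    return restarts
```

In use, `check` would be something like `lambda: site_is_up(url)` with your own URL, and `restart` whatever command brings the site back. Keeping both as parameters means the loop can be tried out with fake checks before it is pointed at a live server.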