Updated: I’ve already tried the ‘waving a dead chicken over our servers’ trick

June 24, 2007 | java, tech | Nick Lothian

The title of this post says it all, really. But the long story follows… (This made TheServerSide. Unfortunately I can’t log in there – I think that’s our work firewall’s fault – so I’ll update here to give a few more details.)

So at work we use a Java-based CMS. It turns out that this particular CMS does tens of database queries for every page load. Surprisingly, this doesn’t scale well…

We’ve added our own caching layer(s) which have helped the speed some, but it’s still not reliable enough to be satisfactory.

In the past I’ve fixed a problem like this by using curl to make a static copy of the site and some mod_rewrite magic to redirect visitors. In this case that’s unlikely to work, because there is just enough dynamic content to make it more trouble than it is worth.

The obvious solution is a rewrite, but that isn’t going to happen, and I don’t want to be doing any more patching of the *&$!#@ CMS.

The only thing I can think of is to use Pound as a load balancer, with a second copy of the CMS taking over the content generation when the first one crashes and restarts. I think that will work, but it is kind of a band-aid solution and comes with a whole set of its own problems. For example, it doesn’t look like running two copies of the CMS off the same database will work, so we’ll need to replicate the database. Then we’ll need to make sure the content updates go to the correct CMS/database combination.. etc.. etc.

All in all I think we are up the proverbial creek. But if anyone has any ideas.. I’m all ears.

Update: I’m reluctant to name the CMS, but it isn’t Vignette. If you have a good reason to know, I’ll try to respond to emails. The database is Postgres 7.4. Load on the database isn’t a huge problem, but it looks to me like the number of round trips needed to build a page is (when I say “tens of queries”, it’s a lot more than 10 – more like 50).

We’ve implemented some custom caching using an EhCache filter and OSCache JSP fragment caching.

The specific problem isn’t performance – it’s stability. After some hours of running, the site just stops responding. We’re currently trying to figure out the exact cause of that via stack dumps, but with hundreds of threads it is a difficult process.

A front-end cache won’t work, because the CMS uses the ‘Vary’ header, which makes pages uncacheable. (Actually – I’m considering writing a filter to strip out that header so I can try using Squid.)
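
The header-stripping idea could look roughly like the sketch below. It is only an illustration: to stay runnable without a servlet container it uses a hypothetical `HeaderSink` interface in place of the real response object, and `RecordingSink` and `VaryStrippingSink` are made-up names. In a real filter the same check would probably sit in a response wrapper that overrides the header-setting methods. Dropping Vary is only safe if the pages really do not differ per request header, and that assumption needs checking if compression is in play.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VaryStrippingSketch {

    /** Hypothetical stand-in for the header-setting part of a servlet response. */
    interface HeaderSink {
        void addHeader(String name, String value);
    }

    /** Collects headers in memory so the sketch can run without a container. */
    static final class RecordingSink implements HeaderSink {
        final Map<String, List<String>> headers = new LinkedHashMap<>();

        @Override
        public void addHeader(String name, String value) {
            headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Passes every header through to the delegate except Vary. */
    static final class VaryStrippingSink implements HeaderSink {
        private final HeaderSink delegate;

        VaryStrippingSink(HeaderSink delegate) {
            this.delegate = delegate;
        }

        @Override
        public void addHeader(String name, String value) {
            if ("Vary".equalsIgnoreCase(name)) {
                return; // dropped, so a front-end cache is free to store the page
            }
            delegate.addHeader(name, value);
        }
    }

    public static void main(String[] args) {
        RecordingSink recorded = new RecordingSink();
        HeaderSink response = new VaryStrippingSink(recorded);
        response.addHeader("Content-Type", "text/html");
        response.addHeader("vary", "User-Agent"); // header names are case-insensitive
        System.out.println(recorded.headers.keySet());
    }
}
```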

The vendor doesn’t respond to support requests. Yes, we have a support contract, and yes, that is pretty bad.

23 thoughts on “Updated: I’ve already tried the ‘waving a dead chicken over our servers’ trick”

Ivan Ristic says:

June 25, 2007 at 1:37 pm

Hmmm, how about installing a caching reverse proxy (e.g. Apache) in front of your existing web server?
Nick Lothian says:

June 25, 2007 at 8:35 pm

Well… I’ve used Squid previously for a similar problem. It might help some for images etc., but unfortunately the CMS sends the “Vary” header, which makes its output uncacheable… even though we don’t use the functionality that header is used for.
Tim Vernum says:

June 25, 2007 at 9:47 pm

How complicated are the SQL queries?
The first option would be to tune them; “tens of queries” per page isn’t that bad. I’ve worked on apps that did that number of hits and performed well enough.

The more complicated alternative would be to make the caching more reliable. What is the main obstacle to that?
Nick Lothian says:

June 26, 2007 at 3:13 am

@Tim:

You are right – tens of queries isn’t too bad. But this is from the CMS, not our own code, and trying to tune the queries means patching the CMS (and trying to figure out how it works). It is probably the best long term option, but it’s a much bigger job than I’d like.

Re: the caching. On top of the (fairly ineffectual) CMS caching, we have an EhCache filter on the front end, and OSCache JSP cache tags in our templates. They work OK – but the problem is that our site has a LOT of (dynamically generated) pages, and while maybe 30% of our traffic is on a couple of pages, the rest is distributed across the site in a way that makes caching not as useful as it could be. On top of that there are the problems of cache invalidation, and explaining to content editors exactly why they aren’t seeing their changes, and that no – we can’t really tell you exactly when they will be up.
insac says:

June 26, 2007 at 6:25 am

A couple of questions:
– what’s the CMS?
– what’s the underlying RDBMS?
– have you done an analysis of the reasons it doesn’t scale well? (the number of queries alone might not be so important)

Just an example:
We had a problem with a Java-based CMS and an underlying Oracle database. The problem was that the CMS did not use bind variables: just this simple problem was preventing our site from being scalable. The workaround was to set the CURSOR_SHARING Oracle parameter to the value SIMILAR. The final solution was a patch from the CMS product team.

A similar problem was that another part of the CMS used bind variables but changed the name of the variables with every query execution, thus preventing Oracle from reusing the execution plans. This was also fixed by a patch from the CMS product team (in this case the CURSOR_SHARING parameter would be useless).
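
To make the bind-variable point concrete, here is a small sketch, not the CMS’s actual code. The table and column names (`content`, `id`, `title`) are invented for illustration. It counts how many distinct statement texts the database would have to parse under each style, and `findTitle` shows the usual JDBC `PreparedStatement` form, which needs a live `Connection` and is therefore not exercised by `main`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashSet;
import java.util.Set;

class BindVariableSketch {

    // Hypothetical table and column names, used only for illustration.
    static final String BOUND_SQL = "SELECT title FROM content WHERE id = ?";

    /** Anti-pattern: the literal is baked into the text, so every id is a new statement. */
    static String literalSql(int id) {
        return "SELECT title FROM content WHERE id = " + id;
    }

    /** Counts how many distinct statement texts the database would have to parse. */
    static int distinctTexts(int requests, boolean useBindVariable) {
        Set<String> texts = new HashSet<>();
        for (int id = 1; id <= requests; id++) {
            texts.add(useBindVariable ? BOUND_SQL : literalSql(id));
        }
        return texts.size();
    }

    /** With a bind variable the text stays constant and only the parameter changes. */
    static String findTitle(Connection connection, int id) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(BOUND_SQL)) {
            statement.setInt(1, id);
            try (ResultSet rows = statement.executeQuery()) {
                return rows.next() ? rows.getString("title") : null;
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("literal SQL texts: " + distinctTexts(1000, false));
        System.out.println("bound SQL texts:   " + distinctTexts(1000, true));
    }
}
```

One statement text per query shape is what lets the database reuse a parsed plan; a thousand different texts defeat that reuse.
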
Gregg Obst says:

June 26, 2007 at 7:04 am

Let me guess, you are battling with Vignette V7? If not, maybe shedding some light on which CMS it is might help those with specific experience with the same CMS to offer some suggestions.
Rob Di Marco says:

June 26, 2007 at 7:17 am

If the problem is the load on the database server, what about using clustered JDBC (http://c-jdbc.objectweb.org/) and a couple of additional DB servers? You don’t have to change anything in the application other than the JDBC connection string and you can scale the database horizontally.
matt mcknight says:

June 26, 2007 at 7:31 am

It’s hard to give good ideas here, because I don’t know what aspects of the CMS design you are free to change. It appears you have isolated DB access as the bottleneck, but even that isn’t clear as you appear to be considering load balancing in front of the application server. You didn’t explain where in the DB process the bottleneck is (assuming the machines aren’t CPU constrained, network issues, etc.) Overall, an architecture where you move closer to all static content is going to be better than the 10 queries a page average. Even if you make the queries quite fast, you are still paying a price for their quantity.

Here are some ideas-
-Change the CMS so that it publishes static html files for any non user specific content.
-Reduce the number of queries. Redesign pages, create materialized views, denormalize, whatever, just cut down on those queries.
-Use a load balanced database cluster (MySQL is cheap). Write data to each database, read from any. The load balancing ever so slightly increases the cost of each query, so you are best to reduce the number of queries first.
-In-memory database for reads. Replicate the on-disk database into an in-memory database (TimesTen or some such) that is used by all of the page queries.
-Move the database and the application onto the same server (but make that server a beast)
-Switch to a faster database or application server platform
tireetoo says:

June 26, 2007 at 8:04 am

You need to speed something up somewhere: at the front end (SIMPLE: caching static pages), in the middle (COMPLEX: caching dynamic pages, which requires an invalidation mechanism), or at the back end (SIMPLE: improve the performance of the data collection via fewer or more optimised SQL calls).

You have not said where your bottleneck is. CMS server CPU maxed out? DB server CPU maxed out? Network bandwidth maxed out? File I/O maxed out? Each of these has a different solution.
Tim Howland says:

June 26, 2007 at 8:16 am

I had to scale up an OpenCMS system (www.opencms.org) in a big hurry a few years back. I did it in several steps:

1) OpenCMS supports JSP Fragment Caching- it may be available in your CMS. This can help a lot.

2) Tons of MySQL tuning, particularly around the query cache- most CMS queries are pretty much the same thing over and over, so the query cache can be really helpful.

3) Identified key pages (the home page and other hot landing pages) and set up Apache mod_rewrite to pull from static copies of these pages. These static copies were refreshed by a script (using curl and rsync) once a minute, and were set not to overwrite the cached copy if a 500 error was found. This got us around the Squid “Vary” header issue.
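
The “don’t overwrite the cached copy on a 500” rule from step 3 can be sketched as follows. The original used a curl and rsync script; this Java version is only a stand-in for the same idea. `publish` and `demo` are hypothetical names, and the HTTP fetch itself is left out so that the example stays self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class StaticSnapshotSketch {

    /**
     * Replaces the static copy only when the fetch succeeded.
     * Returns true if the snapshot on disk was updated.
     */
    static boolean publish(Path snapshot, int httpStatus, byte[] body) throws IOException {
        if (httpStatus != 200 || body.length == 0) {
            return false; // keep serving the last good copy
        }
        // Write next to the target and then rename, so that readers are
        // unlikely to ever see a half-written page.
        Path temp = Files.createTempFile(snapshot.toAbsolutePath().getParent(), "snapshot", ".tmp");
        Files.write(temp, body);
        Files.move(temp, snapshot, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    /** Runs one good refresh followed by one failed refresh and returns what is left on disk. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("snapshot-demo");
            Path page = dir.resolve("index.html");
            publish(page, 200, "<html>good</html>".getBytes(StandardCharsets.UTF_8));
            publish(page, 500, "<html>error</html>".getBytes(StandardCharsets.UTF_8));
            return new String(Files.readAllBytes(page), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```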

An alternate approach would be to set up some slave read-only MySQL boxes, and replicate from the master to the slaves for your production web servers. You’d want to isolate your content management system to the server that was tied to your master DB, and then use replication to push the content to the slaves. This assumes that most site visitors are consuming content, not conducting transactions on the site (or you’ll need to manage two DB connections).

Good luck!
dmitry says:

June 26, 2007 at 9:31 am

hmm.. i don’t have a suggestion, but I am currently looking for a CMS for my company and actually considering a Java-based one. Could you tell me what you are running so that I don’t fall into the same manhole? If you aren’t comfortable publicly publishing the CMS’s name, could you send it to me privately via email?
thanks
Thomas says:

June 26, 2007 at 10:34 am

Look for ways to partition the data where the load is bad (db, cms, both, etc.). Not that it would be easy, but it should be somewhat possible to toss half the data on one server, half on the other (by age, department, users, application, etc.), and come up with some fast way to automatically send users where they need to go.

In your response about caching, you stated that you had caching for the CMS and the JSP pages. Maybe your caching is too coarse-grained? Would it be possible to cache the results of some of those tens of queries per page, while letting the rest of the JSP be dynamic? That could help in those cases where caching the entire page just doesn’t help much.
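
As a rough illustration of caching individual query results rather than whole pages, the sketch below keeps each result for a fixed time-to-live. Everything here is assumed rather than taken from the CMS: the class name, the `loader` function standing in for a real query, and the injectable `clock` (there only so the expiry logic can be tested without sleeping).

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

class QueryResultCacheSketch<K, V> {

    /** A cached value together with the time it was loaded. */
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;
    private final Function<K, V> loader;

    QueryResultCacheSketch(long ttlMillis, LongSupplier clock, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    /** Returns the cached result, reloading it once it is ttlMillis old or older. */
    V get(K key) {
        long now = clock.getAsLong();
        // compute() runs atomically per key, so concurrent callers do not
        // trigger duplicate loads. The loader must not call back into this cache.
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && now - old.loadedAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(loader.apply(k), now));
        return entry.value;
    }

    public static void main(String[] args) {
        QueryResultCacheSketch<String, String> cache = new QueryResultCacheSketch<>(
                60_000, System::currentTimeMillis, key -> "rows for " + key);
        System.out.println(cache.get("navigation-menu"));
        System.out.println(cache.get("navigation-menu")); // served from the cache
    }
}
```

A short TTL also gives content editors a concrete answer to “when will my change be up”: no later than the TTL.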

You might be able to find opportunities to cut the db out of the loop entirely as well. I believe a comment on theserverside.com suggested a fast index, like Lucene, to cache some objects when RAM isn’t big enough — which is another good way to spread load across servers.

Hard to give any specific advice without knowing the application a bunch more. :)
David Dossot says:

June 26, 2007 at 10:43 am

I am curious about these dynamically generated pages. What is the rationale for the dynamic generation? Search results? Application data that changes a lot?
Doug Lane says:

June 26, 2007 at 2:25 pm

If your content is dynamic enough and you’re using MySQL, then disable the query cache. It sounds counter-intuitive, but if the cache hit rate is below 25%, then this will definitely help. Caching carries some overhead with it.
David Dossot says:

June 26, 2007 at 5:25 pm

CMSs are not designed for serving highly dynamic content. They usually perform so much content access (whether it is in a DB or a JCR repository) to build a single page that they underperform if the page is not cached as a whole.

Very often, the dynamic parts are minimal: most of the page comes from authored content that is pretty static, time-wise.

There are different ways to “inject” bits of dynamic data in a static cached page, some ugly, some less.

You can have the CMS generate client-side scripting that will call back the server for the dynamic bits and display them in the right places on the page. This requires caching, either at client side with a cookie, or between the client and the server.

Another option is to use the CMS as a template provider and have your application use these templates to decorate its dynamic output with a framework like Sitemesh.
Nick Lothian says:

June 26, 2007 at 6:51 pm

@David Dossot: Yes – they are mainly search results.
insac says:

June 27, 2007 at 6:00 am

> The specific problem isn’t performance – it’s stability.

Just another couple of questions:
You say that “after some hours running the site just stops responding”.

Is there a “time pattern” in this behaviour?
A – it stops responding after X hours from the restart
B – it stops every day at about X o’clock
C – it stops every morning when the load begins to increase

Last question: do you have the option of increasing the number of DB connections available to the CMS? If you’ve already tried it, what effect did it have on performance and stability?
Tim Vernum says:

June 27, 2007 at 7:01 am

__Query tuning__
Depending on the queries and the database design you may be able to tune it by adding indexes without changing the queries.
Alternatively, are the queries pulled from config files? I had to tune a vendor system that did exactly that. I was able to tweak a few properties files and that sorted it out.

__Crazy solutions__
You could write your own JDBC driver and catch the inefficient/unnecessary queries and handle them yourself.
If the queries are moderately well written, then it should be feasible enough to work.

__Stability__
In my experience, lock ups tend to be concurrency/synchronization issues. The last one I ran into was a log4j issue (in an older version) where all the threads were locked up on the async buffer.
The thread dumps don’t sound like fun, but they sound like the best path to enlightenment.
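
With hundreds of threads, one way to make a dump digestible is to count threads by state and top stack frame, so that a pile-up on a single lock or call stands out. The sketch below does this for the JVM it runs in. It is an assumption-laden illustration (the class and method names are invented), and for an already hung server the same grouping would have to be applied to an externally captured dump instead.

```java
import java.util.Map;
import java.util.TreeMap;

class ThreadDumpSummarySketch {

    /** Groups threads by "STATE at Class.method" of their top frame and counts each group. */
    static Map<String, Integer> countByTopFrame(Map<Thread, StackTraceElement[]> dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            StackTraceElement[] frames = entry.getValue();
            String where = frames.length == 0
                    ? "<no frames>"
                    : frames[0].getClassName() + "." + frames[0].getMethodName();
            counts.merge(entry.getKey().getState() + " at " + where, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Snapshot of every live thread in this JVM, summarised one group per line.
        countByTopFrame(Thread.getAllStackTraces())
                .forEach((where, count) -> System.out.println(count + "\t" + where));
    }
}
```
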
Nick Lothian says:

June 27, 2007 at 4:02 pm

@insac: No pattern that we’ve been able to figure out.

@Tim: We’ve tried the index thing. It is actually possible that the default database setup has too many indexes – but as I said, database load doesn’t seem to be an issue.

I suspect a threading issue, too. I have some experience debugging that kind of problem, and it’s not something I want to do – especially when it isn’t our code. Hence the waving of dead chickens…
Brad says:

June 27, 2007 at 5:08 pm

“after some hours running the site just stops responding” – i had a very similar problem. It turned out to be thread locking due to use of the double-checked locking pattern. The story went something like this..

CMS site with some custom code for dynamic content. The dynamic stuff was a bit slow so i put in some caching. It was still a bit slow so i guessed i could make things faster by eliminating synchronization of the cache access – hence the double-checked locking. All went swimmingly, including up to 2 days of continuous heavy load testing on a single CPU server.

In case you’re not familiar with the double-checked problem, it’s when you write code like this..

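Object getCachedThing() {
    if (cached == null) {                 // first check, made without holding the lock
        synchronized (this) {
            if (cached == null) {         // second check, made under the lock
                cached = lookupFromDatabase();
            }
        }
    }
    return cached;                        // unsafe: 'cached' is read without synchronization
}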

There’s lots of literature on it but the bottom line is it can’t ever work reliably in a multi CPU environment. Note that in my case i was looking up via a key and storing in a hashmap.

Once in production (on clustered multi CPU machines) the site would gradually grind to a halt. For some reason, the double checked thing resulted in threads deadlocking – not at all an expected result even once i knew that double-checking was a no no. I thought i’d get null pointers. Looks like it’s something in HashMap which can make it lock up if accessed by multiple threads.
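
For the keyed-cache case described above, one alternative that avoids both hand-rolled double-checked locking and an unsynchronized HashMap is `ConcurrentHashMap.computeIfAbsent`. This is a sketch under assumptions, not the code from the story: it needs a JDK with Java 8 or later, and `KeyedCacheSketch`, `lookup` and `loadsUnderContention` are illustrative names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class KeyedCacheSketch {

    private final Map<String, Object> cache = new ConcurrentHashMap<>();
    private final Function<String, Object> lookup;

    KeyedCacheSketch(Function<String, Object> lookup) {
        this.lookup = lookup;
    }

    /** The map applies the lookup at most once per key, even under contention. */
    Object getCachedThing(String key) {
        return cache.computeIfAbsent(key, lookup);
    }

    /** Starts several threads asking for the same key and reports how often the lookup ran. */
    static int loadsUnderContention(int threads) {
        AtomicInteger loads = new AtomicInteger();
        KeyedCacheSketch cache = new KeyedCacheSketch(key -> {
            loads.incrementAndGet();
            return "value for " + key;
        });
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> cache.getCachedThing("home"));
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println("lookups performed: " + loadsUnderContention(8));
    }
}
```

The trade-off is that callers for a key being loaded wait for the load to finish, which is usually what a cache in front of a slow database wants anyway.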

Might not be your problem, but it sounds like you have the same symptoms i had.

Cheers
Geoff says:

June 28, 2007 at 11:28 am

What OS are you running on for both appserver and dbserver?
Geoff says:

June 28, 2007 at 11:35 am

Another question and possible answer: does the server actually crash or does it just stop accepting requests? If it’s the latter then one initial thought I’ve had is that it’s running out of available threads to process the requests, so it then appears to hang. If it crashes then all you need is somebody who can read dumps, right :-)
Nick Lothian says:

June 28, 2007 at 8:22 pm

@Brad: Yes, I’m aware of the double-checked locking pattern problems. That’s not the problem here, but I think the caching may have something to do with it.

@Geoff: Red Hat AS3. It runs out of threads and stops responding (or actually Apache runs out of threads). Increasing the thread count in Apache just delays the hang.

BadMagicNumber

My Blog, Take 4
