All posts by Nick Lothian

Switched to FeedBurner

I've switched my feeds at my BadMagicNumber blog to be published via FeedBurner. There should be no disruption to your normal programming, although I think some aggregators will show a few old items as new.

I switched the feeds over using a simple Servlet Filter. If anyone wants to do the same, here's the code. This works for blojsom, but you might need to modify it slightly for your own setup.


import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FeedBurnerRedirectFilter implements Filter {
	private String redirectURL;

	public void init(FilterConfig config) throws ServletException {
		redirectURL = config.getInitParameter("redirectURL");
	}

	public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
						throws IOException, ServletException {
		// blojsom selects the feed format via the "flavor" request parameter
		String flavor = request.getParameter("flavor");

		if ("atom".equals(flavor) || "rss".equals(flavor)
				|| "rss2".equals(flavor) || "rdf".equals(flavor)) {
			HttpServletRequest httpRequest = (HttpServletRequest) request;
			HttpServletResponse httpResponse = (HttpServletResponse) response;

			// Redirect everyone except FeedBurner itself, which still needs to
			// fetch the original feed. (Requests with no User-Agent at all fall
			// through and are served the local feed.)
			String userAgent = httpRequest.getHeader("User-Agent");
			if (userAgent != null && userAgent.indexOf("FeedBurner") < 0) {
				httpResponse.sendRedirect(redirectURL);
				return;
			}
		}
		chain.doFilter(request, response);
	}

	public void destroy() {
		// nothing to clean up
	}
}
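
To put the filter to work you also need to register it in your webapp's web.xml, passing the FeedBurner address in as the redirectURL init parameter. Here's a minimal sketch; the filter name, URL pattern and feed address are placeholders you'd adapt to your own deployment (the flavor check above means only feed requests actually get redirected):

<filter>
	<filter-name>feedBurnerRedirect</filter-name>
	<filter-class>FeedBurnerRedirectFilter</filter-class>
	<init-param>
		<param-name>redirectURL</param-name>
		<param-value>http://feeds.feedburner.com/YourFeedName</param-value>
	</init-param>
</filter>
<filter-mapping>
	<filter-name>feedBurnerRedirect</filter-name>
	<url-pattern>/*</url-pattern>
</filter-mapping>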

Using XPath on real-world HTML documents

The article on TheServerSide about Web-Harvest reminded me of one of my favourite things: using XPath to extract data from HTML documents.

Obviously, XPath won't normally work on most real-world web pages, because they aren't valid XML. However, the magic of TagSoup gives you a SAX parser that will work on ugly HTML. You can then use XPath against that SAX stream.

Here's the magic invocation to make TagSoup, XOM and XPath all work together (XMLReader and XMLReaderFactory come from org.xml.sax; Builder, Document, Nodes and XPathContext from nu.xom):

// TagSoup's parser copes with HTML that isn't valid XML
XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
Builder bob = new Builder(tagsoup);
Document doc = bob.build(url);

which then allows you to do things like:

// TagSoup puts elements in the XHTML namespace, so bind an "html" prefix for XPath
XPathContext context = new XPathContext("html", "http://www.w3.org/1999/xhtml");
Nodes table = doc.query("//html:table[@class='forumline']", context);
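
query() returns a XOM Nodes list that you can walk directly. Purely for illustration, reusing the table variable from above (the forumline class is just whatever the target page happens to use):

// Print the text content of each matched table
for (int i = 0; i < table.size(); i++) {
	System.out.println(table.get(i).getValue());
}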

Cool, huh?

When all you have is a hammer

The biggest difference between 1.5.0 and 1.7.1 is that 1.7.1 uses two databases at once. In order to support fulltext matching, a new feature, we use a MyISAM database in MySQL with FULLTEXT keys (see http://dev.mysql.com/doc/refman/4.1/en/fulltext-search.html). However, InnoDB is still faster for JOIN's and offers referential integrity, so as a compromise we run two databases and keep them synchronized with MySQL replication (see http://dev.mysql.com/doc/refman/4.1/en/replication.html).

From the Connotea docs

And I'm back.

Yes, I'm back (and have been for a few weeks).

The wedding went well – and yes, thanks to a last-minute flight to London by my friend, it actually was a real wedding.

I had the worst trip home ever: I left Edinburgh at 9:00 am on Wednesday and arrived home at 7:30 pm on Friday. I am sick of airports, and of airplanes where the seat won't lock in position, so if you have it upright it slowly starts reclining, and if it's reclined it slowly starts going upright.

In unrelated news, I'm now (a) working Monday to Wednesday so Maya can go back to work on Thursdays and Fridays while Alex and I hang out, and (b) technical architect at work. I'm not quite sure how (a) will affect (b), but we'll see….

Away for two weeks

I'm off to the UK for two weeks for my friend's wedding (which is now a “commitment ceremony” due to various immigration issues).

I'll be checking my email, but don't expect well-thought-out responses….

For the interested, here's some stuff to read while I'm away:

Interactive TV: Conference and Best Paper

This paper describes mass personalization, a framework for combining mass media with a highly personalized Web-based experience. We introduce four applications for mass personalization: personalized content layers, ad hoc social communities, real-time popularity ratings and virtual media library services. Using the ambient audio originating from the television, the four applications are available with no more effort than simple television channel surfing. Our audio identification system does not use dedicated interactive TV hardware and does not compromise the user’s privacy. Feasibility tests of the proposed applications are provided both with controlled conversational interference and with “living-room” evaluations.

Detecting Spam Web Pages through Content Analysis

In this paper, we continue our investigations of “web spam”: the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).

I like this paper, because I used some very similar techniques in my de-spammed version of the Google Blog Search.

Develop AJAX applications in Java (just Java)

Google Web Toolkit (GWT) is a Java development framework that lets you escape the matrix of technologies that make writing AJAX applications so difficult and error prone. With GWT, you can develop and debug AJAX applications in the Java language using the Java development tools of your choice. When you deploy your application to production, the GWT compiler translates your Java application to browser-compliant JavaScript and HTML.

From Google Web Toolkit – product overview

Wow. And it looks like it actually works….
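
To give a feel for the programming model, here's a minimal entry point in the style of the GWT 1.x samples. The class name, button label and alert text are purely illustrative; only the EntryPoint/onModuleLoad wiring and widget classes come from GWT itself:

import com.google.gwt.core.client.EntryPoint;
import com.google.gwt.user.client.Window;
import com.google.gwt.user.client.ui.Button;
import com.google.gwt.user.client.ui.ClickListener;
import com.google.gwt.user.client.ui.RootPanel;
import com.google.gwt.user.client.ui.Widget;

// The GWT compiler turns this Java into plain JavaScript and HTML
public class HelloAjax implements EntryPoint {
	public void onModuleLoad() {
		Button button = new Button("Click me", new ClickListener() {
			public void onClick(Widget sender) {
				Window.alert("Hello, AJAX");
			}
		});
		RootPanel.get().add(button);
	}
}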

First Greasemonkey script: Mark All As Read in Google Reader

My quest to make Google Reader work the way I want it to continues.

One of the biggest complaints people have with Google Reader is that there is no way to mark all items as read. Having unread items makes people feel under pressure, which (I believe) is one of the things that river-of-news style aggregators are supposed to avoid. Unfortunately, not feeling the pressure of unread items in Google Reader at the moment is something of a Zen meditation exercise.

This Greasemonkey script solves that. To use it, go to the “Edit Subscription” page, then in the “More Actions” box you'll find a “Mark all as read” action. The first time you use it you may need to run it a couple of times to completely clear out your reading list (there doesn't seem to be a good way of finding out if the reading list display is empty or not).

Tested with Firefox 1.5.0.1 and Greasemonkey 0.6.4.

Download the script here.

Screenshot