The article on The Server Side about Web-Harvest reminded me of one of my favourite things: using XPath to extract data from HTML documents.
Obviously, XPath won't normally work on most real-world webpages, because they aren't valid XML. However, the magic of TagSoup gives you a SAX parser that will work on ugly HTML. You can then use XPath against that SAX stream.
Here's the magic invocation to make TagSoup, XOM and XPath all work together:
import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// TagSoup's parser is a SAX XMLReader that tolerates messy HTML
XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
tagsoup.setFeature("http://xml.org/sax/features/namespace-prefixes", true);

// Hand the TagSoup reader to XOM's Builder and parse the page
Builder bob = new Builder(tagsoup);
Document doc = bob.build(url);
which then allows you to do things like:
XPathContext context = new XPathContext("html", "http://www.w3.org/1999/xhtml");
Nodes table = doc.query("//html:table[@class='forumline']", context);
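And just to round it off, here's a tiny sketch of what you might do with the result (the loop is purely illustrative; Nodes, get() and toXML() are all plain XOM):

// Walk the matched tables and dump each one back out as XML
for (int i = 0; i < table.size(); i++) {
    System.out.println(table.get(i).toXML());
}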
Cool, huh?
Hi,
Can you tell me if there's a possibility to pair up TagSoup and Rome?
Yes, for sure. You could use TagSoup to cleanse the HTML data in RSS/Atom feeds. I don't know if anyone has done it, though.
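As a rough sketch of what that pairing might look like (this is just an idea, not something I've shipped; it assumes the classic Rome SyndFeedInput/SyndFeed API and the placeholder feed URL is made up):

import java.io.StringReader;
import java.net.URL;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import nu.xom.Builder;
import nu.xom.Document;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Let Rome parse the feed itself...
SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL("http://example.com/feed.rss")));

// ...and let TagSoup turn each entry's (possibly broken) HTML description into well-formed XML
XMLReader tagsoup = XMLReaderFactory.createXMLReader("org.ccil.cowan.tagsoup.Parser");
Builder builder = new Builder(tagsoup);
for (Object o : feed.getEntries()) {
    SyndEntry entry = (SyndEntry) o;
    if (entry.getDescription() != null) {
        Document clean = builder.build(new StringReader(entry.getDescription().getValue()));
        // clean is now XPath-queryable, exactly like the page in the post
    }
}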
That's it; so far I've seen no one who has, and my own attempts have failed :)
This is great stuff.
I had the problem of unquoted HTML attributes (such as 'href=http://google.com'), which screwed up my initial approaches using SAX/JDOM/….
I had to figure out the imports (Maven dependencies), which were 'tagsoup' and 'xom'.
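For anyone else hunting for them, the coordinates are roughly as below (the version numbers are just the ones I happened to use; check for newer releases):

<dependency>
    <groupId>org.ccil.cowan.tagsoup</groupId>
    <artifactId>tagsoup</artifactId>
    <version>1.2.1</version>
</dependency>
<dependency>
    <groupId>xom</groupId>
    <artifactId>xom</artifactId>
    <version>1.2.5</version>
</dependency>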
Note:
The line
XPathContext context = new XPathContext("html", "http://www.w3.org/1999/xhtml")
causes all HTML tags in your document to be XPath-addressable only via the html prefix, since TagSoup puts every element into the XHTML namespace.
Meaning: in order to get all HTML links with images inside them, you have to use
doc.query("//html:a[.//html:img]", context); // note the prefix html (and the .// so the predicate only matches images inside the link)
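Once you have the matched nodes, getting at e.g. the href attributes is straightforward; a quick sketch (assuming nu.xom.Element and nu.xom.Nodes are imported):

Nodes links = doc.query("//html:a[.//html:img]", context);
for (int i = 0; i < links.size(); i++) {
    Element link = (Element) links.get(i);
    System.out.println(link.getAttributeValue("href"));
}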
best regards