Category Archives: java

A pragmatic approach to Google AppEngine

I’ve been working on a large (Java) AppEngine project since January 2010. I recently left that job, but the project hasn’t finished and unfortunately I can’t talk about it yet.

During that time I learnt a lot of tricks and techniques for dealing with AppEngine’s idiosyncrasies, which have been useful for building a contextual advertising demo system: Qontex.com (brief synopsis: contextual affiliate ad distribution software. Not too sure what I’m going to do with it, but I had fun building it. The front end container is actually WordPress(!), but the UI is GWT and the backend is AppEngine).

Anyway, it seems useful to share a few things I’ve learnt.

1) Be pragmatic

I think of AppEngine as Amazon S3 plus some intelligence, rather than Amazon EC2 minus features. I find that a lot less frustrating.

If there is something you need that AppEngine doesn’t do well, don’t try to force it. Full Text Search is a great example: it’s horrible trying to get it to work on AppEngine, but installing Solr on a VM somewhere (or using a cloud Solr provider) is trivial.

2) AppEngine is a platform optimized for a specific type of application.

Don’t think of AppEngine as a standard Java application stack in the cloud. From the documentation:

While a request can take as long as 30 seconds to respond, App Engine is optimized for applications with short-lived requests, typically those that take a few hundred milliseconds. An efficient app responds quickly for the majority of requests. An app that doesn’t will not scale well with App Engine’s infrastructure.

Think about that for a while, and understand it well. Often Java developers are used to building corporate web apps where functionality is slowly built up over time. All too often a single HTTP request will have 4 or 5 database queries in it, and that is regarded as normal. That won’t work in AppEngine.

When you are working with AppEngine you’ll be thinking about performance continually, and differently from how you would with a normal Java application.

3) The datastore is dangerous.

In the development environment it has similar performance characteristics to a traditional database. In production it is slow at best and unpredictable at worst. If you come from an enterprise Java background, think of it as an integration server for a legacy API you are integrating with: data inside it isn’t going to go missing, but you should expect your connection to it to break at any point. You need to isolate your users from it, protect your application from it, and consider carefully how to protect your data from outages.

I usually assume that a datastore query is going to take 200ms. Lately it has usually been better than that, but the variation is still a problem: http://code.google.com/status/appengine/detail/datastore/2010/11/23#ae-trust-detail-datastore-query-latency
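
One way to protect the request path from that variability is to wrap datastore calls in something that retries (or degrades gracefully) on timeout. This is a minimal sketch, not production code – the helper name and the three-attempt limit are arbitrary choices for illustration:

import java.util.concurrent.Callable;
import com.google.appengine.api.datastore.DatastoreTimeoutException;

// Hypothetical helper: retry a datastore operation a few times on timeout
// rather than letting one slow query blow out the whole request.
public static <T> T withDatastoreRetry(Callable<T> operation) throws Exception {
	int attempts = 0;
	while (true) {
		try {
			return operation.call();
		} catch (DatastoreTimeoutException e) {
			// three attempts is an arbitrary choice for illustration
			if (++attempts >= 3) {
				throw e;
			}
		}
	}
}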

4) Memcache is useful, but no silver bullet.

Memcache is useful because it has much more predictable performance characteristics than the datastore – and it’s a lot faster too. Generally, it’s pretty safe to rely on Memcache responding in less than 20ms at worst. At the moment its responses are around 5-10ms. See the Memcache status page for details: http://code.google.com/status/appengine/detail/memcache/2010/11/23#ae-trust-detail-memcache-get-latency

A Useful Pattern

One pattern I’ve found useful is to think of user-facing servlets as similar to the UI thread in a GUI application. Blocking should be kept to a minimum, and anything that’s going to take significant time is done from task queues. This includes anything beyond a single “GET” on the datastore (note that a GET operation is very roughly twice as fast as a datastore query).

For example, Qontex has a process that relies on content analysis. I currently do that on-demand rather than attempting to spider the entire internet. The demo “Ad Explorer” front end is written in GWT, and it works like this (a rough sketch of the server side follows these steps):

1) Send a request to the analyze URL, passing the name of a callback function (for JSONP callback)

2) The backend checks Memcache for data about the URL. If it isn’t there, it fires an AppEngine task queue request to analyze the URL and returns a JSONP response that contains a status_incomplete flag and a wait_seconds parameter.

3) The GWT client gets the response, and sets a timer to re-request in wait_seconds seconds.

4) Meanwhile, back on the server the task queue task is being processed. That task will load the results into memcache.

5) The client re-requests the analyze URL, and this time Memcache has been loaded so the servlet can build a response with the correct data.
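
Here is roughly what the servlet side of that looks like. This isn’t the actual Qontex code – the URL names, cache key and task parameters are invented for illustration, and depending on your SDK version the task queue classes may still live in com.google.appengine.api.labs.taskqueue – but the shape is the same:

import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import com.google.appengine.api.memcache.MemcacheService;
import com.google.appengine.api.memcache.MemcacheServiceFactory;
import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskOptions;

public class AnalyzeServlet extends HttpServlet {

	@Override
	protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
		String url = req.getParameter("url");
		String callback = req.getParameter("callback"); // JSONP callback function name

		MemcacheService memcache = MemcacheServiceFactory.getMemcacheService();
		// Assumes the analysis task stores a ready-to-serve JSON string under this key
		String analysis = (String) memcache.get("analysis:" + url);

		resp.setContentType("application/javascript");
		if (analysis == null) {
			// Not analyzed yet: kick off a task queue task and tell the client to poll again
			Queue queue = QueueFactory.getDefaultQueue();
			queue.add(TaskOptions.Builder.withUrl("/tasks/analyze").param("url", url));
			resp.getWriter().write(callback + "({\"status_incomplete\": true, \"wait_seconds\": 5})");
		} else {
			// The task has already populated memcache, so respond with the real data
			resp.getWriter().write(callback + "(" + analysis + ")");
		}
	}
}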

I use a similar, but simpler pattern to write to the datastore.

When an ad is served, or when a user clicks an ad, I fire a task-queue request to record it, which lets me send the response much more quickly. AppStats is great for showing this graphically:

As you can see there, it would be sensible to batch all those memcache reads into a single read on a composite object. At the same time, the entire servlet responds in 37ms, which isn’t too bad, and some of those memcache calls are conditional – but the point is that AppStats gives great visibility into exactly how your application is performing.
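
The write side is just the enqueue half of the same idea. Again a sketch only – the handler URL and parameter names are made up, not the real Qontex code:

// Inside the ad-serving servlet: record the click/impression asynchronously so the
// user-facing response isn't blocked on a datastore write. The /tasks/record-click
// handler and its parameters are invented for illustration.
String adId = req.getParameter("adId");
Queue queue = QueueFactory.getDefaultQueue();
queue.add(TaskOptions.Builder
		.withUrl("/tasks/record-click")
		.param("adId", adId)
		.param("servedAt", Long.toString(System.currentTimeMillis())));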

Solr+Cassandra

I’ve been a big fan of Solr for quite a long time, and have used it extensively at work.

I noticed a few weeks ago that Jake Luciani had managed to get Lucene (which Solr uses) working on Cassandra (the highly scalable key-value store originally developed at Facebook).

The next step had an obvious name: Solandra – Solr running on Cassandra.

Basically there wasn’t too much to getting it going in the limited form it’s in now – a few minor changes to Jake’s Lucandra code, a custom Solr FieldType (I’m not entirely sure why this was needed) and correctly configured solrconfig.xml and schema.xml files.

I haven’t tested updates, so you’ll probably need Jake’s BookmarkDemo to load data in.

My changes to the Lucandra index reader include hard coding (!) the fields returned by getFieldNames(..) to match the Solr schema and the fields added in the demo.

If anyone is interested, the code is available: solandra.zip. You’ll need to be a Java developer to use it, though.

The AppEngine is forking Java “controversy”

So there has been some noise from Sun about how Google AppEngine is evil because it’s not supporting the complete set of classes in the JRE. I’m sorry Sun – I’m a Java programmer, and I think that argument is shit. I’d much prefer a partial Java implementation with well defined limitations than PHP, or Python or Ruby.

AFAIK, no one has posted a list of the missing classes. I can’t be bothered doing that either, but I did take a manual look at the package level. Here’s what it looks like GAE/J is missing:

java.applet
java.awt.*
javax.activation
javax.imageio.*
javax.jws.*
javax.management.*
javax.naming.*
javax.net.*
javax.print.*
javax.rmi.*
javax.sound.*
javax.swing.*
javax.tools
javax.xml.bind.*
javax.xml.crypto.*
javax.xml.soap
javax.xml.stream.*
javax.xml.ws
org.ietf.jgss
org.omg.*

From that list, I’d like to see javax.activation, javax.management and the remaining javax.xml.* and maybe javax.tools packages supported. The rest really don’t seem at all relevant to the AppEngine environment.

Random MP3 metadata code

I’ve been doing random MP3 metadata work lately. Here’s some code which others might find useful.

Extracting MP3 tags from an MP3 file hosted on a server using HTTP Range queries.

So I was using Apache Tika for various metadata stuff. I wanted to get the song title for a file hosted on a server, but Tika only supports MP3 ID3v1 metadata, which lives at the end of the file. Downloading an entire MP3 just for the title is wasteful, but fortunately HTTP Range queries can help us out.

HttpClient httpClient = new HttpClient();
httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(10000);
httpClient.getHttpConnectionManager().getParams().setSoTimeout(10000);

String address = "http://address of mp3 file here";

// Do a HEAD request first to get the content length and check whether the server supports Range requests
HttpMethod method = new HeadMethod();
method.setURI(new URI(address,true));

Header contentLengthHeader = null;
Header acceptHeader = null;

httpClient.executeMethod(method);
try {
	//System.out.println(Arrays.toString(method.getResponseHeaders()));
	contentLengthHeader = method.getResponseHeader("Content-Length");
	acceptHeader = method.getResponseHeader("Accept-Ranges");
} finally {
	method.releaseConnection();
}

if ((contentLengthHeader != null) && (acceptHeader != null) && "bytes".equals(acceptHeader.getValue())) {
	long contentLength = Long.parseLong(contentLengthHeader.getValue());
	// The ID3v1 tag lives in the final 128 bytes of the file
	long metaDataStartRange = contentLength - 128;
	if (metaDataStartRange > 0) {
		method = new GetMethod();
		method.setURI(new URI(address,true));
		// Byte ranges are inclusive, so the last byte is at contentLength - 1
		method.addRequestHeader("Range", "bytes=" + metaDataStartRange + "-" + (contentLength - 1));
		System.out.println(Arrays.toString(method.getRequestHeaders()));
		httpClient.executeMethod(method);
		try {
			Parser parser = new AutoDetectParser();

			Metadata metadata = new Metadata();
			metadata.set(Metadata.RESOURCE_NAME_KEY, address);
			InputStream stream = method.getResponseBodyAsStream();
			try {
				parser.parse(stream, new DefaultHandler(), metadata);
			} catch (Exception e) {
				e.printStackTrace();
			} finally {
				stream.close();
			}
			System.out.println(Arrays.toString(metadata.names()));
			System.out.println("Title: " + metadata.get("title"));
			System.out.println("Author: " + metadata.get("Author"));
		} finally {
			method.releaseConnection();
		}
	}
} else {
	System.err.println("Range not supported. Headers were: ");
	System.err.println(Arrays.toString(method.getResponseHeaders()));
}

The next thing I needed to do was extract song titles from a Shoutcast stream. Shoutcast streams are kinda-but-not-quite HTTP. Metadata is embedded in the stream (not as part of the MP3). That makes the code pretty ugly, but whatever… This code will open a connection, read the metadata and close, so you don’t need to keep downloading gigs of data.

URL url = new URL("http://scfire-ntc-aa01.stream.aol.com:80/stream/1074");
URLConnection con = url.openConnection();
con.setRequestProperty("Icy-MetaData", "1");

InputStream stream = con.getInputStream();
try {

	// Read the headers byte-by-byte straight from the stream; a BufferedReader
	// would read ahead and consume bytes that the metadata handling below needs.
	String metaIntervalString = null;
	StringBuilder headers = new StringBuilder();
	int c;
	while ((c = stream.read()) != -1) {
		headers.append((char) c);
		if (headers.length() > 5 && headers.substring(headers.length() - 4).equals("\r\n\r\n")) {
			// end of headers
			break;
		}
	}

	//System.out.println(headers);
	// headers look like this:
	//		ICY 200 OK
	//		icy-notice1: This stream requires Winamp
	//		icy-notice2: Firehose Ultravox/SHOUTcast Relay Server/Linux v2.6.0
	//		icy-name: .977 The 80s Channel
	//		icy-genre: 80s Pop Rock
	//		icy-url: http://www.977music.com
	//		content-type: audio/mpeg
	//		icy-pub: 1
	//		icy-metaint: 16384
	//		icy-br: 128

	Pattern p = Pattern.compile("\\r\\n(icy-metaint):\\s*(.*)\\r\\n");
	Matcher m = p.matcher(headers.toString());
	if (m.find()) {
		metaIntervalString = m.group(2);
	}

	if (metaIntervalString != null) {
		int metaInterval = Integer.parseInt(metaIntervalString.trim());
		if (metaInterval > 0) {
			int b;
			int count = 0;
			int metaDataLength = 4080; // 4080 is the max length
			boolean inData = false;
			StringBuilder metaData = new StringBuilder();
			while ((b = stream.read()) != -1) {
				count++;
				if (count == metaInterval + 1) {
					metaDataLength = b * 16;
				}
				if (count > metaInterval + 1 && count < (metaInterval + metaDataLength)) {
					inData = true;
				} else {
					inData = false;
				}
				if (inData) {
					if (b != 0) {
						metaData.append((char) b);
					}
				}
				if (count > (metaInterval + metaDataLength)) {
					break;
				}
			}
			String metaDataString = metaData.toString();
			System.out.println(metaDataString);
		}
	}
} finally {
	stream.close();
}

ROME 1.0RC2 Release

I’ve just pushed out a release of ROME core, ROME Fetcher and ROME modules.

For those who don’t know, ROME is a (the?) Java library for handling RSS and Atom. Unlike some other libraries it is pretty stable (18 months since the last release) and has a low number of dependencies (just one – JDOM – if all you need is parsing).

The announcement, including links, is at https://rome.dev.java.net/servlets/ReadMsg?list=dev&msgNo=2656

The thing I’m most pleased about (and the number one source of complaints about ROME) is that I’ve pushed it to the java.net Maven repository, so now it will be easier to use from Maven. Further details are at http://wiki.java.net/bin/view/Javawsxml/RomeAndMaven2
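
For anyone who hasn’t used ROME, parsing a feed takes only a few lines. A minimal sketch – the feed URL is just a placeholder:

import java.net.URL;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;

public class FeedExample {
	public static void main(String[] args) throws Exception {
		// Any RSS or Atom feed URL will do; this one is just an example
		URL feedUrl = new URL("http://example.com/feed.xml");
		SyndFeedInput input = new SyndFeedInput();
		SyndFeed feed = input.build(new XmlReader(feedUrl));
		System.out.println(feed.getTitle());
	}
}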

Installing Java on RedHat Linux by building your own RPM

It’s pretty easy to install Java on Linux – download the RPM from Sun and install it. Then if you run “java -version” you’ll suddenly discover that it doesn’t really work:

java version "1.4.2"
gij (GNU libgcj) version 4.1.2 20070626 (Red Hat 4.1.2-14)

You can get around that by setting your path and JAVA_HOME, or by only using Java versions that have a matching JPackage RPM and using the alternatives command.

If you want to be able to build your own RPM, here’s how to do it. Start with the following JPackage yum repository definition:

# Be sure to enable the distro specific repository for your distro below:
# - jpackage-fc for Fedora Core
# - jpackage-rhel for Red Hat Enterprise Linux and derivatives

[jpackage-generic]
name=JPackage (free), generic
mirrorlist=http://www.jpackage.org/mirrorlist.php?dist=generic&type=free&release=1.7
failovermethod=priority
gpgcheck=1
gpgkey=http://www.jpackage.org/jpackage.asc
enabled=1

[jpackage-fc]
name=JPackage (free) for Fedora Core $releasever
mirrorlist=http://www.jpackage.org/mirrorlist.php?dist=fedora-$releasever&type=free&release=1.7
failovermethod=priority
gpgcheck=1
gpgkey=http://www.jpackage.org/jpackage.asc
enabled=0

[jpackage-rhel]
name=JPackage (free) for Red Hat Enterprise Linux $releasever
mirrorlist=http://www.jpackage.org/mirrorlist.php?dist=rhel-$releasever&type=free&release=1.7
failovermethod=priority
gpgcheck=1
gpgkey=http://www.jpackage.org/jpackage.asc
enabled=0

[jpackage-generic-nonfree]
name=JPackage (non-free), generic
mirrorlist=http://www.jpackage.org/jpackage_generic_nonfree_1.7.txt
failovermethod=priority
gpgcheck=1
gpgkey=http://www.jpackage.org/jpackage.asc
enabled=1
  • Become root
  • Copy this file to /etc/yum.repos.d. Edit it, and make sure that enabled=1 is set for the [jpackage-generic-nonfree] section.
  • Make directories required by the RPM process (I suspect you can do this outside the /usr/src directory, though):  
mkdir -p /usr/src/redhat/SOURCES  
mkdir -p /usr/src/redhat/RPMS/i586/
  • Copy the Java installation file you previously downloaded to /usr/src/redhat/SOURCES and make it executable (chmod +x <name of file>)
  • Install the tools you need to build an RPM: yum install yum-utils jpackage-utils rpm-build (at the moment this seems to fail on 64-bit machines because of missing dependencies)
  • cd /usr/src/redhat/SOURCES
  • yumdownloader --source java-1.6.0-sun
  • At the moment, that will download a file called java-1.6.0-sun-1.6.0.10-1jpp.nosrc.rpm
  • Run setarch i586 rpmbuild --rebuild java-1.6.0-sun*nosrc.rpm. At the moment that gives an error message, which can apparently be ignored:
sh: /usr/src/redhat/SOURCES/jdk-6u10-linux-i586.bin: No such file or directory
error: Bad exit status from /var/tmp/rpm-tmp.6041 (%prep)
RPM build errors:
    user jasonc does not exist - using root
    group jasonc does not exist - using root
    user jasonc does not exist - using root
    group jasonc does not exist - using root
    user jasonc does not exist - using root
    group jasonc does not exist - using root
    Bad exit status from /var/tmp/rpm-tmp.6041 (%prep)
  • That previous command extracted an RPM SPEC file into the /usr/src/redhat/SPECS/ directory.
  • Edit /usr/src/redhat/SPECS/java-1.6.0-sun.spec. Find the part that says %define buildver and change the value to the build number of the new version of Java.
  • Run rpmbuild -ba /usr/src/redhat/SPECS/java-1.6.0-sun.spec. This extracts the JDK installer you previously downloaded and builds a set of RPMs from it.
  • cd /usr/src/redhat/RPMS/i586; ls;

java-1.6.0-sun-1.6.0.11-1jpp.i586.rpm        java-1.6.0-sun-fonts-1.6.0.11-1jpp.i586.rpm
java-1.6.0-sun-alsa-1.6.0.11-1jpp.i586.rpm   java-1.6.0-sun-jdbc-1.6.0.11-1jpp.i586.rpm
java-1.6.0-sun-demo-1.6.0.11-1jpp.i586.rpm   java-1.6.0-sun-plugin-1.6.0.11-1jpp.i586.rpm
java-1.6.0-sun-devel-1.6.0.11-1jpp.i586.rpm  java-1.6.0-sun-src-1.6.0.11-1jpp.i586.rpm
  • You can now install the RPM: rpm -i java-1.6.0-sun-1.6.0.11-1jpp.i586.rpm
  • For me that failed with a missing X dependency: libXtst.so.6 is needed by java-1.6.0-sun-1.6.0.11-1jpp.i586
  • I fixed that with yum -y install libX11-devel libXtst.
  • Use the alternatives command to set the correct version of Java: alternatives --config java
  • Finally: java -version:

java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) Client VM (build 11.0-b16, mixed mode, sharing)
That’s it – you finally have Java working on Linux! You also have an RPM which can be installed on other machines.

My Google Interview

Well, it’s just under 5 months since I promised a post about my Google interview. Being the highly active blogger that I am, I’d say it’s about time…

Back in early April I got an email from a recruiter at the Google European Recruiting Centre. I was a little puzzled how she got my name, but a bit of detective work revealed that a guy I went to school with works for Google in the London office and had recommended me.

After a phone conversation with the recruiter, we decided that a position in London would suit me best (I live in Australia, and have a wife and a then-2-year-old child, so this was kind of a big deal), and I was handed over to a recruiter from London.

Google’s London office mostly does mobile applications, and I wanted to do web apps, but I was assured this wouldn’t be a problem because they had positions like that available too.

So the wheels were set in motion to set up the dreaded Google phone screen, and I started studying. It’s hard to explain how difficult I found that. Not only did I have to remember a whole lot of half-remembered computer science from ten years ago, but I also had to remember the maths that went with it, which was much harder to dig up. Things like big-O notation are easy enough conceptually, but when you are actually analyzing an algorithm you need to remember the maths for dealing with logarithms (for example), which isn’t something I’ve thought about much since doing Computer Science 4 back in 1997…

Interestingly, every recruiter I spoke to referred me to Steve Yegge’s “Get that Job at Google” post which I found ironic considering the “I don’t speak for Google” disclaimers he uses.

Anyway, in early May I did the phone interview with a programmer from London. While I’m not going to go into specific questions, it did involve writing some code (in Google Docs) and some high-ish level problem solving. I found that having read some of Steve Yegge’s other writing was pretty valuable.

I finished the phone interview feeling pretty reasonable, but I was still pretty pleased when I got called to arrange some on-site interviews in Sydney.

We arranged to do two in-person interviews with engineers in Sydney, and then two video-conference interviews with London, in late May.

So I went back to studying. Working through the material in Steve Yegge’s post was actually getting me more and more worried about all the things I didn’t know, but what else could I do?

In my first interview, the first question was pretty much my nightmare scenario. It was a (computer) maths question – a (to quote the interviewer) “easy question to start you off” – and it was something I didn’t know. Even worse, it was something I’d known I didn’t know but had left in favour of other things. So I muddled through the best I could and got the answer in the end, and the subsequent questions from that interviewer were better, but I was pretty worried that I’d blown it badly.

The next interview was much better. It was pretty clear that the first interviewer had told the second one that I was pretty nervous, because he kept telling me to slow down and not to worry too much. His first question was something I was much more confident about, and I got the naive solution out pretty quickly. Even better, I was able to identify that the class of problem was the same as something I’d been asked before, so I was able to skip the obvious improvement and go straight from the naive solution to the optimal solution in one step. The interviewer was happy about that, and let me choose whether I wanted a low level or high level question next.

I chose a high level question, and he gave me a “design and sketch-code an appropriate interface” problem. I was very happy with that question because it’s the kind of thing I deal with most days in my work. I gave what I thought was an adequate answer, although I could see a lot of problems with my implementation. The interviewer was very happy with it, though, and said it was the best answer he’d seen. That surprised me, because I could see areas to improve but I’d run out of time, so I suspect telling me that was a technique to try and get me over my “nerves”.

The third interview was the first video-conference one. Nothing really stood out to me in this one, except that it was the only interview with a question that involved talking about design trade-offs etc. instead of coding. There was also an interesting question where I forgot a pretty basic computer science concept, but once I got a hint I solved it reasonably.

The fourth interview was the interesting one. The first question involved writing a solver for a puzzle-type game. Unfortunately, it wasn’t a game I’d played before and that really cost me, because I didn’t know how I’d go about solving it. To be honest, I struggled pretty badly with this one. I did write a checker to determine if a given solution was valid, but it was the one question I had to give up on.

The second question from that interview was probably the best question I was given. I wish I could post it here, because the approach to solving it and the optimizations used were just so typical of all the other questions, and the optimal solution is glorious, and yet is easy to understand.

So that was it – the famous Google interview. I can’t say that there were any surprises, and I came out of it with mixed feelings.

I felt that I’d done reasonably well. I’d missed one question, and struggled in another, but I thought some of my other answers were pretty good, and I hoped my second interview might have been enough to get me over the line.

I was hoping to find out quickly how I’d gone, but that wasn’t to be. My next contact with Google was at a Developer Day here in Adelaide. I was fortunate in that the recruitment consultant from Sydney was at that event, and I’d met him at the interview. One of my interviewers was also there (from the second interview – the one I’d done well in).

I spoke to them both, and both were pretty positive. The recruiter actually said he’d looked at my feedback and that I shouldn’t plan to be in my current job much longer, and both asked if I was fixed on a job in London or if I’d be interested in Sydney. They both mentioned again that I’d been very nervous, which I gladly agreed with (anything to excuse my bad answers!)

I came away from that event feeling pretty optimistic.

A couple of weeks later I finally got a response from London: the position I’d been going for had been taken by an internal applicant, but they’d like to do more interviews with me for another position on the mobile team. It also seemed that the feedback on my interviews had been quite mixed – some very good, and some not so good – which made them want to do another interview. That pretty much brings me up to my last post on the topic.

Some common questions:

  • Do you get asked puzzle/brainteaser questions?
    • No – they were all algorithm and coding.
  • Was it as hard as everyone says?
    • Yes. By far the hardest 5 interviews I’ve ever done.
So would I do it again? Yes I would, but I’d probably go for a position nearer to where I live. I’d also do a few things differently with regard to studying. Instead of working my way through Steve Yegge’s study list, I think I’d concentrate a lot more on the TopCoder algorithm questions.

Modify java.library.path at runtime

Linking to native code in Java is always a hassle. JNI isn’t exactly nice, and there are some oddities around classloaders and native libraries which are annoying if you run into them.

One thing I wasn’t aware of was exactly how hard it is to load a library if it isn’t already in one of the directories specified by the java.library.path system property.

Initially, I thought I’d just be able to alter that property and the JVM would pick up the new locations. That turns out not to be the case, as is shown by this (closed) bug report.

However, there is a solution, outlined in this post on the Sun forums, which revolves around using reflection to alter the ClassLoader class’s private usr_paths field.

	public static void addDir(String s) throws IOException {
		try {
			// This enables the java.library.path to be modified at runtime
			// From a Sun engineer at http://forums.sun.com/thread.jspa?threadID=707176
			// 
			Field field = ClassLoader.class.getDeclaredField("usr_paths");
			field.setAccessible(true);
			String[] paths = (String[])field.get(null);
			for (int i = 0; i < paths.length; i++) {
				if (s.equals(paths[i])) {
					return;
				}
			}
			String[] tmp = new String[paths.length+1];
			System.arraycopy(paths,0,tmp,0,paths.length);
			tmp[paths.length] = s;
			field.set(null,tmp);
			System.setProperty("java.library.path", System.getProperty("java.library.path") + File.pathSeparator + s);
		} catch (IllegalAccessException e) {
			throw new IOException("Failed to get permissions to set library path");
		} catch (NoSuchFieldException e) {
			throw new IOException("Failed to get field handle to set library path");
		}
	}
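
Usage is then just a matter of adding the directory before loading the library. A trivial sketch – the path and library name below are made up:

// Hypothetical usage: the directory and library name are examples only
addDir("/opt/myapp/native");
System.loadLibrary("mynativelib");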

Obviously that relies on JVM internals, so I wouldn’t expect it to be portable across JVMs.

The problem with OpenID is…

The problem with OpenID is branding – people get (very) confused when they get taken off-site to log in. I’ve watched usability testing of this, and it is truly horrible. Obviously this isn’t unique to OpenID – it applies equally to any federated identity solution (in fact, Shibboleth-based federations are even worse than OpenID in this respect).

I think user education will help, but it would be really good to be able to extend OpenID so that a logo can be shown on the identity provider’s site, letting the user see they are logging into site “blah” via whatever OpenID provider they use.