Why question answering is hard

IBM recently semi-released their Watson APIs. Judging from the comments on the HN thread many don’t realize just how hard question answering (QA) is, nor just how good Watson is.

I’ve spent a fair bit of time building my own QA system. As is often the way, that’s given me some insight into what a big problem space this is.

To be clear: I think QA is a AI-Complete problem. Any system that works on a subset of QA is a good achievement.

Most QA systems have roughly 3 parts:

  1. Fact extraction
  2. Understanding the question
  3. Generating an answer

Fact Extraction

Fact extraction means understanding domain-specific data in a way that allows your system to use it to answer questions.

At its simplest, this means building indexes of keywords so a simple keyword search will match documents containing facts. That goes someway towards question answering, but probably isn’t enough in most domains to build a state-of-the-art system.

Fact extraction usually has two parts:

  • Entity Extraction
  • Relation Extraction

Entity extraction means finding “things” that facts are about, and what type a thing is. To a large extent this is a Natural Language Processing (NLP) process: find nouns in a paragraph and they are probably entities.

Relation Extraction is related (ha!). It means understanding how entities mentioned in some text relate and possibly how they relate to other entities the system knows about.

Consider the sentence: “Jane was born in Birmingham, England”.

A good system will extract the following:

  • Jane is a person-type entity.
  • Birmingham is a place-type entity
  • England is a place-type entity
  • Jane was born in Birmingham
  • Birmingham is part of England

There are some reasonable open source systems for entity and relation extraction. The Stanford Named Entity Recognition software is probably the best known. The Python NLTK library can do it, and most recently the DARPA-funded MITIE project looks pretty nice.

“Inference” is sometimes considered part of fact extraction (or it may be considered part of answering). Inference means generating new facts by inferring them from others using rules. For example, from the sentence above we can infer that Jane was born in England (because she was born in Birmingham, and Birmingham is part of England).

There are a number of ways to do inference. The Prolog language was pretty much designed for it, and many “semantic web” datastores and query systems have inference engines built in.

An alternate way of doing Fact Extraction is manually. This sounds extremely stupid, but is actually reasonably common. Notably, Wolfram Alpha is apparently manually curated.

Most systems use a hybrid: for example, many will use a manually curated fact base (eg, Freebase and/or DBPedia) to attempt to verify automatic entity generation.

One thing to be aware of with this part of the system is that “facts” will often conflict so the system needs a way of resolving that conflict. The reasons for these conflicts are many, but some of the most obvious are that data changes over time (“How many euros in a US dollar?”) and that human biases mean facts aren’t as clear as they should be (“Where was Obama born”).

Most systems deal with this by attempting to assign some kind of reliability score to the source of a fact. Google’s Page Rank is a simple example of this, but it can be much more manual (eg, facts extracted from the CIA World Fact Book are probably more reliable than from Joe’s random website)

Even manual curation is hard. My system currently answers “Who is the president of the US?” with “The president of United States is John Roberts” (ie, the chief justice). This is because I’ve manually mapped the term “president” to “leader”. The DBPedia project has then flattened the concept “constitutional authority” to “leader”. This system of manual mapping means my system gives the correct answer in most cases, but in one important case it breaks (It’s easy enough to fix this, but I think it’s quite instructive).

The published state of the art for fact extraction is Google’s very recent “Knowledge Vault” paper: http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf. In is they talk about four main fact extraction methods they use on the web: Text parsing, HTML DOM parsing (they have a classifier trained to extract entities and relationships from the DOM), HTML Tables (over 570 million tables!) and Human Annotations (using metadata in the page). Interestingly, the DOM method generates the most facts and in the most accurate (1.2B facts, 8% high confidence).

For people still hoping semantic metadata will be a meaningful player: 140M facts, and only 0.2% of them are scored high confidence.

Understanding the question

Most system regard this is a natural language processing problem. Interestingly you can get quite a long way by dropping stopwords and searching on keywords, but that won’t work past a certain point. The state-of-the-art for this is anything that can understand “Jeopardy”-type questions. These questions are often quite difficult for humans to parse, so Watson’s accomplishment in handling these is pretty impressive.

Watson handles this using a NLP parser to build a syntax tree. This tree is then converted to Prolog facts, and rules are then built using this fact-relationship-tree which are passed down to the Watson evidence evaluation system.

It’s interesting to look at a set of related questions to see how the complexity explodes:

  1. Who is Bill Clinton?
  2. Who is Bill Clinton’s daughter?
  3. Who is the 42nd president?
  4. List the children of the 42nd president

At the time of writing Google, Bing and Wolfram Alpha give the answer to question one separately to their search results.

Google and Wolfram Alpha give the answer to the second question separately to search results, while Bing links to Chelsea Clinton as the first result, and gives a “more information” link about her in the side bar.

Google can answer the third question. Wolfram Alpha needs to have the question asked as “Who is the 42nd US president” and then gives the correct answer. Bing links to Bill Clinton.

None of the systems can handle the fourth question.

Further reading:

There are other approaches. In particular this recent paper takes a deep-learning approach with some pretty impressive results.

 Generating an answer

Generating an answer involves using the understandings generated in step two to query the set of facts extracted in step one, and then finding the most likely answer.

Most people assume that QA systems try to find a single answer. The truth is more subtle: most seems to try to generate lots of possible answers and then have a pruning and ranking step to narrow those answers down to the most likely.

This is akin to how Google uses page rank to rank pages in web search: there are many pages that match keywords, but Google tries to put the most useful first.

In a QA system, this ranking may be done by having some kind of confidence associated with each fact that asserted in the system.

The most commonly used example of this is the question “where was Obama born?”. Searching the web, there are many assertions that he was born in Kenya, and there are many that he was born in Hawaii. A simple system could rank them on the number of assertions found – any in many cases this may work fine. However, it is vulnerable to gaming, so some kind of trust score is probably more advisable.

Watson takes an interesting approach to this. It will generate an English version of the answer, and then rapidly scan for evidence supporting that answer.

For example, for the question “List the children of the 42nd president” it may generate lists of all the children of all the presidents. Then it uses those children and the phrase “the 42nd president” as keywords to see how often they are found together.

That approach works well because Watson has manually-curated sources of information (ie, it is unlikely to be gamed). It would be excellent if this approach could be generalised to work against less-trusted sources of information.


In summary, automated question answering is an extremely broad field, and “solving” it involves solving a number of subproblems, each of which is hard in itself.

Both statistical machine learning methods and knowledge engineering methods are needed to build a complete system. In my view is seems likely that breakthroughs will come by using machine learning to do the knowledge engineering.

I believe that the improving nature of QA systems shows what amazing progress the field of AI has made over the last 10 years.

AI will always remain “that things that computers can’t do”. But in the QA field at least it’s clear that computer system are already approching human abilities.


2 thoughts on “Why question answering is hard

  1. Hi Nick,

    Very interesting article. I would love to bounce some Q&A ideas and doubts your way to get your feedback. Could you send me an email so we can get in touch?


  2. Hello,

    I recently found this post of yours on HN: https://news.ycombinator.com/item?id=7148667 Anyway I’m very curoius to find out what you know and have done implementation wise as I’ve had a very similar idea for a while now but just haven’t known how to build it.

    In particular I just think I need the English parsing side of things, a Q&A system or parts of a Q&A system to put together with some other things like Sympy and I believe with Sympy and Python a better system could be built.

Leave a Reply

Your email address will not be published. Required fields are marked *