Why question answering is hard

IBM recently semi-released their Watson APIs. Judging from the comments on the HN thread many don’t realize just how hard question answering (QA) is, nor just how good Watson is.

I’ve spent a fair bit of time building my own QA system. As is often the way,Â that’s given me some insight into what a big problem space this is.

To be clear: I think QA is a AI-Complete problem. Any system that works on a subset of QA is a good achievement.

Most QA systems have roughly 3Â parts:

Fact extraction
Understanding the question
Generating an answer

Fact Extraction

Fact extraction means understanding domain-specific data in a way that allows your system to use it to answer questions.

At its simplest, this means building indexes of keywords so a simple keyword search will match documents containing facts. That goes someway towards question answering, but probably isn’t enough in most domains to build a state-of-the-art system.

Fact extraction usuallyÂ has two parts:

Entity Extraction
Relation Extraction

Entity extraction means finding “things” that facts are about, and what type a thing is. To a large extent this is aÂ Natural Language Processing (NLP) process: find nouns in a paragraph and they are probably entities.

Relation Extraction is related (ha!). It means understanding how entities mentioned in some text relate and possibly how they relate to other entities the system knows about.

ConsiderÂ the sentence: “Jane was born in Birmingham, England”.

A good system will extract the following:

JaneÂ is a person-type entity.
Birmingham is a place-type entity
England is a place-type entity
Jane was born inÂ Birmingham
Birmingham is part of England

There are some reasonable open source systems for entity and relation extraction. The StanfordÂ Named Entity Recognition software is probably the best known. The Python NLTK library can do it, and most recently the DARPA-fundedÂ MITIE project looks pretty nice.

“Inference” is sometimes considered part of fact extraction (or it may be considered part of answering). Inference means generating new facts by inferring them from others using rules. For example, from the sentence above we can infer that Jane was born in England (because she was born inÂ Birmingham, andÂ Birmingham is part of England).

There are a number of ways to do inference. The Prolog language was pretty much designed for it, and many “semantic web” datastores and query systems have inference engines built in.

An alternate way of doing Fact Extraction is manually. This sounds extremely stupid, butÂ is actually reasonably common. Notably, Wolfram Alpha is apparently manually curated.

Most systems use a hybrid: for example, many will use a manually curated fact base (eg, Freebase and/or DBPedia) to attempt to verify automaticÂ entity generation.

One thing to be aware of with this part of the system is that “facts” will often conflict so the system needs a way of resolving that conflict. The reasons for these conflicts are many, but some of the most obvious are that data changes over time (“How many euros in a US dollar?”) and thatÂ human biases mean facts aren’t as clear as they should be (“Where was Obama born”).

Most systems deal with this by attempting to assign some kind of reliability score to the source of a fact. Google’s Page Rank is a simple example of this, but it can be much more manual (eg, facts extracted from the CIA World Fact Book are probably more reliable than from Joe’s random website)

Even manual curation is hard. My system currently answers “Who is the president of the US?” with “The president of United States is John Roberts” (ie, the chief justice). This is because I’ve manually mapped the term “president” to “leader”. The DBPedia project has then flattened the concept “constitutional authority” to “leader”. This system of manual mapping means my system gives the correct answer in most cases, but in one important case it breaks (It’s easy enough to fix this, but I think it’s quite instructive).

The published state of the art for fact extraction is Google’s very recent “Knowledge Vault” paper:Â http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf. In is they talk about four main fact extraction methods they use on the web: Text parsing, HTML DOM parsing (they have a classifier trained to extract entities and relationships from the DOM), HTML Tables (over 570 million tables!) and Human Annotations (using metadata in the page). Interestingly, the DOM method generates the most facts and in the most accurate (1.2B facts, 8% high confidence).

For people still hoping semanticÂ metadata will be a meaningful player: 140M facts, and only 0.2% of them are scored high confidence.

Understanding the question

Most system regard this is a natural language processing problem.Â Interestingly you can get quite a long way by dropping stopwords and searching on keywords, but that won’t work past a certain point. The state-of-the-art for this is anything that can understand “Jeopardy”-type questions. These questions are often quite difficult for humans to parse, so Watson’s accomplishment in handling these is pretty impressive.

Watson handles this using aÂ NLP parser to build a syntax tree. This tree is then converted to Prolog facts, and rules are then built using this fact-relationship-tree which are passed down to the Watson evidence evaluation system.

It’s interesting to look at a set of related questions to see how the complexity explodes:

Who is Bill Clinton?
Who is Bill Clinton’s daughter?
Who is the 42nd president?
List the children of the 42nd president

At the time of writing Google, Bing and Wolfram Alpha give the answer to question one separately to their search results.

Google andÂ Wolfram Alpha give the answer to the second question separately to search results, while Bing links toÂ Chelsea Clinton as the first result, and gives a “more information” link about her in the side bar.

Google can answer the thirdÂ question. Wolfram Alpha needs to have the question asked as “Who is the 42nd US president” and then gives the correct answer.Â Bing links to Bill Clinton.

None of the systems can handle the fourth question.

Â Generating an answer

Generating an answer involves using the understandings generated in step two to query the set of facts extracted in step one, and then finding the most likely answer.

Most people assume thatÂ QA systems try to find a single answer. The truth is more subtle: most seems to try to generate lots of possible answers and then have a pruning and ranking step to narrow those answers down to the most likely.

This is akin to how Google uses page rank to rank pages in web search: there are many pages that match keywords, but Google tries to put the most useful first.

In a QA system, this ranking may be done by having some kind of confidence associated with each fact that asserted in the system.

The most commonly used example of this is the question “where was Obama born?”. Searching the web, there are many assertions that he was born in Kenya, and there are many that he was born in Hawaii. A simple system could rank them on the number of assertions found – any in many cases this may work fine. However, it is vulnerable to gaming, so some kind of trust score is probably more advisable.

Watson takes an interesting approach to this. It will generate an English version of the answer, and then rapidly scan for evidence supporting that answer.

For example, for the question “List the children of the 42nd president” it may generate lists of all the children of all the presidents. Then it uses those children andÂ the phrase “the 42nd president” as keywords to see how often they are found together.

That approach works well because Watson has manually-curated sources of information (ie, it is unlikely to be gamed). It would be excellent if this approach could be generalised to work against less-trusted sources of information.

Summary

In summary, automated question answering is an extremely broad field,Â and “solving” it involves solving a number of subproblems, each of which is hard in itself.

Both statistical machine learning methods and knowledge engineering methods are needed to build a complete system. In my view is seems likely that breakthroughs will come by using machine learning to do the knowledge engineering.

I believe that the improving nature of QA systems shows what amazing progress the field of AI has made over the last 10 years.

AI willÂ always remain “that things that computers can’t do”. But in the QA field at least it’s clear that computer systemÂ are already approchingÂ human abilities.

2 thoughts on “Why question answering is hard”

Daniel says:

September 26, 2014 at 4:53 am

Hi Nick,

Very interesting article. I would love to bounce some Q&A ideas and doubts your way to get your feedback. Could you send me an email so we can get in touch?

Thanks
Jack says:

May 11, 2015 at 2:32 am

Hello,

I recently found this post of yours on HN: https://news.ycombinator.com/item?id=7148667 Anyway I’m very curoius to find out what you know and have done implementation wise as I’ve had a very similar idea for a while now but just haven’t known how to build it.

In particular I just think I need the English parsing side of things, a Q&A system or parts of a Q&A system to put together with some other things like Sympy and I believe with Sympy and Python a better system could be built.

BadMagicNumber

My Blog, Take 4