Been a while since I posted here. I spent 2014-2018 running data science and machine learning R&D programs at D2DCRC.
NIPS is the premier conference on Deep Learning. Given the accelerating state of the art, it’s interesting to see what is new.
The paper list is available from http://www.dlworkshop.org/accepted-papers. These are the papers that stood out to me (or at least matched my interests).
cuDNN: Efficient Primitives for Deep Learning: A library from NVIDIA for deep learning on GPUs. ~36% speedup on training using a K40. Integrates with Caffe (which has quickly become the standard Deep Learning library).
Distilling the Knowledge in a Neural Network: Compressing complex knowledge representations into simpler models. I need to read this a few more times to understand it well enough to comment sensibly. By Oriol Vinyals, Geoffrey Hinton and Jeff Dean, so there’s that…
Learning Word Representations with Hierarchical Sparse Coding: From the Ark group at CMU. An alternative to Word2Vec for understanding word semantics. Results seem roughly comparable to Word2Vec (which is good! Word2Vec is pretty much one of the miracles of the age.) There is a claim that training is significantly faster than previous methods: 2 hours to train vs 6.5 for Word2Vec on the same 6.8 billion word corpus. See the Paragraph Vectors paper as well.
Deep Learning as an Opportunity in Virtual Screening: Deep Learning for drug screening. I know nothing about drug screening, but this seems pretty significant:
“Deep learning outperformed all other methods with respect to the area under ROC curve and was significantly better than all commercial products. Deep learning surpassed the threshold to make virtual compound screening possible and has the potential to become a standard tool in industrial drug design.”
Document Embedding with Paragraph Vectors: “Lady Gaga” – “American” + “Japanese” = “Ayumi Hamasaki”. Ayumi Hamasaki “has been dubbed the ‘Empress of J-Pop’ because of her popularity and widespread influence in Japan and throughout Asia… Since her 1998 debut with the single recording ‘Poker Face’, Hamasaki has sold over 53 million records in Japan, ranking her among the best-selling recording artists in the country”.
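The arithmetic in that example is just vector addition followed by a nearest-neighbour search. Here is a minimal sketch of the idea, assuming made-up 3-dimensional vectors (real paragraph vectors are learned from text and are much larger; the numbers, the VECTORS table and the analogy helper are all hypothetical and chosen only to make the toy work):

```python
import math

# Hypothetical toy embeddings. The dimensions loosely mean
# (pop star, American, Japanese); real embeddings are learned, not hand-written.
VECTORS = {
    "Lady Gaga":      [1.0, 1.0, 0.0],
    "Ayumi Hamasaki": [1.0, 0.0, 1.0],
    "American":       [0.0, 1.0, 0.0],
    "Japanese":       [0.0, 0.0, 1.0],
    "Tokyo":          [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(start, minus, plus):
    """Return the known item closest to start - minus + plus."""
    target = [s - m + p for s, m, p in
              zip(VECTORS[start], VECTORS[minus], VECTORS[plus])]
    # Exclude the query terms themselves, as is usual for this kind of lookup.
    candidates = [w for w in VECTORS if w not in {start, minus, plus}]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("Lady Gaga", "American", "Japanese"))
```

With these toy numbers the nearest remaining vector is the one for Ayumi Hamasaki, which is the behaviour the paper reports on real data.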
Explain Images with Multimodal Recurrent Neural Networks: Image captioning. There have been a few papers on this floating around lately, seemingly in preparation for CVPR 2015. This one is from UCLA & Baidu. Others include Stanford, Google, Berkeley, Microsoft and University of Toronto. Someone should go and compare them all.
Deep Learning for Answer Sentence Selection: Selecting sentences that answer a question using word/sentence semantics only. This is a weird approach, but appears to work well in some circumstances. I don’t think it would be possible to build a QA system around it, but it could be a useful adjunct to other more traditional methods.
- I haven’t tried this yet, but the examples are very impressive. “We introduce a recursive neural network model that is able to correctly answer paragraph-length factoid questions from a trivia competition called quiz bowl. Our model is able to succeed where traditional approaches fail, particularly when questions contain very few words (e.g., named entities) indicative of the answer.” http://cs.umd.edu/~miyyer/qblearn/
- IBM finally opening up the Watson system with an API. Allegedly the way to get access to this is via the BlueMix PAAS. https://developer.ibm.com/watson/docs/developing-watson-apis/ (I’ve tried this now. The Question Answering API is pretty much untrained and mostly gives bad results. However, its confidence scoring is very good, ie, if it gives a bad answer it will have a low confidence score, whereas answers with a 90%+ score are almost always right)
- The paper on Google’s Knowledge Vault. I thought I’d posted this already: http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf
- DeepDive from Stanford. This used to be Wisci from UW-Madison. http://deepdive.stanford.edu/index.html. It does probabilistic inference on unstructured data.
- https://github.com/percyliang/sempre SEMPRE is a toolkit for training semantic parsers, which map natural language utterances to denotations (answers) via intermediate logical forms.
I’ve spent a fair bit of time building my own QA system. As is often the way, that’s given me some insight into what a big problem space this is.
To be clear: I think QA is an AI-Complete problem. Any system that works on a subset of QA is a good achievement.
Most QA systems have roughly 3 parts:
- Fact extraction
- Understanding the question
- Generating an answer
Fact extraction means understanding domain-specific data in a way that allows your system to use it to answer questions.
At its simplest, this means building indexes of keywords so a simple keyword search will match documents containing facts. That goes some way towards question answering, but probably isn’t enough in most domains to build a state-of-the-art system.
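As a rough illustration of that simplest case, here is a sketch of an inverted keyword index. The function names and the two sample documents are my own assumptions for the example, not part of any particular system:

```python
from collections import defaultdict

PUNCTUATION = ".,?!\"'"

def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word.strip(PUNCTUATION)].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = [w.strip(PUNCTUATION) for w in query.lower().split()]
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))

# Hypothetical sample documents.
docs = {
    "doc1": "Jane was born in Birmingham, England.",
    "doc2": "Birmingham is part of England.",
}
index = build_index(docs)
print(search(index, "born Birmingham"))
```

This finds the document that mentions the fact, but it has no idea what the fact is, which is why keyword search alone only goes part of the way.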
Fact extraction usually has two parts:
- Entity Extraction
- Relation Extraction
Entity extraction means finding “things” that facts are about, and what type a thing is. To a large extent this is a Natural Language Processing (NLP) process: find nouns in a paragraph and they are probably entities.
Relation Extraction is related (ha!). It means understanding how entities mentioned in some text relate and possibly how they relate to other entities the system knows about.
Consider the sentence: “Jane was born in Birmingham, England”.
A good system will extract the following:
- Jane is a person-type entity.
- Birmingham is a place-type entity
- England is a place-type entity
- Jane was born in Birmingham
- Birmingham is part of England
There are some reasonable open source systems for entity and relation extraction. The Stanford Named Entity Recognition software is probably the best known. The Python NLTK library can do it, and most recently the DARPA-funded MITIE project looks pretty nice.
“Inference” is sometimes considered part of fact extraction (or it may be considered part of answering). Inference means generating new facts by inferring them from others using rules. For example, from the sentence above we can infer that Jane was born in England (because she was born in Birmingham, and Birmingham is part of England).
There are a number of ways to do inference. The Prolog language was pretty much designed for it, and many “semantic web” datastores and query systems have inference engines built in.
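To make the Jane example concrete, here is a minimal sketch of rule-based inference in Python rather than Prolog. The triple format and the relation names born_in and part_of are assumptions made for this illustration:

```python
def infer_born_in(facts):
    """Apply one rule repeatedly until no new facts appear:
    born_in(X, A) and part_of(A, B)  =>  born_in(X, B)."""
    facts = set(facts)
    while True:
        new = {
            (person, "born_in", bigger)
            for (person, rel1, place) in facts if rel1 == "born_in"
            for (smaller, rel2, bigger) in facts
            if rel2 == "part_of" and smaller == place
        }
        if new <= facts:  # nothing new was derived, so we are done
            return facts
        facts |= new

# Facts extracted from "Jane was born in Birmingham, England".
extracted = {
    ("Jane", "born_in", "Birmingham"),
    ("Birmingham", "part_of", "England"),
}
all_facts = infer_born_in(extracted)
print(("Jane", "born_in", "England") in all_facts)  # True
```

A real inference engine handles many rules at once and is far more careful about efficiency, but the basic loop of deriving new facts until nothing changes is the same idea.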
An alternative is to do Fact Extraction manually. This sounds extremely stupid, but is actually reasonably common. Notably, Wolfram Alpha is apparently manually curated.
Most systems use a hybrid: for example, many will use a manually curated fact base (eg, Freebase and/or DBPedia) to attempt to verify automatic entity generation.
One thing to be aware of with this part of the system is that “facts” will often conflict so the system needs a way of resolving that conflict. The reasons for these conflicts are many, but some of the most obvious are that data changes over time (“How many euros in a US dollar?”) and that human biases mean facts aren’t as clear as they should be (“Where was Obama born”).
Most systems deal with this by attempting to assign some kind of reliability score to the source of a fact. Google’s PageRank is a simple example of this, but it can be much more manual (eg, facts extracted from the CIA World Fact Book are probably more reliable than from Joe’s random website).
Even manual curation is hard. My system currently answers “Who is the president of the US?” with “The president of United States is John Roberts” (ie, the chief justice). This is because I’ve manually mapped the term “president” to “leader”. The DBPedia project has then flattened the concept “constitutional authority” to “leader”. This system of manual mapping means my system gives the correct answer in most cases, but in one important case it breaks (It’s easy enough to fix this, but I think it’s quite instructive).
The published state of the art for fact extraction is Google’s very recent “Knowledge Vault” paper: http://www.cs.cmu.edu/~nlao/publication/2014.kdd.pdf. In it they talk about four main fact extraction methods they use on the web: Text parsing, HTML DOM parsing (they have a classifier trained to extract entities and relationships from the DOM), HTML Tables (over 570 million tables!) and Human Annotations (using metadata in the page). Interestingly, the DOM method generates the most facts and is the most accurate (1.2B facts, 8% high confidence).
For people still hoping semantic metadata will be a meaningful player: 140M facts, and only 0.2% of them are scored high confidence.
Understanding the question
Most systems regard this as a natural language processing problem. Interestingly you can get quite a long way by dropping stopwords and searching on keywords, but that won’t work past a certain point. The state-of-the-art for this is anything that can understand “Jeopardy”-type questions. These questions are often quite difficult for humans to parse, so Watson’s accomplishment in handling these is pretty impressive.
Watson handles this using an NLP parser to build a syntax tree. This tree is then converted to Prolog facts, and rules built from this fact-relationship tree are passed down to the Watson evidence evaluation system.
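At the other end of the spectrum from Watson, the "drop stopwords and search on keywords" approach is only a few lines. This is a sketch with a tiny hand-picked stopword list (an assumption for the example; real systems use a much larger one):

```python
# A tiny, hand-picked stopword list for illustration only.
STOPWORDS = {"who", "is", "the", "of", "a", "an", "in", "was", "where", "what"}

def question_keywords(question):
    """Lowercase the question, strip punctuation and drop stopwords."""
    words = (w.strip("?.,!\"'").lower() for w in question.split())
    return [w for w in words if w and w not in STOPWORDS]

print(question_keywords("Who is Bill Clinton?"))
print(question_keywords("Who is the 42nd president?"))
```

The first question reduces to a name you can search for directly. The second reduces to "42nd" and "president", which says nothing about who that is, and that is roughly the point where keywords stop being enough.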
It’s interesting to look at a set of related questions to see how the complexity explodes:
- Who is Bill Clinton?
- Who is Bill Clinton’s daughter?
- Who is the 42nd president?
- List the children of the 42nd president
At the time of writing Google, Bing and Wolfram Alpha give the answer to question one separately to their search results.
Google and Wolfram Alpha give the answer to the second question separately to search results, while Bing links to Chelsea Clinton as the first result, and gives a “more information” link about her in the side bar.
Google can answer the third question. Wolfram Alpha needs to have the question asked as “Who is the 42nd US president” and then gives the correct answer. Bing links to Bill Clinton.
None of the systems can handle the fourth question.
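One way to see why the fourth question is harder: it is really two questions nested inside each other. Here is a sketch against a toy fact base, where the triples, relation names and helper functions are all hypothetical:

```python
# Hypothetical toy fact base of (subject, relation, object) triples.
FACTS = {
    ("Bill Clinton", "is_a", "person"),
    ("Bill Clinton", "president_number", "42"),
    ("Bill Clinton", "child", "Chelsea Clinton"),
}

def subjects(relation, obj):
    """All subjects that have the given relation to obj."""
    return {s for (s, r, o) in FACTS if r == relation and o == obj}

def objects(subject, relation):
    """All objects that the subject has the given relation to."""
    return {o for (s, r, o) in FACTS if s == subject and r == relation}

# "Who is the 42nd president?" is a single lookup.
presidents = subjects("president_number", "42")

# "List the children of the 42nd president": the answer to the inner
# question becomes the subject of the outer one.
children = {c for p in presidents for c in objects(p, "child")}

print(presidents)
print(children)
```

The lookups themselves are trivial. The hard part, which this sketch skips entirely, is getting from the English sentence to that nested structure.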
There are other approaches. In particular this recent paper takes a deep-learning approach with some pretty impressive results.
Generating an answer
Generating an answer involves using the understandings generated in step two to query the set of facts extracted in step one, and then finding the most likely answer.
Most people assume that QA systems try to find a single answer. The truth is more subtle: most seem to generate lots of possible answers and then have a pruning and ranking step to narrow those answers down to the most likely.
This is akin to how Google uses PageRank to rank pages in web search: there are many pages that match keywords, but Google tries to put the most useful first.
In a QA system, this ranking may be done by having some kind of confidence associated with each fact that is asserted in the system.
The most commonly used example of this is the question “where was Obama born?”. Searching the web, there are many assertions that he was born in Kenya, and there are many that he was born in Hawaii. A simple system could rank them on the number of assertions found – and in many cases this may work fine. However, it is vulnerable to gaming, so some kind of trust score is probably more advisable.
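The difference between counting assertions and weighting them by trust can be sketched like this. The source names and trust values are made up for the example, and the rank function is just one plausible way to do it:

```python
from collections import defaultdict

# Hypothetical assertions found on the web: (answer, source).
ASSERTIONS = [
    ("Kenya", "spam-site-1"),
    ("Kenya", "spam-site-2"),
    ("Kenya", "spam-site-3"),
    ("Hawaii", "encyclopedia"),
    ("Hawaii", "newspaper"),
]

# Made-up trust scores; unknown sources get a low default.
TRUST = {"encyclopedia": 0.9, "newspaper": 0.8}
DEFAULT_TRUST = 0.1

def rank(assertions, weight):
    """Sum a per-source weight for each answer, highest score first."""
    scores = defaultdict(float)
    for answer, source in assertions:
        scores[answer] += weight(source)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

by_count = rank(ASSERTIONS, lambda source: 1.0)
by_trust = rank(ASSERTIONS, lambda source: TRUST.get(source, DEFAULT_TRUST))

print(by_count[0][0])  # Kenya: three low-quality pages outvote two good ones
print(by_trust[0][0])  # Hawaii: the trusted sources dominate
```

Of course this just moves the problem to working out the trust scores, which is where the hard work is.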
Watson takes an interesting approach to this. It will generate an English version of the answer, and then rapidly scan for evidence supporting that answer.
For example, for the question “List the children of the 42nd president” it may generate lists of all the children of all the presidents. Then it uses those children and the phrase “the 42nd president” as keywords to see how often they are found together.
That approach works well because Watson has manually-curated sources of information (ie, it is unlikely to be gamed). It would be excellent if this approach could be generalised to work against less-trusted sources of information.
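To be clear, I don't know how Watson implements this internally. The following is only a sketch of the general idea of scoring candidate answers by how often they appear alongside the question phrase, using a hypothetical three-sentence corpus:

```python
# Hypothetical corpus and candidate answers, for illustration only.
CORPUS = [
    "Chelsea Clinton is the daughter of Bill Clinton, the 42nd president.",
    "The 42nd president attended the graduation of Chelsea Clinton.",
    "Malia Obama is the daughter of the 44th president.",
]
CANDIDATES = ["Chelsea Clinton", "Malia Obama"]

def evidence_count(candidate, phrase, passages):
    """Count passages that mention both the candidate and the phrase."""
    candidate, phrase = candidate.lower(), phrase.lower()
    return sum(
        1 for p in passages
        if candidate in p.lower() and phrase in p.lower()
    )

scores = {c: evidence_count(c, "the 42nd president", CORPUS) for c in CANDIDATES}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

Simple co-occurrence like this is exactly the kind of signal that is easy to game on the open web, which is why the trusted corpus matters so much.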
In summary, automated question answering is an extremely broad field, and “solving” it involves solving a number of subproblems, each of which is hard in itself.
Both statistical machine learning methods and knowledge engineering methods are needed to build a complete system. In my view it seems likely that breakthroughs will come by using machine learning to do the knowledge engineering.
I believe that the improving nature of QA systems shows what amazing progress the field of AI has made over the last 10 years.
AI will always remain “those things that computers can’t do”. But in the QA field at least it’s clear that computer systems are already approaching human abilities.
- Actual, real guidance on how to secure Docker containers – what is possible and what isn’t. http://www.slideshare.net/jpetazzo/is-it-safe-to-run-applications-in-linux-containers
- Google building a fact base by extracting facts from the broad web: http://www.newscientist.com/article/mg22329832.700-googles-factchecking-bots-build-vast-knowledge-bank.html#.U_ctvbySxmM
- MIT Information Extraction: state-of-the-art information extraction tools. The current release includes tools for performing named entity extraction and binary relation detection as well as tools for training custom extractors and relation detectors. https://github.com/mit-nlp/MITIE
- http://googleresearch.blogspot.com/2014/09/building-deeper-understanding-of-images.html Labelling parts of images. The examples are pretty impressive.
- http://public.dhe.ibm.com/common/ssi/ecm/en/gbe03620usen/GBE03620USEN.PDF IBM Whitepaper on their vision for using blockchains to “power” the Internet-of-Things. See also https://gigaom.com/2014/09/09/check-out-ibms-proposal-for-an-internet-of-things-architecture-using-bitcoins-block-chain-tech/
- Penn Treebank II tags: https://gist.github.com/nlothian/9240750. Because they aren’t actually documented anywhere except one person’s thesis. And now that is offline.
- Knowledge Extraction from text. Looks good, pity about the license: http://knowitall.github.io/openie/
- AirBNB service discovery, including autoconfig for Docker/HAProxy https://github.com/airbnb/synapse#docker
- DNS with an API https://github.com/skynetservices/skydns
- BTSync on Ubuntu 12.04. Interesting, too bad BTSync isn’t open source.
- Dashing, from Shopify. Framework for attractive dashboards.
- Gridster. Gridster is a jQuery plugin that allows building intuitive draggable layouts from elements spanning multiple columns
- Prediction.io PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery.
- MBox. Mbox is a lightweight sandboxing mechanism that any user can use without special privileges in commodity operating systems.
Currently working on the most stupid idea I’ve ever had. It’s so dumb that it is pretty much guaranteed to fail.
- BayesDB. Query the probable implications of your data as easily as a SQL database lets you query the data itself. eg: INFER salary FROM mytable WHERE age > 30; I think I just saw the future… Also ALPS is somewhat related, for Postgresql.
- Why Cognition-as-a-Service is the next operating system battlefield – something I’m interested in.
- A Programmer’s Guide to Data Mining – looks pretty good.
- BaseKB – cleaned up Freebase data
- Mission Control is here. Java profiling gets even better.
Am I being stupid for getting sucked into the RDF wormhole? It’s almost a parallel universe, but is directly relevant both for work and a private project I’m working on. <Sigh>
- TDB Java API – because the old version of the Jena datastore that ran on a database is now only in maintenance mode.
- Comparison of Triple Stores [PDF] – a pretty decent comparison. This is telling regarding inference: All the off the shelf reasoners available expect the data to be cached in-memory to perform the reasoning
- http://decaf.berkeleyvision.org/ – image recognition using deep learning. Pretty impressive. Code is open for non-commercial use. Deep Learning algorithms running on GPUs seem to have been a real breakthrough. http://deeplearning.net/tutorial/lenet.html#running-the-code shows benchmarks for the same algorithm on an i7 (380.28m) vs a GeForce GTX 480 (32.52m).
- Skydb – Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.
- http://build.porteus.org/ – custom build Linux distro, then download it.