Deep Natural Language Processing for Web-scale Search

Ronald M. Kaplan

Proceedings of LFG09; CSLI Publications On-line

Abstract

Conventional keyword search technologies have been remarkably successful at making vast amounts of information available to ordinary users. They achieve robustness and scale by creating efficient bag-of-words indexes of the terms they extract from unstructured text and by encouraging users to specify their information needs with keywords that are well-suited to bag-of-words retrieval. However, these methods suffer from errors of both precision and recall. Undesired results are returned because the systems do not index, and therefore cannot filter by, the semantic relations that the user has in mind. Desired results are missed because keyword matches cannot identify passages that use different terms and different syntactic constructions to express semantically equivalent concepts.
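The two kinds of error can be illustrated with a minimal sketch of a bag-of-words index. This is a toy built for illustration only, not the indexing component described in this talk; the function names (`tokenize`, `build_index`, `search`), the three sample sentences, and the conjunctive (all-terms-must-match) retrieval rule are all assumptions made for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each term to the set of document ids that contain it.

    Word order and grammatical roles are discarded, which is what
    makes this a bag-of-words index.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (hypothetical rule)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample documents.
docs = [
    "The dog bit the man.",             # what the user is looking for
    "The man bit the dog.",             # same words, reversed roles
    "A canine attacked a pedestrian.",  # similar meaning, different words
]

index = build_index(docs)
hits = search(index, "dog bit man")
print(sorted(hits))
```

Under these assumptions the query retrieves documents 0 and 1. Document 1 is a precision error, since the index cannot tell who bit whom, and document 2 is a recall error, since it shares no terms with the query even though it describes a similar event.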

It is not a novel idea that these precision and recall problems can be addressed in principle by using deep natural language processing to extract underlying semantic concepts and relations both from text and from queries. But it has proven difficult to put this idea into practice at large scale. This talk describes the LFG-based natural language pipeline that was developed jointly by researchers and developers at (Xerox) PARC and at Powerset, an LFG start-up company. We have combined fairly mature linguistic technologies with carefully tuned indexing and retrieval components to build a large-scale natural language search capability that is now available on the web as part of Microsoft Bing. The LFG architecture not only provides more accurate search results but also allows semantic relationships to be respected in the way that search results are arranged and presented.