Thursday, July 1, 2010

Guest post: Travellr.com talks about building your own location matcher

(Editor's note: The final post in our guest blogger series for DevFest AU 2010 is from Michael Shaw, Travellr.com's Senior Architect responsible for Travellr's core technology stack. Michael works on Travellr.com's real-time Q&A recommendation engine, natural language processor, and location database. Michael's in-house research also includes performance scalability and reputation systems. Outside of work, Michael is a self-confessed "foodie" and an expert locator of amazing restaurants. Warning: this post is highly technical and aimed at experts ... but it's such a cool technical achievement, we wanted to share it more widely.)

Google Maps provides a really handy geocoding API which can be used to extract locations from text input, for example "Byron Bay, Australia". However, the Google Maps geocoding API expects an "address" to geocode, so what do you do if you have a complex sentence like "Where is a nice beach near Byron Bay to go surfing?"? In this case the geocoding API returns zero results, and you can't find a match. It's time to roll up your sleeves and build your own location extractor!

With a little help from some open-source natural language processing (NLP) toolkits you can extract locations from text input and be on your merry way to mapping them on Google Maps. And we're going to show you how!


Location extraction in a nutshell
We developed a practical NLP-based solution for Travellr.com using OpenNLP, an open-source natural language processing library. OpenNLP provides many language processing tools, including a Sentence Detector, Tokenizer, Part of Speech Tagger and a Chunker. We used these tools to construct a Named Entity Recognizer (NER) to match locations in free-form text. (We chose to build our own named entity recogniser rather than use an existing one so we could tailor it to casual internet text and plug in our own location database.)

Our approach is to find words that might be the name of a location, match them against our locations, and then rank the results to determine the most probable match.

The process of our location extractor is illustrated below:
Here are the steps in more detail:

Step 1. Using a sentence detector and a tokenizer, we break the text into individual words. While this seems simple at first, naively splitting sentences on punctuation can lead to problems; the sentence "Where's the best place to buy a bike for less than $200.00 in Sydney?" would be split on the full stop in "$200.00" if we didn't use a smarter sentence detector.
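To see why this matters, here's a minimal Python sketch (not Travellr's actual code, which uses OpenNLP's trained sentence detector) showing how a naive punctuation split breaks on "$200.00" while a slightly smarter rule keeps the question intact:

```python
import re

def naive_split(text):
    """Naively split on '.', '?', '!' -- breaks on decimal amounts
    like $200.00 and on abbreviations."""
    return [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]

def smarter_split(text):
    """Toy stand-in for a trained sentence detector: only split when the
    terminator is followed by whitespace and an upper-case letter."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+(?=[A-Z])", text) if s.strip()]

question = "Where's the best place to buy a bike for less than $200.00 in Sydney?"
print(naive_split(question))    # wrongly splits inside "$200.00" (two fragments)
print(smarter_split(question))  # keeps the question intact (one sentence)
```

A real sentence detector learns these boundaries from training data rather than relying on a single regex, but the failure mode it prevents is exactly the one shown here.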

Step 2. We then tag each of the words (tokens) with what's called a "Part of Speech". Parts of speech are similar to the familiar categories of nouns/verbs/adjectives/adverbs, but are finer-grained categories based on grammatical function. This helps to highlight proper nouns (which are often location names), as well as prepositions, which frequently appear in location-centric content.
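To make the idea concrete, here's a toy rule-based tagger in Python. It's purely illustrative - OpenNLP's tagger uses a trained maximum-entropy model, not hand-written rules like these:

```python
# Toy tagger: prepositions -> IN, capitalised non-initial words -> NNP
# (proper noun), everything else -> NN. A real tagger learns these
# decisions from annotated training data.
PREPOSITIONS = {"in", "near", "to", "at", "from", "around"}

def toy_pos_tag(tokens):
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in PREPOSITIONS:
            tags.append("IN")       # preposition
        elif tok[0].isupper() and i > 0:
            tags.append("NNP")      # likely proper noun (not sentence-initial)
        else:
            tags.append("NN")       # default bucket for this toy example
    return list(zip(tokens, tags))

tokens = "Where is a nice beach near Byron Bay to go surfing".split()
print(toy_pos_tag(tokens))  # 'Byron' and 'Bay' come out as NNP, 'near' as IN
```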

Step 3. We then use a "Chunker" to identify and group short phrases together. This can be extremely useful in identifying multi-word location names, as in our example of "Byron Bay".
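A real chunker labels full phrases using a trained model; the simplified Python sketch below just groups runs of consecutive proper-noun tags into candidate multi-word names, which is enough to show why chunking matters for names like "Byron Bay":

```python
def chunk_proper_nouns(tagged):
    """Group consecutive NNP tokens into candidate multi-word names,
    e.g. ('Byron', 'NNP'), ('Bay', 'NNP') -> 'Byron Bay'."""
    chunks, current = [], []
    for token, tag in tagged:
        if tag == "NNP":
            current.append(token)
        elif current:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tagged = [("Where", "WRB"), ("is", "VBZ"), ("a", "DT"), ("nice", "JJ"),
          ("beach", "NN"), ("near", "IN"), ("Byron", "NNP"), ("Bay", "NNP"),
          ("to", "TO"), ("go", "VB"), ("surfing", "VBG"), ("?", ".")]
print(chunk_proper_nouns(tagged))  # ['Byron Bay']
```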

Step 4. We then give each of the words a numeric score based on the likelihood that they're part of a location name.

The following features are examined to calculate a score for each word:
- How common the word is (e.g. 'it' and 'and' are very common)
- The current word's part of speech (e.g. 'proper noun' is a positive indicator)
- The previous word's part of speech
- Whether the current phrase/chunk is preceded by a preposition (e.g. 'surfing in Sydney')
- Capitalisation (e.g. 'orange' vs 'Orange')
- Chunk labels (locations are often in noun phrases, e.g. 'Byron Bay', 'Las Vegas')
- Proximity of predictive words (nearby words that can help denote the presence of a location name, e.g. 'in', 'to', 'near')
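The feature list above can be sketched as a simple weighted scoring function. The weights and word lists below are illustrative placeholders, not Travellr's actual values:

```python
COMMON_WORDS = {"it", "and", "the", "a", "is", "to", "where"}
PREDICTIVE = {"in", "to", "near", "at", "from"}

def score_word(word, pos, prev_pos, prev_word):
    """Combine several of the features above into a single score.
    The weights here are made up for illustration."""
    score = 0.0
    if word.lower() in COMMON_WORDS:
        score -= 2.0    # very common words are poor candidates
    if pos == "NNP":
        score += 2.0    # proper noun: strong positive indicator
    if prev_pos == "IN":
        score += 1.0    # preceded by a preposition ('surfing in Sydney')
    if word[:1].isupper():
        score += 1.0    # capitalisation ('Orange' vs 'orange')
    if prev_word and prev_word.lower() in PREDICTIVE:
        score += 1.0    # nearby predictive word ('in', 'to', 'near')
    return score

print(score_word("Sydney", "NNP", "IN", "in"))  # 5.0
print(score_word("bike", "NN", "DT", "a"))      # 0.0
```

In practice you'd tune these weights against labelled data rather than picking them by hand, but the structure - additive evidence from independent features - stays the same.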

Step 5. Once we've calculated the word scores, we look up the high-scoring words in our location database. A second pass then identifies the best match: for example, the word "Australia" appearing near "Sydney" reinforces that we're probably talking about "Sydney, Australia" rather than "Sydney, Canada". The second pass also handles typo and spelling correction.
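This second pass can be sketched as a re-ranking step over the candidate matches. The candidate records and the country/population heuristic below are hypothetical, just to show the shape of the idea:

```python
# Hypothetical location records -- a real location database is far richer.
CANDIDATES = {
    "Sydney": [
        {"name": "Sydney", "country": "Australia", "population": 4_500_000},
        {"name": "Sydney", "country": "Canada", "population": 30_000},
    ],
}

def disambiguate(name, context_words):
    """Second pass: boost candidates whose country appears in the
    surrounding text, falling back to population as a prior."""
    def rank(candidate):
        country_bonus = 10 if candidate["country"] in context_words else 0
        return (country_bonus, candidate["population"])
    return max(CANDIDATES[name], key=rank)

best = disambiguate("Sydney", {"Australia", "surfing"})
print(best["country"])  # Australia
```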

Mapping locations on World Nomads Journals
We decided to put our location extractor to the test on more than 37,000 blog entries hosted on World Nomads' Journals, a free travel blogging service provided by WorldNomads.com. We plotted the results on Google Earth, and they were amazing - we were able to map blog content to exact locations on the globe. In the example below, we matched the blog text "At Leigh Creek we came across the huge open cut coal mine" to Leigh Creek in South Australia. This is a good example of what you can achieve with your own location extractor and Google's mapping tools.


Try it out yourself
You can try out Travellr's location extractor using our online location web service, which also includes some debugging information about the word scoring. Travellr's location extractor is provided as a REST web service built on Scala + Step and deployed on Jetty.

Further reading and tools:
If you'd like to learn more about natural language processing, check out the following resources:
Learning Resources:
- NLTK Book (free & open, companion to the Natural Language Toolkit, an awesome introduction to NLP)
- Foundations of Statistical Natural Language Processing (Manning & Schuetze)

Libraries:
- OpenNLP (java, MaxEnt based, thread safe, what we presently use for sentence detection, tokenization, tagging and chunking)
- Stanford NLP Tools (java based, not thread safe)