Microsoft aims to build a better thesaurus

A team of researchers at Microsoft is looking to beat Roget at his own game.
Aiming to build a better thesaurus, the Writing Assistance project within Microsoft's research unit is tapping techniques developed to translate from one language to another.
Although thesauri are good at finding lots and lots of synonyms, they require the user to pick the right one because they aren't very good at understanding the context of what is being said. That's where the experience from doing machine translations comes in.
"We've taken the actual translation tables...and what we've done is we've taken those and said if a word in Chinese maps to two different English words maybe those two words are synonyms with some probability," said Christopher Brockett, a computational linguist and one of the Microsoft researchers leading the project.
The approach has two key benefits over a static thesaurus. First of all, the newer approach can do phrases, as opposed to single words. Also, it can draw on the context in which the phrase is used.
Brockett plans to show off a prototype of the tool at TechFest, Microsoft's annual internal science fair. It's just one of dozens of projects that will be shown as part of an effort to expose Microsoft's business units to the work being done in Microsoft's research labs.
TechFest is sort of like "The Dating Game" for Microsoft's research and product development arms. Research teams at Microsoft set up booths, somewhat like a high-school science fair, while product teams shuffle through looking for something that might give their efforts a leg up on the competition.
For the public, TechFest can also offer a glimpse at future product directions. For example, researcher Andy Wilson showed off a number of surface computing projects in the years leading up to the debut of Microsoft's Surface product.
As is the case with most of the projects, the thesaurus effort is still in its infancy.
"We're still working on the algorithms and how much work we give to the language pairs," Brockett said. "We have to get the quality up. There are usability issues that have to be looked into."
Over time, though, Brockett hopes the technique could be used to effectively translate whole sentences. Microsoft has a demonstration of that up on its Web site, but Brockett acknowledges such a treatment shows both the potential and the current limitations of the technology.
But would-be high-school plagiarists beware. Yes, the technology could someday translate the whole Wikipedia article for you, but it would likely translate the article the same way for all your classmates as well. And plagiarism detection software is evolving along with the science of machine translation.
As for the thesaurus itself, the technology would be a natural fit for Word, which already has a built-in traditional thesaurus. But the technology could also help Microsoft in another key area: search.
That's because while search engines are good at finding things like names, that have just one form, they have a harder time finding expressions that can be phrased in multiple ways.
That's less of an issue when searching across the whole Web. For example, searching "Who shot Abraham Lincoln?" "Who killed Abraham Lincoln" and "Who assassinated Abraham Lincoln" all direct you to a page with John Wilkes Booth.
However, when it comes to searching smaller universes, such as a company's intranet, that might not be the case.
"You might not find it if the words are different," Brockett said. In such cases, automatically searching using similar phrases might boost the likelihood of finding a result.
This article was first published as a blog post on CNET News.