Many years ago - I'd guess four, or five - I read that Google was developing a new technology for machine translation of human languages. This is a famously difficult task, one that people were initially optimistic would prove quite tractable, but which has turned out not to be. Google's new approach was to take "billions of words" of source material that had been painstakingly translated by humans, and have the computer build statistical rules for translation. It turns out there's a lot of this source material out there - things like UN internal documents and EU laws, to say nothing of famous old novels like Moby Dick.
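To make "statistical rules" a little more concrete, here is a toy sketch in Python of one classic way to learn word translations from nothing but translated sentence pairs: the expectation-maximization procedure usually called IBM Model 1. It is an illustration only, and nothing here is claimed to be what Google's system does. The three-sentence corpus and the names `train_word_table` and `best_translation` are invented for the example.

```python
from collections import defaultdict

# Toy parallel corpus, invented for illustration: (French, English) pairs.
CORPUS = [
    ("la maison", "the house"),
    ("la fleur", "the flower"),
    ("une fleur", "a flower"),
]


def train_word_table(corpus, iterations=20):
    """Estimate t[(english, french)] ~ P(english word | french word) by EM."""
    pairs = [(f.split(), e.split()) for f, e in corpus]
    english_vocab = {e for _, e_sent in pairs for e in e_sent}
    # Start with no preference: every English word is equally likely.
    t = defaultdict(lambda: 1.0 / len(english_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times f produced e
        total = defaultdict(float)  # expected number of times f produced anything
        for f_sent, e_sent in pairs:
            for e in e_sent:
                # Split one unit of credit for e among the French words
                # in the same sentence, in proportion to the current table.
                norm = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    share = t[(e, f)] / norm
                    count[(e, f)] += share
                    total[f] += share
        # Re-estimate the table from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def best_translation(table, french_word):
    """Most probable English word, or the input unchanged if never seen."""
    candidates = {e: p for (e, f), p in table.items() if f == french_word}
    if not candidates:
        return french_word
    return max(candidates, key=candidates.get)


table = train_word_table(CORPUS)
for word in ["la", "maison", "fleur", "une"]:
    print(word, "->", best_translation(table, word))
```

Nobody tells the program that "maison" means "house"; it works that out because "la" also shows up next to "the" in a sentence with no house in it. A word that never appeared in the training pairs has no candidates at all, so this sketch simply hands it back unchanged.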
I heard nothing else about it, and had almost forgotten about it. It seems I'd missed the announcement in spring of 2006 that they'd gone live with a version of this for English:Arabic and vice-versa. Today, they switched to the new system for all languages.
Their old system was provided by Systran. Like MapQuest, Systran is now in the unenviable position of having Google suddenly put out a product that's noticeably better than its own. Since Systran is still used by the best-name-in-class Babel Fish, it's quite possible to compare the two methods.
I've eschewed automatic translation in the past as being something that, if you tried to read Moby Dick with it, would probably be able to let you know it was about a whale. As a result, some of the features in the new Google Translate that impress me may not be completely new.
I do not, in any way, know French. I found a Le Monde article about the California wildfires, and ran it through both Google Translate and Babel Fish.
Unfortunately, both stumble out of the gate. The story attempts to emulate the look of a print publication by making the first letter of the first sentence a large capital. They do this not by increasing the font size, but by using a picture of a capital "C". This makes the opening phrase, "Cinq morts" ("Five dead"), appear to the translators to be "inq morts". Neither of them knows the French word "inq" (as it appears there is no such word), so both simply leave it there, rendering the phrase as "Cinq dead", with the "C" being an actual picture of a "C". This underscores a lot of the difficulties this field faces. Giving them the benefit of the doubt, I ran the opening sentence through each translator by hand, with "inq" replaced by "cinq", and this is what I got:
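For the curious, here is a small Python sketch of how a picture of a letter can vanish when text is pulled out of a page. The markup below is hypothetical: it is not copied from the Le Monde page, and in particular whether the real image carried an `alt` attribute is an assumption. The class name `TextExtractor` and the function `extract_text` are likewise made up for the example.

```python
from html.parser import HTMLParser

# Hypothetical markup: a drop cap done as an image, followed by the rest
# of the word as ordinary text.
HYPOTHETICAL_HTML = (
    '<p><img src="c.gif" alt="C">inq morts et 500 000 personnes évacuées</p>'
)


class TextExtractor(HTMLParser):
    """Collects the text of a page, optionally using image alt text."""

    def __init__(self, use_alt_text):
        super().__init__()
        self.use_alt_text = use_alt_text
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # An image contributes no text unless we choose to read its alt text.
        if tag == "img" and self.use_alt_text:
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html, use_alt_text=False):
    parser = TextExtractor(use_alt_text)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(extract_text(HYPOTHETICAL_HTML))                     # image ignored
print(extract_text(HYPOTHETICAL_HTML, use_alt_text=True))  # alt text used
```

If the image had no alt text at all, there would be nothing for a translator to recover, short of recognizing the letter in the picture itself.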
Original: Cinq morts et 500 000 personnes évacuées : le bilan des incendies qui ravagent le sud de la Californie depuis trois jours ne cesse de s'alourdir, alors que le présdident Bush est attendu dans la région mercredi 24 octobre.
Babel Fish: Five died and 500 000 evacuated people: the assessment of the fires which have devastated the south of California for three days does not cease being weighed down, whereas Bush présdident it is awaited in the area Wednesday October 24.
Google: Five people dead and 500,000 evacuees an assessment of fires ravaging southern California for three days continues to grow, while the présdident Bush is expected in the region Wednesday, October 24.
The Babel Fish translation is quite difficult to follow - you can figure out that five are dead and 500k displaced, and Bush is coming on Wednesday, and the fires are in southern California. I'm pretty sure the original sentence was saying something more like "Five people dead and 500,000 evacuees: damage assessments of fires that have ravaged southern California for three days continue to grow, while President Bush is expected in the region Wednesday, October 24." The Google translation is much better, but far from perfect.
Further in the article, from Babel Fish, we learn, "The zone around San Diego was touched hard by the flames, which were propagated with whole districts. The localities of Rancho Bernardo, Fallbrook and Ramona presented, Tuesday, of the scenes of apocalypse, with houses reduced in ashes and carcasses of cars strewing the streets.
Lodging houses were installed, including one in a stage where were, Tuesday, some 20 000 moved. More than 1 660 km2 on the whole left in smoke, according to the Californian administration." I love the image of lodging houses being installed, including one in a stage. As always, it's quite possible to puzzle this all out - San Diego was "hit hard" (not "touched hard") by fire, which went across whole districts. Particular places looked apocalyptic, and they set up emergency shelters for the evacuated, while 1,660 square kilometres - I mean, kilometers - went up in smoke.
Google handles this far from flawlessly, but better: "The area around San Diego has been the hardest hit by the flames, which have spread to entire neighborhoods. The locations in Rancho Bernardo, Fallbrook and Ramona showed on Tuesday, scenes of apocalypse, with houses reduced to ashes and carcasses of cars jonchant the streets.
The shelters were set up, including one in a stadium whereabouts, Tuesday, some 20,000 displaced people. More than 1660 km2 in total went up in smoke, according to the administration of California."
This does much better with idioms! "Up in smoke" and "hit hard" are both translated idiomatically, as opposed to Babel Fish's "left in smoke" and "touched hard," respectively. Google rather unaccountably misses the word "jonchant", which, in context, does seem to mean what Babel Fish suggests, "strewing". This isn't a completely bizarre word, either: Google finds 48,200 web pages using it (now, 48,201). Indeed, this French dictionary, which I can read only because of Google's translation tool, defines it as "present participle of the verb joncher," and lists "covering, disseminating, parsemant, covering lining" as synonyms. Google doesn't know "joncher," either. I've noticed this seems to be a general failing in the current Google system - a number of words come through untranslated. The Systran system underpinning Babel Fish seems to have a more thorough simple word-to-word dictionary.
This may be a side effect of Google's training documents. If your sources are all boring legal documents, you may not get much in the way of poetic words like "strewing". Presumably, this is improvable with more documents.
Google also has an amazing interface to translated pages. I could see this being a real boon to those trying to learn a language. If you hover your mouse over a sentence, it pops up a bubble with the original language in it. This allows you to see the structure of the original sentence, which can be helpful in places where significant reordering of words has been performed by the software. Additionally, for those fluent (or for extremely obvious mistakes), Google provides a "Suggest a better translation" link. If they're proactive about getting those improvements in, the system could get a lot better, quickly.
One reasonably obvious test of automatic systems is to run the output as the input - translate from English to French, and back again. Google, again, does noticeably better. The first lines of Mr. Melville's novel in the original English are "Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world." After a round trip, Babel Fish renders this as "Call Me Ishmael. A few years ago - never the spirit how long with precision not having little or not money in my purse, and nothing particular to interest me on the shore, I thought myself would not sail about and would see the aqueous part of the world." In addition to being almost unreadable, a gratuitous "not" has been thrown in, making it seem as though the narrator had decided not to sail the seas! Google's rendering isn't perfect, but is a lot more palatable: "Call me Ishmael. A few years ago, never mind how long precisely, having little or no money in my purse, and nothing particular interest to me on the shore, I thought I would sail a little and see liquid part of the world." I think it is quite possible to figure out what this says. It's, in fact, getting on towards not being unpleasant to read. Not completely there, yet, but a great leap forward from Systran's technology.
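The round-trip comparison can also be scored mechanically, at least crudely. The Python sketch below compares each round-trip result quoted above against the original, word by word, using the standard library's `difflib.SequenceMatcher`. The choice of measure is an assumption made for this example, not something either service uses, and a word-overlap score says nothing about whether a stray "not" has reversed the meaning. The helper names `words`, `similarity` and `round_trip` are invented here, and `round_trip` expects whatever two translator functions you have to hand.

```python
import re
from difflib import SequenceMatcher

ORIGINAL = (
    "Call me Ishmael. Some years ago - never mind how long precisely - "
    "having little or no money in my purse, and nothing particular to "
    "interest me on shore, I thought I would sail about a little and see "
    "the watery part of the world."
)
BABEL_FISH = (
    "Call Me Ishmael. A few years ago - never the spirit how long with "
    "precision not having little or not money in my purse, and nothing "
    "particular to interest me on the shore, I thought myself would not "
    "sail about and would see the aqueous part of the world."
)
GOOGLE = (
    "Call me Ishmael. A few years ago, never mind how long precisely, "
    "having little or no money in my purse, and nothing particular "
    "interest to me on the shore, I thought I would sail a little and see "
    "liquid part of the world."
)


def words(text):
    """Lowercase the text and keep only the words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def similarity(a, b):
    """Word-level similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, words(a), words(b)).ratio()


def round_trip(text, to_foreign, to_native):
    """Translate out and back; both translators are supplied by the caller."""
    return to_native(to_foreign(text))


for name, result in [("Babel Fish", BABEL_FISH), ("Google", GOOGLE)]:
    print(f"{name}: {similarity(ORIGINAL, result):.2f}")
```

On these three passages the Google round trip scores closer to the original than the Babel Fish one, which at least agrees with how the two read.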
Of course, where Google wants this to be useful is in search. And, already, you can see the results. If you search for something that primarily returns results in a language other than your own, Google will helpfully, right now, include a "Translate this page" link next to each result. Unfortunately, this is a bit of a crapshoot, as the summaries are still in the page's original language (e.g., French), which you can't read to see if the page linked is interesting. However, they have a fascinating new feature, "Translated Search." This lets you search for, say, "Cancun restaurants" in Spanish, and see the results, translated, in English. To be more explicit, it translates "Cancun Restaurants" into "Restaurantes Cancún", performs a Google search on that, and then translates the individual results.
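Those three steps (translate the query, search, translate the results back) can be sketched in a few lines of Python. This is only an outline of the flow as described above, not Google's code: `translated_search` is a made-up name, and the phrasebook and the one-page "index" are toy stand-ins invented so the example runs on its own.

```python
def translated_search(query, translate, search, user_lang, target_lang):
    """Translate the query, search with it, then translate each result back."""
    foreign_query = translate(query, user_lang, target_lang)
    foreign_results = search(foreign_query)
    results = [translate(r, target_lang, user_lang) for r in foreign_results]
    return foreign_query, results


# Toy stand-ins, invented for illustration only.
PHRASEBOOK = {
    ("en", "es"): {"Cancun restaurants": "Restaurantes Cancún"},
    ("es", "en"): {
        "Los mejores restaurantes de Cancún": "The best restaurants in Cancun",
    },
}
INDEX = {"Restaurantes Cancún": ["Los mejores restaurantes de Cancún"]}


def toy_translate(text, source, target):
    # Unknown text is passed through untranslated.
    return PHRASEBOOK.get((source, target), {}).get(text, text)


def toy_search(query):
    return INDEX.get(query, [])


query, results = translated_search(
    "Cancun restaurants", toy_translate, toy_search, "en", "es"
)
print(query)
print(results)
```

Because the translator and the search engine are passed in as plain functions, the same outline works with any pair of real services swapped in for the toys.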
A while back, I read someone from Google opining that they were hopeful automated translation could make the web truly language-neutral. I scoffed. Il semble soudain beaucoup plus proche. ("It suddenly seems much closer.")