Google Could Own the Internet
An Immodest Proposal
By Willy Chaplin
June 23, 2008
A New Vision
The Web has a dire need for a standard way of representing text in every language. Fortunately, such a standard exists, and Google's engineers are obviously aware of it…UTF-8. Google's implementation of its translation API is VERY good, for speed as well as accuracy. However, Google could go further, and by doing so could capture an unassailable pre-eminence, not in the Web as it is, but in the Web as it must be in the future.
Suppose that Google were to make a deal with the Mozilla Foundation to take over development of the Firefox browser. Keep it open source, since that is in line with using Web users themselves to speed development…but perhaps apply a bit more discipline to the process. Google could make UTF-8 the standard encoding, allowing no other.
This would mean that the majority of existing Web pages…now written in single-byte encodings for Western European languages…would still be displayed properly once converted, and it would also force the coming explosive development in the multi-byte world to follow that standard. Google could write page-conversion applications to turn odd character representations into UTF-8.
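As a rough illustration of what such a conversion step might look like, here is a minimal Python sketch. It assumes the page's legacy encoding is already known (say, from its HTTP header or meta tag); the function name `to_utf8` is hypothetical and not part of any Google or Mozilla tool.

```python
def to_utf8(raw: bytes, declared_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode a page's bytes from a known legacy encoding into UTF-8.

    Assumption: the caller already knows the source encoding. Guessing it
    from the bytes alone is a separate (and harder) problem.
    """
    text = raw.decode(declared_encoding)  # legacy bytes -> Unicode text
    return text.encode("utf-8")           # Unicode text -> UTF-8 bytes


# A Western European word stored as single-byte ISO-8859-1.
legacy = "café".encode("iso-8859-1")      # b'caf\xe9'
print(to_utf8(legacy))                    # b'caf\xc3\xa9'
```

Pure ASCII text passes through unchanged, since ASCII is already valid UTF-8; only the accented and non-Latin characters change their byte representation.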
Next, each user of the new and better Firefox would set, in his or her options, a preferred language for browsing the Web. Thereafter, (almost) EVERY page on the Web would appear in the browser in the preferred tongue. Those that were not originally in that language would be translated into it in real time. The language pairs that Google already supports would probably make this work for 99% of the pages on the Web. The same would apply to Google searches.
Similarly, Google could make an email application that does the same thing. That is, when an email is sent, it would contain only text in the sender's language. When it arrived, it would also contain a translation of that text into the recipient's preferred language.
Finally, following the lead of sites like Facebook and MySpace, Google could use the methodology above to create the next step in group-formation applications. Only Google's would have built-in real-time translation for EVERYTHING. Chat rooms would give each participant the opportunity to see all chatter in his or her preferred language. Indeed, while Facebook has improved on MySpace, both are foundering on two fronts. The first is what might be called “unanticipated side effects” of their designs. The amount of downtime each experiences…not to mention severe slowdowns…from excess demand on their servers is getting worse all the time. The second is the language barrier, which neither site has even attempted to breach. Yes, you can say things in Turkish or Arabic…but so what if nobody outside Turkey or the Arab countries can read it? Google has the wherewithal…and I think the will…to make all this happen…and in a very short time.
Some ups and downs from my own experiments:
- In line with my feeling that real-time translation is the Next Big Thing on the Web, I spent the last six months attempting, by various approaches, to write a prototype email application to demonstrate it. This gave me a real lesson in how poorly Unicode is supported among the many languages used to construct Web pages. The bad news is that after trying numerous combinations of methods to create a real-time translating email application, I surrendered to the Unicode gremlins and gave up. Here's a summary of my attempts:
- First, I attempted to write a Perl module that screen-scraped Yahoo! BabelFish's free translation site…powered by Systran…and then sent a two-part email to someone. The first part of the email was the text in its original language. The second was a translation of the first in one of the supported language pairs. This worked fine for all Western European alphabets, but failed miserably on multi-byte UTF-8. The translations were sometimes excellent, sometimes terrible. Click HERE to see my best attempt at this.
- Next, I tried a similar approach using InterTran's free translation site. InterTran uses a variety of text encodings. This meant that the encoding had to be set separately for each language pair, and sometimes the two could not both live on the same page. This also worked fairly well for Western European languages, but the quality of the translations was generally VERY poor. They were so poor that I abandoned this approach and have no example to show.
- Finally, at the urging of one of your programmers, I tried using Google's translation API, a JavaScript method. This was by far the easiest to program…up to a point…and the quality of the translations was far better than the other two. The problem occurs when one tries to transfer the text to an email and send it. First I tried a simple “mailto.” This worked for Western European languages, but collapsed when multi-byte UTF-8 characters were involved, the Outlook Express email application turning them all into question marks. Then I tried sending the complete text to a Perl application, but got the same results as I got with the screen-scraped text. Finally, I wrote a Java applet that sent email and…well, it ALMOST worked. Unfortunately, that approach…using the JAVAX sendMail package…did indeed pass along most of the UTF-8 multi-byte characters, but for some unknown reason garbled a lot of them as well. Click HERE to see my best attempt at this.
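For comparison, here is a minimal Python sketch of the two-part message these attempts were aiming at: original text first, translation second, with the character set declared explicitly. It is not a reconstruction of the Perl, JavaScript, or Java code described above. The function name `build_bilingual_email`, the separator line, and the addresses are hypothetical, and the translation is assumed to have been obtained already.

```python
from email.message import EmailMessage


def build_bilingual_email(sender: str, recipient: str, subject: str,
                          original: str, translation: str) -> EmailMessage:
    """Build one message holding the original text followed by its translation."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject  # non-ASCII subjects are encoded on serialization

    # Hypothetical layout: original first, then a separator, then the translation.
    body = original + "\n\n----\n\n" + translation

    # Declare UTF-8 explicitly and use base64 so the multi-byte characters
    # travel as plain ASCII and are decoded by the receiving mail client.
    msg.set_content(body, charset="utf-8", cte="base64")
    return msg


msg = build_bilingual_email(
    "sender@example.com", "recipient@example.com",
    "Привет", "Привет, мир", "Hello, world",
)
print(msg.get_content_charset())  # utf-8
```

Sending is left out on purpose. With a reachable mail server, something like `smtplib.SMTP(host).send_message(msg)` would hand the message over. Whether the recipient's mail client then renders it correctly is outside the sketch's control.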
By the way, although I am familiar with several non-English languages, I checked the quality of the translations I could not read by translating them back the other way to see whether the result resembled the original in a meaningful way.
So, what next? The good news…for me and for Google…is that this opens up a whole basket of wonderful opportunities for advancing the real-time translation idea into a whole new Internet era.
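The back-translation check mentioned above can be roughed out in a few lines of Python. This is only a sketch: the `translate` argument stands in for whatever translation service is used (it is a hypothetical callable, not a real API), and character-level similarity is a crude proxy for "resembles the original in a meaningful way."

```python
from difflib import SequenceMatcher
from typing import Callable

# Hypothetical signature: translate(text, source_lang, target_lang) -> text
Translator = Callable[[str, str, str], str]


def round_trip_score(text: str, source: str, target: str,
                     translate: Translator) -> float:
    """Translate text to the target language and back, then compare.

    Returns a similarity ratio between 0.0 and 1.0; higher means the
    back-translation is closer, character for character, to the original.
    """
    forward = translate(text, source, target)
    back = translate(forward, target, source)
    return SequenceMatcher(None, text.lower(), back.lower()).ratio()
```

A low score flags a translation worth a second look; a high score does not prove the translation is good, since errors can cancel out on the way back.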