Slowing the inevitable use of machine translation is the rather poor quality of most such translations. At present, the sites presenting such technology are using it as a shill for their…pricey…human translations and offline translation software sales, so there is little incentive for developers to dedicate the time and labor…not to mention money…necessary to improve a technology that they are literally giving away free. Besides the often laughable results of such translations, the software is very insensitive to the way people actually communicate. That is, real humans providing the source text are prone to misspelling and grammatical errors, not to mention the acronymic shorthand that is developing around chat and text messaging communication. The author knows of no translation sites that even attempt to deal with this recent, but almost certain to be permanent, development.
A second problem arises around the failure of most development languages to deal robustly with Unicode. An almost universal standard has arisen around UTF-8…a method of encoding Unicode strings as byte strings and decoding them again…but the software has lagged behind. Using screen scraping…that is, software-enabled Web page retrieval so that the pages can be mined for content programmatically…the author was able to use the free machine translation facilities of two of the better sites…BabelFish and InterTran…to enable real-time translation of his Web sites. You can see the results at the master home page of The Dream Machine’s dreamagic.com site. Clicking the large red button on that page leads to a page allowing Web pages to be translated from English to any one of 27 other languages. Furthermore, once a page has been translated, most links to that page are also translated to the same language. The word “most” is used…rather than “all”…since the software has some difficulty with generated pages. Nevertheless, the fact that this works at all is a testimony to the skill of those corporations’ programmers, if not to the collaborating translators.
On the other hand, attempting to extend this technology to email stalled on various Unicode problems. Languages like Russian, Korean or Chinese require more than one byte to represent each character. Developer software, e.g. PERL, a common way to generate Web pages programmatically, is very clumsy in dealing with this fact. While PERL developers have developed a large number of kludges to deal with Unicode, most PERL application programmers, including the author, have a devil of a time using them.
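The multi-byte point can be seen in a few lines. This is only a minimal sketch in Python 3 (an assumption; the author worked in PERL), and the sample words and the helper name utf8_profile are chosen here purely for illustration.

```python
# Compare the number of characters with the number of UTF-8 bytes
# for a few sample words (the words themselves are illustrative).
SAMPLES = {
    "English": "hello",
    "Russian": "привет",
    "Korean": "안녕",
    "Chinese": "你好",
}


def utf8_profile(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


for language, word in SAMPLES.items():
    chars, size = utf8_profile(word)
    print(f"{language}: {chars} characters, {size} bytes")
```

Any code that counts, slices or truncates bytes as if they were characters will mangle everything but the first sample, which is the kind of trouble described above.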
So, what to do about all this? The author will lay out a strategy involving three major goals based on his extensive experience both as a program developer and as an amateur linguist. They are:
This author will deal only with the latter two of these issues, since he has had little success getting anyone to even listen to his proposals, much less pony up the funds to get them going. Companies like MySpace, YouTube, FaceBook, Microsoft, Yahoo!, etc. have the wherewithal to carry out such a project, but apparently not the will. Google may have both. The existing, much smaller actual machine translation corporate entities seem to lack both the funding and the will.
In the next…and last…section, the author will lay out a strategy that will attempt to bridge this gulf. At his age (72), he lacks the stamina…and probably the time…to really attack the problem single-handedly. He has attempted to at least provide a few prototypical pieces of application software to demonstrate the potential of the technology, but no single individual can hope to deal with all the involved issues in an advanced manner.
In the early nineties, after seeing the movie 2001 for about the tenth time, the author was asked by a friend if he could develop a computer program like HAL, both the hero and villain of this classic movie. The only honest answer was “Probably not.” But it did pose an interesting challenge. So the author embarked on some experiments to see how he could manage the problem. Wise enough to start small, he decided to concentrate on “understanding” the various complications of expressing dates and times. For example, how could one deal with references to the same day expressed in all the ways real people talk about it? Suppose the current date is September 26, 2008 and one wishes to refer to October 2, 2008. Of course, the obvious and exact way to do so is to simply refer to “October 2, 2008.” This is exactly what most computer programs do. In ordinary conversation, that would be both awkward and a bit pedantic. So how about, “six days from today”? Or, “next Thursday”? Or, “on my birthday”…which happens to fall on that day? Or “the second of October” or “the first Thursday in October” or even, “exactly a month and two days before the election”. Each of these requires extensive knowledge about when certain things occur. That’s the easy part. Just a database problem.
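To make the date example concrete, here is a minimal sketch in Python 3 of how a few of these phrasings might be resolved to one calendar date. It is not the author's program: the function names, the three recognized patterns and the one-entry NAMED_DATES table (standing in for the "database" of known events, with the 2008 election on November 4) are all assumptions made for illustration.

```python
import calendar
import re
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4}

# Hypothetical stand-in for the database of known events.
NAMED_DATES = {"the election": date(2008, 11, 4)}


def next_weekday(today, name):
    """Return the first date strictly after today on the named weekday."""
    target = WEEKDAYS.index(name)
    return today + timedelta(days=(target - today.weekday() - 1) % 7 + 1)


def nth_weekday_of_month(year, month, name, n):
    """Return the n-th occurrence of the named weekday in a month."""
    first = date(year, month, 1)
    offset = (WEEKDAYS.index(name) - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def shift_months(day, months):
    """Move a date by whole months, clamping to the end of short months."""
    index = day.year * 12 + (day.month - 1) + months
    year, month0 = divmod(index, 12)
    last = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(day.day, last))


def resolve(phrase, today):
    """Map a few English date phrases to a date; None if unrecognized."""
    text = phrase.lower().strip()
    m = re.fullmatch(r"next (\w+)", text)
    if m and m.group(1) in WEEKDAYS:
        return next_weekday(today, m.group(1))
    m = re.fullmatch(r"the (\w+) (\w+) in (\w+)", text)
    if (m and m.group(1) in ORDINALS and m.group(2) in WEEKDAYS
            and m.group(3) in MONTHS):
        return nth_weekday_of_month(today.year, MONTHS[m.group(3)],
                                    m.group(2), ORDINALS[m.group(1)])
    m = re.fullmatch(r"the (\w+) of (\w+)", text)
    if m and m.group(1) in ORDINALS and m.group(2) in MONTHS:
        return date(today.year, MONTHS[m.group(2)], ORDINALS[m.group(1)])
    return None


today = date(2008, 9, 26)
for phrase in ["next Thursday", "the second of October",
               "the first Thursday in October"]:
    print(phrase, "->", resolve(phrase, today))
print("a month and two days before the election ->",
      shift_months(NAMED_DATES["the election"], -1) - timedelta(days=2))
```

The arithmetic is short once the phrase has been recognized. A phrase such as “on my birthday” falls through to None here, because nothing in the sketch knows whose birthday is meant.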
The harder tasks are parsing the reference and deciding to what it actually refers. To make a long story short, the author, over the next decade, did develop a way of dealing with this problem that involved tokenization of the words…representing them by multi-byte numbers…along with a collection of a lot of data about the utterances including synonyms, antonyms and a new notion of a “trigger”…a word or phrase that represents a single concept with a “hard” meaning. This included not only nouns, but also phrases requiring a response by the computer, like “Tell me about…” or “Please explain…”. The reader will note that, in some sense, these two utterances “mean” the same thing, so they can be represented by the same token in the database. Indeed, tokenization had to apply to both single words and whole phrases for the parsing to work well.
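The idea of tokenizing whole phrases as well as single words can be sketched as a longest-match lookup. What follows is an assumption-laden illustration in Python 3, not the author's implementation: the table contents, the token numbers, the TRIGGERS set and the function name tokenize are all invented for the example.

```python
import re

# Hypothetical token table: surface form -> numeric token.
# Forms that "mean" the same thing share one token.
TOKENS = {
    "tell me about": 1001,    # trigger phrase
    "please explain": 1001,   # same concept, same token
    "machine translation": 2001,
    "translation": 2002,
    "big": 3001,
    "large": 3001,            # synonym of "big"
}
TRIGGERS = {1001}             # tokens that call for a response
UNKNOWN = 0                   # token for anything not in the table


def tokenize(text, table=TOKENS, max_words=3):
    """Replace the longest known phrase at each position with its token."""
    words = re.findall(r"[\w']+", text.lower())
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in table:
                tokens.append(table[phrase])
                i += size
                break
        else:
            tokens.append(UNKNOWN)
            i += 1
    return tokens


first = tokenize("Tell me about machine translation")
second = tokenize("Please explain machine translation.")
print(first, second, first == second)
print("needs a response:", first[0] in TRIGGERS)
```

Trying the longest phrase first is what keeps “machine translation” from being split into an unknown word followed by “translation”.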
Finally, when machine translation caught his interest, he noticed that many of the ideas he had developed while trying to build HAL seemed to be applicable. This will be discussed in the next section.
The idea would be that this database would be…not counting the actual physical representations…identical for all languages, so that translating from one to another would require only a few steps.
This process resembles the original machine translation programmers’ fondest dream, reducing the translation of one language to another to a simple substitution problem, represented in the real world by the literal word-for-word translations that produce such terrible…but often quite amusing…results.
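One possible reading of such a token-based substitution is sketched below in Python 3. Everything in it is an assumption made for illustration: the shared token numbers, the tiny English and Spanish tables, and the function names to_tokens, from_tokens and translate.

```python
import re

# Hypothetical shared tokens, each with one surface form per language.
SURFACE = {
    "en": {101: "good morning", 102: "friend", 103: "thank you"},
    "es": {101: "buenos días", 102: "amigo", 103: "gracias"},
}


def to_tokens(text, lang, max_words=3):
    """Turn source text into shared tokens, longest phrase first."""
    lookup = {form: token for token, form in SURFACE[lang].items()}
    words = re.findall(r"\w+", text.lower())
    tokens, i = [], 0
    while i < len(words):
        for size in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size])
            if phrase in lookup:
                tokens.append(lookup[phrase])
                i += size
                break
        else:
            tokens.append(words[i])  # unknown word passes through as text
            i += 1
    return tokens


def from_tokens(tokens, lang):
    """Turn shared tokens back into the target language's surface forms."""
    forms = SURFACE[lang]
    return " ".join(forms[t] if isinstance(t, int) else t for t in tokens)


def translate(text, source, target):
    """Substitute via the shared tokens; no grammar is applied."""
    return from_tokens(to_tokens(text, source), target)


print(translate("Good morning, friend", "en", "es"))
print(translate("gracias amigo", "es", "en"))
```

Because the tokens cover whole phrases, “good morning” comes out as “buenos días” rather than as two separately translated words. Word order and agreement are untouched, which is where the amusing results of pure substitution come from.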
Furthermore, it is obvious that the success of such an approach depends completely upon the quality of the token database. This leads to the hard question, “What is the meaning of meaning?” This question has been tackled by philosophers of language from time immemorial and many proposals have been offered. We, however, need an objective approach to meaning, something we can put into a database. In addition, we need to collect an enormous amount of data to build the database for each language. Finally, to work in practice, this approach has to be extensible, modifiable and robust. If we can accomplish that, then however poorly the approach works at the outset, it can only get better as time goes on and data is added…and refined…in the database.
It is this author’s belief that the suggested approach will satisfy all these requirements. We already have…in machine-readable, and thus machine-collectible form…much of the data we would need to establish a prototypical token database. It’s called the World Wide Web. Each separate definition of a word would represent…at a basic level…a different meaning for that word and these can be rapidly collected and tokenized. But that’s only a start. Next, we would have to tag the data, to extend and refine the meaning of each representation. These tags would include, at least at the outset, a notion of “context.” This, too, can be readily collected from the Web, although it would also require review by humans. However, these people need be neither translators nor programmers, merely familiar with their own native language, for example, a Spanish-language graduate student.
Also, some notion of subtypes would be necessary. That is, a noun represents a “person, place or thing”. Depending upon the context, one of these could be added as a tag to the already available categorization of “noun.” Other tags might include notions like “slang” or “acronym” or “surname”. Many of these can and should be derived by machine searches of the Web. The more that can be found that way, the easier will be the eventual task of tokenizing the source text.
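A tagged entry of this kind might look like the following minimal sketch in Python 3. The field names, the sample entries, the token numbers and the lookup function are all hypothetical; the text does not specify a schema.

```python
from dataclasses import dataclass, field


@dataclass
class TokenEntry:
    """One meaning of one surface form, with its tags (hypothetical schema)."""
    token: int
    surface: str
    part_of_speech: str
    definition: str
    tags: set = field(default_factory=set)       # e.g. "place", "slang"
    contexts: set = field(default_factory=set)   # e.g. "finance", "chat"


# Each separate definition of a word gets its own token.
ENTRIES = [
    TokenEntry(5001, "bank", "noun", "an institution that holds money",
               tags={"place"}, contexts={"finance"}),
    TokenEntry(5002, "bank", "noun", "the land beside a river",
               tags={"place"}, contexts={"geography"}),
    TokenEntry(5003, "lol", "interjection", "an expression of amusement",
               tags={"acronym", "slang"}, contexts={"chat"}),
]


def lookup(surface, context=None, entries=ENTRIES):
    """Find entries for a surface form, narrowed by context when possible."""
    candidates = [e for e in entries if e.surface == surface.lower()]
    if context is not None:
        narrowed = [e for e in candidates if context in e.contexts]
        if narrowed:
            return narrowed
    return candidates


print([e.token for e in lookup("bank")])
print(lookup("bank", context="geography")[0].definition)
print(sorted(lookup("LOL")[0].tags))
```

Without a context, both meanings of “bank” come back. With one, the choice narrows to a single token, which is the job the context tag is meant to do.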
Once a token database has been created for each language, the translators…and programmers…need to ensure that each token refers to the same meaning in each language. Remember, the tokens represent whole phrases as well as single words. So the translators need only ensure that a translation…in either direction…results in the same understanding by users. Tagging this database would be a much more straightforward task than producing a translation of an entire document and could thus be done on a mass production scale by relatively unskilled translators. Borrowing an idea from Wikipedia, it could also be done almost completely online, with translators checking up on one another.
The most difficult tasks will be the parsing of the input and the production of grammatically correct and understandable output. This can probably be done in large part with fill-in-the-blanks templates. But it will require an extensive knowledge of syntax and grammar on the part of the developers of those templates.
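As a rough illustration of what a template with blanks might do on the output side, the sketch below (Python 3) lets each language order the same two concepts in its own way. The template name, the word tables and the render function are assumptions. Real templates would also have to handle agreement, gender and much else, which this sketch ignores.

```python
# Hypothetical per-language templates: same slots, different word order.
TEMPLATES = {
    "en": {"described_thing": "the {adjective} {noun}"},
    "es": {"described_thing": "el {noun} {adjective}"},
}

# Hypothetical concept -> surface form tables for each language.
WORDS = {
    "en": {"car": "car", "red": "red"},
    "es": {"car": "coche", "red": "rojo"},
}


def render(lang, template_name, **slots):
    """Fill a language's template with that language's words for each slot."""
    words = WORDS[lang]
    filled = {slot: words[concept] for slot, concept in slots.items()}
    return TEMPLATES[lang][template_name].format(**filled)


print(render("en", "described_thing", adjective="red", noun="car"))
print(render("es", "described_thing", adjective="red", noun="car"))
```

The caller names concepts, not words, so the same call yields “the red car” in English and “el coche rojo” in Spanish. The knowledge of syntax lives entirely in the templates, as the text says it must.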
In short, the process needs and uses the Web, the Web needs and uses the process. What better positive gain feedback loop could there be?