Machine Translation

By

Willy Chaplin

June 1, 2007

Introduction

This essay is motivated by the author’s firm conviction that machine translation…applied to Web pages, email and chat…is the Next Big Thing for the World Wide Web. Imagine, if you will, what such technology could do for group computing sites like MySpace, Facebook and YouTube. These sites are heavily engaged in bringing diverse groups of people together on the Web for various…user initiated…reasons. Unfortunately, almost all such sites depend heavily on the English language or, alternatively, are dedicated to some other specific language. This tends to “ghetto-ize” the Web, forcing the vast majority of users…who do not usually speak or read English…into dealing only with those sites that favor their native tongues. Now suppose that email, chat rooms and the Web itself all have real time machine translation facilities, so that the sender writes in his or her native language, while the recipient reads the communication in his or her own. Imagine how far this could take the true internationalization of the Net.

Slowing the inevitable use of machine translation is the rather poor quality of most such translations. At present, the sites presenting such technology are using it as a shill for their…pricey…human translations and offline translation software sales, so there is little incentive for developers to dedicate the time and labor…not to mention money…necessary to improve a technology that they are literally giving away free. Besides the often laughable results of such translations, the software is very insensitive to the way people actually communicate. That is, real humans providing the source text are prone to misspelling and grammatical errors, not to mention the acronymic shorthand that is developing around chat and text messaging communication. The author knows of no translation sites that even attempt to deal with this recent, but almost certain to be permanent, development.

A second problem arises around the failure of most development languages to deal robustly with Unicode. An almost universal standard has arisen around UTF-8…a method of encoding Unicode strings as byte strings and decoding them again…but the software has lagged behind. Using screen scraping…that is, software enabled Web page retrieval so that the pages can be mined for content programmatically…the author was able to use the free machine translation facilities of two of the better sites…BabelFish and InterTran…to enable real time translation of his Web sites. You can see the results at the master home page of The Dream Machine’s dreamagic.com site. Clicking the large red button on that page leads to a page allowing Web pages to be translated from English to any one of 27 other languages. Furthermore, once a page has been translated, most links to that page are also translated to the same language. The word “most” is used…rather than “all”…since the software has some difficulty with generated pages. Nevertheless, the fact that this works at all is a testimony to the skill of those corporations’ programmers, if not to the collaborating translators.

On the other hand, attempting to extend this technology to email stalled on various Unicode problems. Languages like Russian, Korean or Chinese require more than one byte to represent each character. Developer software, e.g. Perl, a common way to generate Web pages programmatically, is very clumsy in dealing with this fact. While Perl developers have developed a large number of kludges to deal with Unicode, most Perl application programmers, including the author, have a devil of a time using them.
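The multi-byte point can be seen directly. The following is a minimal Python sketch, not the author's Perl code; the sample characters and the helper name utf8_width are chosen here purely for illustration.

```python
# Illustrative only: how many bytes UTF-8 needs for one character in a few scripts.
# The sample characters are arbitrary picks, not taken from the essay.
samples = {
    "English": "A",
    "Russian": "Я",
    "Korean": "한",
    "Chinese": "中",
}

def utf8_width(ch: str) -> int:
    """Number of bytes UTF-8 uses to encode a single character."""
    return len(ch.encode("utf-8"))

widths = {lang: utf8_width(ch) for lang, ch in samples.items()}

for lang, ch in samples.items():
    print(f"{lang}: {ch!r} takes {widths[lang]} byte(s)")

# A string's length in characters and its length in bytes differ,
# which is where byte-oriented software tends to go wrong.
greeting = "Привет"
raw = greeting.encode("utf-8")
assert raw.decode("utf-8") == greeting  # round trip with the same encoding is lossless
```

Software that counts, slices or truncates by bytes rather than by characters will split such characters in the middle, which is one way mail and Web text becomes garbled.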

So, what to do about all this? The author will lay out a strategy involving three major goals based on his extensive experience both as a program developer and as an amateur linguist. They are:

This author will deal only with the latter two of these issues, since he has had little success getting anyone to even listen to his proposals, much less pony up the funds to get them going. Companies like MySpace, YouTube, Facebook, Microsoft, Yahoo!, etc. have the wherewithal to carry out such a project, but apparently not the will. Google may have both. The existing, much smaller corporate entities that actually do machine translation seem to lack both the funding and the will.

Translator/Programmer Interaction

As mentioned in the introduction, translators and programmers have skills that don’t overlap very much. The author’s experience with both groups suggests that a way must be found to simplify the interaction between them. Since members of these two groups seldom possess both skills, those who do…like the author…MUST take the lead in the development. They are the natural leaders who must emerge if machine translation is ever to really get off the ground.

In the next…and last…section, the author will lay out a strategy that will attempt to bridge this gulf. At his age (72), he lacks the stamina…and probably the time…to really attack the problem single-handedly. He has attempted to at least provide a few prototypical pieces of application software to demonstrate the potential of the technology, but no single individual can hope to deal with all the involved issues in an advanced manner.

In the early nineties, after seeing the movie 2001 for about the tenth time, the author was asked by a friend if he could develop a computer program like HAL, both the hero and villain of this classic movie. The only honest answer was “Probably not.” But it did pose an interesting challenge. So the author embarked on some experiments to see how he could manage the problem. Wise enough to start small, he decided to concentrate on “understanding” the various complications of expressing dates and times. For example, how could one deal with references to the same day expressed in all the ways real people talk about it? Suppose the current date is September 25, 2008 and one wishes to refer to October 2, 2008. Of course, the obvious and exact way to do so is to simply refer to “October 2, 2008.” This is exactly what most computer programs do. In ordinary conversation, that would be both awkward and a bit pedantic. So how about, “a week from today”? Or, “next Thursday”? Or, “on my birthday”…which happens to fall on that day? Or “the second of October” or “the first Thursday in October” or even, “exactly a month and two days before the election”. Each of these requires extensive knowledge about when certain things occur. That’s the easy part. Just a database problem.
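To make the “database problem” half concrete, here is a minimal Python sketch that resolves each of the phrasings above to a calendar date once the phrase has already been recognized. It is not the author's program. It takes September 25, 2008 (a Thursday) as the current date, since that is the day on which “a week from today” and “next Thursday” both land on October 2; the birthday entry and all function names are hypothetical.

```python
from datetime import date, timedelta

TODAY = date(2008, 9, 25)      # assumed "current date" for this sketch (a Thursday)
ELECTION = date(2008, 11, 4)   # U.S. election day, 2008
BIRTHDAY = (10, 2)             # hypothetical (month, day) from a personal-facts table
THURSDAY = 3                   # datetime convention: Monday is 0

def next_weekday(start: date, weekday: int) -> date:
    """First date strictly after `start` that falls on `weekday`."""
    days_ahead = (weekday - start.weekday() - 1) % 7 + 1
    return start + timedelta(days=days_ahead)

def nth_weekday(year: int, month: int, weekday: int, n: int) -> date:
    """The n-th occurrence of `weekday` in the given month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))

def months_before(d: date, months: int) -> date:
    """Same day of month, `months` earlier (no end-of-month clamping)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, d.day)

# Each phrase is paired with the calculation it stands for.
resolved = {
    "October 2, 2008": date(2008, 10, 2),
    "a week from today": TODAY + timedelta(weeks=1),
    "next Thursday": next_weekday(TODAY, THURSDAY),
    "on my birthday": date(TODAY.year, *BIRTHDAY),
    "the second of October": date(TODAY.year, 10, 2),
    "the first Thursday in October": nth_weekday(2008, 10, THURSDAY, 1),
    "exactly a month and two days before the election":
        months_before(ELECTION, 1) - timedelta(days=2),
}

for phrase, day in resolved.items():
    print(f"{phrase!r} resolves to {day.isoformat()}")
```

The table is written by hand here; recognizing the phrase in free text and choosing the right calculation is the harder parsing task.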

The harder tasks are parsing the reference and deciding to what it actually refers. To make a long story short, the author, over the next decade, did develop a way of dealing with this problem that involved tokenization of the words…representing them by multi-byte numbers…along with a collection of a lot of data about the utterances including synonyms, antonyms and a new notion of a “trigger”…a word or phrase that represents a single concept with a “hard” meaning. This included not only nouns, but also phrases requiring a response by the computer, like “Tell me about…” or “Please explain…”. The reader will note that, in some sense, these two utterances “mean” the same thing, so they can be represented by the same token in the database. Indeed, tokenization had to apply to both single words and whole phrases for the parsing to work well.
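A minimal Python sketch of phrase-level tokenization along these lines follows. The essay does not describe the author's actual data layout, so the lexicon, the token numbers and the longest-match rule are all assumptions made for illustration.

```python
# Hypothetical token table: several surface forms may share one numeric token.
LEXICON = {
    ("tell", "me", "about"): 1001,     # trigger: a request for information
    ("please", "explain"): 1001,       # same concept, so the same token
    ("machine", "translation"): 2001,  # a whole phrase with a single meaning
    ("translation",): 2002,
    ("unicode",): 2003,
}
TRIGGERS = {1001}                      # tokens that demand a response
MAX_PHRASE = max(len(key) for key in LEXICON)
UNKNOWN = 0                            # placeholder for words not in the table

def tokenize(text: str) -> list[int]:
    """Greedy longest-match tokenization over words and multi-word phrases."""
    words = [w.strip(".,!?\"'") for w in text.lower().replace("…", " ").split()]
    words = [w for w in words if w]
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(MAX_PHRASE, len(words) - i), 0, -1):
            key = tuple(words[i:i + size])
            if key in LEXICON:
                tokens.append(LEXICON[key])
                i += size
                break
        else:
            tokens.append(UNKNOWN)
            i += 1
    return tokens

a = tokenize("Tell me about machine translation.")
b = tokenize("Please explain machine translation")
print(a, b, a == b)
```

Because both requests reduce to the same token sequence, anything downstream of the tokenizer can treat them as one utterance.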

Finally, when machine translation caught his interest, he noticed that many of the ideas he had developed while trying to build HAL seemed to be applicable. This will be discussed in the next section.

The Strategy

Suppose one had collected, for each major language in the world, a database of tokens representing words or phrases that have a particular meaning. Obviously, this would require a much more extensive collection of linguistic data than, say, an ordinary dictionary or thesaurus. Also, unlike those documents, which are organized around the appearance of individual words, this database would have to be organized around the meaning of the words or phrases. This would require some objective notion of the meaning of meaning…a classic problem in semantics…which will be dealt with later.

The idea would be that this database would be…not counting the actual physical representations…identical for all languages, so that translating from one to the other would require only a few steps.

This process resembles the original machine translation programmers’ fondest dream, reducing the translation of one language to another to a simple substitution problem, represented in the real world by the literal word-for-word translations that produce such terrible…but often quite amusing…results.
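One possible reading of that substitution process, sketched in Python under stated assumptions: the three-language table, its token numbers and the function names are invented for illustration, and the input is assumed to be already split into meaningful phrases.

```python
# Hypothetical, hand-made entries; a real database would be vastly larger.
# The token numbers are the same in every language; only the surface forms differ.
TOKEN_DB = {
    "en": {101: "good morning", 102: "thank you", 103: "friend"},
    "es": {101: "buenos días", 102: "gracias", 103: "amigo"},
    "de": {101: "guten Morgen", 102: "danke", 103: "Freund"},
}

def surface_to_token(lang: str) -> dict[str, int]:
    """Invert one language's table: surface phrase to shared token."""
    return {phrase.lower(): token for token, phrase in TOKEN_DB[lang].items()}

def translate(phrases: list[str], source: str, target: str) -> list[str]:
    """Map each source phrase to its token, then to the target surface form."""
    lookup = surface_to_token(source)
    output = []
    for phrase in phrases:
        token = lookup.get(phrase.lower())
        # Phrases with no token are passed through in brackets, untranslated.
        output.append(TOKEN_DB[target].get(token, f"[{phrase}]"))
    return output

print(translate(["Good morning", "friend"], "en", "es"))
print(translate(["gracias"], "es", "de"))
```

Note that no language pair is special here: adding a new language means adding one table, not one table per pair.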

Furthermore, it is obvious that the success of such an approach is completely dependent upon the quality of the token database. This leads to the hard question, “What is the meaning of meaning?” This question has been tackled by philosophers of language from time immemorial and many proposals have been offered. We, however, need an objective approach to meaning, something we can put into a database. In addition, we need to collect an enormous amount of data to build the database for each language. Finally, to work in practice, this approach has to be extensible, modifiable and robust. If we can accomplish that, then however poorly the approach works at the outset, it can only get better as time goes on and data is added…and refined…in the database.

It is this author’s belief that the suggested approach will satisfy all these requirements. We already have…in machine-readable, and thus machine-collectible form…much of the data we would need to establish a prototypical token database. It’s called the World Wide Web. Each separate definition of a word would represent…at a basic level…a different meaning for that word and these can be rapidly collected and tokenized. But that’s only a start. Next, we would have to tag the data, to extend and refine the meaning of each representation. These tags would include, at least at the outset, a notion of “context.” This, too, can be readily collected from the Web, although it would also require review by humans. However, these people need not be translators or programmers, merely familiar with their own native language; a Spanish language graduate student, for example.

Also, some notion of subtypes would be necessary. That is, a noun represents a “person, place or thing”. Depending upon the context, one of these subtype tags could be added to the already available categorization of “noun.” Other tags might include notions like “slang” or “acronym” or “surname”. Many of these can and should be derived by machine searches of the Web. The more that can be found that way, the easier will be the eventual task of tokenizing the source text.
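A tagged entry of this kind might look like the following Python sketch. The field names, token numbers and context words are hypothetical, and choosing a sense by counting shared context words is only one simple possibility, not a method described in the essay.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sense:
    """One meaning of one surface form, with its tags (all fields hypothetical)."""
    token: int
    surface: str
    part_of_speech: str
    subtype: str = ""                    # e.g. "person", "place" or "thing" for nouns
    tags: frozenset = frozenset()        # e.g. "slang", "acronym", "surname"
    context: frozenset = frozenset()     # words that tend to appear nearby

SENSES = [
    Sense(3001, "bank", "noun", subtype="place",
          context=frozenset({"money", "loan", "account"})),
    Sense(3002, "bank", "noun", subtype="place",
          context=frozenset({"river", "water", "fishing"})),
    Sense(3003, "lol", "interjection",
          tags=frozenset({"slang", "acronym"})),
]

def pick_sense(surface: str, nearby_words: list[str]):
    """Choose the sense whose context words overlap most with the nearby words."""
    candidates = [s for s in SENSES if s.surface == surface.lower()]
    if not candidates:
        return None
    nearby = {w.lower() for w in nearby_words}
    # On a tie, max() keeps the first candidate in the list.
    return max(candidates, key=lambda s: len(s.context & nearby))

print(pick_sense("bank", ["the", "river", "was", "high"]).token)
print(pick_sense("bank", ["open", "an", "account"]).token)
```

The two senses of “bank” carry different tokens, so they can map to different words in a target language even though they share one spelling in English.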

Once a token database has been created for each language, the translators…and programmers…need to ensure that each token refers to the same meaning in each language. Remember, the tokens represent whole phrases as well as single words. So the translators need only ensure that a translation…in either direction…results in the same understanding by users. Tagging this database would be a much more straightforward task than producing a translation of an entire document and could thus be done on a mass production scale by relatively unskilled translators. Borrowing an idea from Wikipedia, it could also be done almost completely online, with translators checking up on one another.
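Part of that checking could be mechanical. The Python sketch below, with an invented two-language table and a hypothetical function name, lists the tokens each language still lacks, which is the sort of work list reviewers might be handed. Whether two entries really mean the same thing remains a human judgment.

```python
# Hypothetical tables; the Spanish one is deliberately missing an entry.
TOKEN_DB = {
    "en": {101: "good morning", 102: "thank you", 103: "friend"},
    "es": {101: "buenos días", 102: "gracias"},
}

def missing_tokens(db: dict[str, dict[int, str]]) -> dict[str, list[int]]:
    """For each language, the tokens defined elsewhere but not yet in that language."""
    all_tokens = set()
    for entries in db.values():
        all_tokens.update(entries)
    return {lang: sorted(all_tokens - set(entries)) for lang, entries in db.items()}

gaps = missing_tokens(TOKEN_DB)
for lang, tokens in gaps.items():
    print(lang, "needs entries for tokens:", tokens)
```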

The most difficult tasks will be the parsing of the input and the production of grammatically correct and understandable output. This can probably be done in large part with fill-in-the-blanks templates. But it will require an extensive knowledge of syntax and grammar on the part of the developers of those templates.
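The template idea can be sketched in a few lines of Python. The pattern name, the templates and the word tables are hypothetical, and the Spanish template quietly assumes a masculine singular noun, which is exactly the kind of grammatical knowledge a template developer would have to supply.

```python
# Hypothetical per-language templates for one grammatical pattern.
TEMPLATES = {
    "en": {"noun_phrase": "the {adjective} {noun}"},
    "es": {"noun_phrase": "el {noun} {adjective}"},  # assumes a masculine singular noun
}

# Hypothetical shared tokens with their surface forms in each language.
WORDS = {
    "en": {501: "car", 502: "red"},
    "es": {501: "coche", 502: "rojo"},
}

def render(lang: str, pattern: str, noun_token: int, adjective_token: int) -> str:
    """Fill a language's template with that language's words for the given tokens."""
    template = TEMPLATES[lang][pattern]
    return template.format(
        noun=WORDS[lang][noun_token],
        adjective=WORDS[lang][adjective_token],
    )

print(render("en", "noun_phrase", 501, 502))
print(render("es", "noun_phrase", 501, 502))
```

The same pair of tokens yields a different word order in each language because the ordering lives in the template, not in the token database.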

Summary

The suggested strategy, although as massive as one would expect of such an enormous undertaking, can be accomplished by relatively disconnected and low level personnel, not to mention by computer searches. Only the AI development stages require extensive knowledge of both language and programming. Finally, and most importantly, once established, the technology can only get better. The content of the Web…in all languages…will continue to grow exponentially for some time. Each addition of new words, phrases and meanings to the token database will make the whole process marginally better. When the tokenizers of one language have to introduce a new meaning tag, every other language database can add the same tag, probably with most of the update work done programmatically.

In short, the process needs and uses the Web, the Web needs and uses the process. What better positive gain feedback loop could there be?


Willy Chaplin is an AI expert with almost a half century of experience in the field. He was on the Internet in December 1969, one month after it started up as ARPANET, has designed and written almost 2,000,00 lines of computer code and has post-graduate level training in Mathematics, Computer Science, Psychology and Linguistics.