Projects

L3

Amid the current explosion in the quantity of information and in the means of accessing it, much of the world has been left behind because the information is not available in a language that people there understand. The L3 project ("Learning Lots of Languages") has the long-term goal of developing a system to translate to and from many under-represented languages of the Global South and, less ambitiously, of creating tools for information retrieval and computer-assisted language learning in these languages.

The nature of this translation task, in particular the dearth of data for training the system, makes this a special case of machine translation. Within machine translation, as within the larger field of human language technology, there has been a recent move toward statistical methods, in which a system is trained to perform tasks on the basis of patterns of occurrence that it gleans from massive amounts of data. For machine translation, the relevant data are bilingual texts, as well as monolingual texts in each of the two languages. The problem for L3 is that bilingual data are largely unavailable for the languages of interest; in fact, even monolingual data may be scarce for these languages. This means that we must rely initially on knowledge-based methods, that is, methods based on explicit grammars of the languages and, where available, bilingual dictionaries. While these methods will not be adequate in the long run, they should serve to get the translation system off the ground for these languages.

To go beyond the rudimentary translations possible with such a system, two sorts of problems need to be solved. First, we need a way to gather more data to be used in training the system using statistical methods. This is one purpose of the Guampa project, described below. Second, we need to integrate the two kinds of knowledge, the initial symbolic knowledge embodied in the grammar and the dictionaries, and the subsequent statistical knowledge gained from training on new data. The challenge of integrating these two sorts of knowledge into a single system is one of the fundamental problems in human language technology today.
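As a rough illustration of what combining the two kinds of knowledge could look like at the level of a single word, the sketch below mixes a uniform prior over dictionary entries with relative frequencies from aligned data. This is not the L3 system's method. The function name `rank_translations`, the `weight` parameter, and the toy Spanish-to-English data are all assumptions made for this example, and English is the target language only to keep the example readable.

```python
def rank_translations(word, dictionary, aligned_counts, weight=0.5):
    """Rank candidate translations of `word`.

    dictionary:      hypothetical bilingual dictionary, word -> list of translations
    aligned_counts:  hypothetical counts from bilingual data, word -> {translation: count}
    weight:          how much to trust the data when any is available (assumed value)
    """
    dict_entries = list(dictionary.get(word, []))
    counts = aligned_counts.get(word, {})
    # Candidates come from the dictionary first, then from the data.
    candidates = dict_entries + [t for t in counts if t not in dict_entries]
    if not candidates:
        return []
    total = sum(counts.values())
    # With no data for this word, fall back entirely on the dictionary.
    lam = weight if total else 0.0
    scored = []
    for t in candidates:
        prior = 1.0 / len(dict_entries) if t in dict_entries else 0.0
        freq = counts.get(t, 0) / total if total else 0.0
        scored.append((t, (1 - lam) * prior + lam * freq))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Toy usage: the dictionary alone cannot separate the two senses,
# but a handful of aligned examples can.
toy_dictionary = {"casa": ["house", "home"]}
toy_counts = {"casa": {"house": 9, "home": 1}}
print(rank_translations("casa", toy_dictionary, {}))
print(rank_translations("casa", toy_dictionary, toy_counts))
```

A fixed interpolation weight is the simplest possible choice; a real system would more plausibly let the weight grow with the amount of data observed for each word.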

Guampa

Traditionally, the lack of documents in a particular language has been addressed to a large extent by translators. Translation permits the use of relatively small languages in government, business, and education because documents written originally in languages such as English, Spanish, and Chinese have been translated into these languages. This is an expensive and time-consuming process, however, and the clearest examples are national or regional languages spoken in relatively rich European countries, for example, Icelandic, Estonian, Slovenian, Catalan, and Basque.

Machine translation can help with this process, but since the goal is documents of publication quality, human translators will always need to be in the loop. Because the community of trained translators is relatively small for many languages with limited resources, collaboration is another way to speed up the production of documents. Computer-assisted translation (CAT) systems, both proprietary and free and open-source, are widely available and used by translators all over the world. However, none of these systems is optimized for languages with limited resources, and none goes beyond the most rudimentary help for collaborative translation.

The goal of the Guampa project is the development of a set of online tools for collaborative translation. We are making use of wiki software that presents translators with texts to be translated from one language to the other, maintains a history of different versions of the translated texts, and permits users to comment on and discuss their alternative translations. We are focusing on the language pair Spanish-Guarani because of the committed group of language rights activists in Paraguay, where Guarani is spoken (in fact, the project got its name from the word used in Paraguay for the container that yerba mate is drunk from). However, we intend Guampa to be a general system for communities interested in translating between language pairs, especially when one or both of the languages have few resources.

Chipa

A key problem in translation is lexical choice. A single word or phrase in the source language may have multiple possible translations in the target language. Producing a translation containing a wrong word may severely damage the fidelity or fluency of the translation. Cross-lingual disambiguation is the process of solving or attempting to solve this problem in a machine translation system. When translating into an under-represented language, the problem is even more serious, since the models typically used in machine translation are less effective when there is little data to train them on. The goal of the Chipa project is to address cross-lingual disambiguation for machine translation into a language with few resources. One innovation is to make use of the resources available for the source language (including corpora relating the source language to other languages) to get a richer representation of the meaning of the input sentence so that we can make more accurate word choices.
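To make the lexical choice problem concrete, here is a minimal sketch of cross-lingual disambiguation treated as classification: given the source-language words around an ambiguous word, pick the most likely translation. It uses a naive Bayes classifier with add-one smoothing, which is a generic textbook technique and not necessarily what Chipa implements. The function names `train` and `choose` and the toy Spanish-to-English examples for "banco" are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build counts from (context_words, translation) pairs for one source word."""
    label_counts = Counter()              # how often each translation was chosen
    feature_counts = defaultdict(Counter) # context-word counts per translation
    vocab = set()
    for context, label in examples:
        label_counts[label] += 1
        for w in context:
            feature_counts[label][w] += 1
            vocab.add(w)
    return label_counts, feature_counts, vocab


def choose(context, model):
    """Return the translation with the highest naive Bayes log score."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    best, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denom = sum(feature_counts[label].values()) + len(vocab)
        for w in context:
            if w in vocab:  # ignore context words never seen in training
                score += math.log((feature_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best


# Toy training data (hypothetical): contexts of Spanish "banco" with the
# English translation chosen in each case.
toy_examples = [
    (["dinero", "cuenta"], "bank"),
    (["préstamo", "dinero"], "bank"),
    (["parque", "sentarse"], "bench"),
]
model = train(toy_examples)
print(choose(["cuenta", "dinero"], model))
print(choose(["parque"], model))
```

The features here are only neighboring source words. The richer representation described above would correspond to adding further features to `context`, for example the translations of the same sentence into other languages, when such corpora exist for the source language.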