processing languages
of the global south

Minimal Dependency Translation

Introduction

For the great majority of the world's languages we lack adequate resources to make use of the machine learning techniques that have become the standard in modern computational linguistics. Languages with inadequate resources include not only those with few speakers, many of them endangered, but also a number of Asian and African languages with tens of millions of speakers, such as Telugu, Burmese, Oromo, and Hausa. For machine translation (MT) and computer-assisted translation (CAT), the lack is even more serious because what is required is bitext, that is, sentence-aligned translations between the language in question and another language.

For these reasons, work on many such languages will continue to consist in large part in the writing of computational grammars and lexica by people. Because such work normally requires significant training and is notoriously time-consuming, there is a need for tools to permit researchers and language technology users to "get off the ground" with these languages, that is, to create rudimentary grammars and lexica that will permit some basic applications, and, in the case of endangered languages, will facilitate the documentation process.

We are particularly interested in MT and CAT and the grammars and lexica that they require. Our long-term goal is a system that allows naive users to write bilingual lexicon-grammars for low-resource languages that can also be updated on the basis of monolingual and bilingual corpora, to the extent these are available. Our short-term goal is the development of a running CAT system for two language pairs using a simple grammar and lexicon of the type we are developing.

In this document we describe the initial steps in developing Minimal Dependency Translation, a lexical-grammatical framework for MT and CAT. Although our focus is on the language pairs Spanish-Guarani, Amharic-English, and Amharic-Oromo, we illustrate MDT with examples from English-Spanish.

Lexica and grammars

Phrasal lexica

The idea of treating phrases rather than individual words as the basic units of a language goes back at least to the proposal of a Phrasal Lexicon by Becker1. In recent years, the idea has gained currency within the related frameworks of Construction Grammar2 and Frame Semantics3, as well as in phrase-based statistical machine translation, which in one form or another now dominates the MT field. Arguments in favor of phrasal units are often framed in terms of the ubiquity of idiomaticity, that is, departure to one degree or another from strict compositionality.

Seen another way, phrasal units address the ubiquity of lexical ambiguity. If a verb's interpretation depends on its object or subject, then it may make more sense to treat the combination of the verb and particular objects or subjects as units in their own right.

These related arguments based on idiomaticity and ambiguity are semantic; they point out the advantages of phrasal units in systems that are concerned with meaning. But the arguments extend naturally to translation. If the meaning of a phrase in the source language fails to be the strict combination of the meanings of the individual words in the phrase, then it is unlikely that the translation of the phrase will be the combination of the translations of the words in the phrase. If a noun or verb has multiple translations, then adding lexical context to the noun or verb could permit an MT system to select the appropriate translation.

A simple phrasal lexicon

The basic lexical entries of MDT are multi-word units called groups. Each group represents a catena4. Catenae go beyond constituents (phrases), including all combinations of elements that are continuous in the vertical dimension within a dependency tree. For example, in the sentence I gave her a piece of my mind, I, gave and gave, her, piece are among the catenae but not the constituents of the sentence.
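The difference between catenae and constituents can be checked mechanically. The following Python sketch is only an illustration, not part of MDT: the head map `HEADS` is one assumed dependency analysis of the example sentence, and the function names are hypothetical. A set of words counts as a catena if it is connected in the tree, and as a constituent if it also contains every dependent of its members.

```python
# Assumed dependency analysis of "I gave her a piece of my mind":
# each word maps to its head; the root maps to None.
HEADS = {
    "I": "gave", "gave": None, "her": "gave", "piece": "gave",
    "a": "piece", "of": "piece", "mind": "of", "my": "mind",
}

def is_catena(words, heads=HEADS):
    """True if the words form a connected piece of the dependency tree."""
    words = set(words)
    if not words:
        return False
    # In a tree, a word set is connected exactly when only one of its
    # members has a head outside the set.
    tops = [w for w in words if heads[w] not in words]
    return len(tops) == 1

def is_constituent(words, heads=HEADS):
    """True if the words form a catena that includes all of its dependents."""
    words = set(words)
    if not is_catena(words, heads):
        return False
    # No word outside the set may depend on a word inside it.
    return all(heads[w] not in words for w in heads if w not in words)

print(is_catena(["gave", "her", "piece"]), is_constituent(["gave", "her", "piece"]))
```

Under this analysis, both I, gave and gave, her, piece are catenae without being constituents, while my mind is both.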

A catena has a head, and each MDT group must also have a head, the main function of which is to index the group within the lexicon. A group's entry may also specify translations to groups in one or more other languages, as in the phrase tables of phrase-based statistical machine translation (PBSMT) systems. For each translation, the group's entry gives an alignment between the groups, representing correspondences between group elements, again as in PBSMT. Figure 1 shows a simple group entry of this sort. The English group the end of the world with head end has as its Spanish translation the group el fin del mundo (which must have an entry in the Spanish lexicon). In the alignment, every word in the English group except the fourth (the) is associated with a word in the Spanish group.
end:
  - the end of the world
    ->spa
      - [el_fin_del_mundo,
         {align: [1,2,3,0,4]}]

Figure 1. Entry for the end of the world and its Spanish translation.
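To make the alignment list concrete, the sketch below reads it as one 1-based target position per source element, with 0 meaning "no counterpart". This reading follows the description of Figure 1; the dictionary layout and the function name `aligned_pairs` are assumptions made for illustration, not the format used by the MDT code.

```python
# Hypothetical in-memory form of the Figure 1 entry.
ENTRY = {
    "source": ["the", "end", "of", "the", "world"],
    "target": ["el", "fin", "del", "mundo"],
    "align": [1, 2, 3, 0, 4],  # 1-based target positions; 0 = unaligned
}

def aligned_pairs(entry):
    """Pair each source word with its aligned target word (None if unaligned)."""
    pairs = []
    for word, pos in zip(entry["source"], entry["align"]):
        pairs.append((word, entry["target"][pos - 1] if pos else None))
    return pairs

for source_word, target_word in aligned_pairs(ENTRY):
    print(source_word, "->", target_word)
```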

The lexicon-grammar tradeoff

A rudimentary lexicon with entries like the one in Figure 1 is simple in two senses: with the help of a simple interface, a user with no knowledge of linguistics or the grammar of either language can add entries in a straightforward manner, and the resulting entries are easily understood. Such a lexicon permits the translation of sentences consisting of verbatim combinations of the word forms in the group entries, as long as group order is preserved across the languages and there are no constraints between groups that would affect the form of the target-language words.

However, since it contains no grammatical information, such a lexicon permits no generalization to combinations of wordforms that are not explicit in group lexical entries. We are left with the need to include group entries for every reasonably possible combination of wordforms. Even when enormous bitext corpora are available, as for language pairs like English-Spanish, SMT researchers have discovered the need to incorporate some syntax in their systems.

At the other extreme from this simple lexicon is a full-blown grammar that is driven by the traditional linguistic concern (one might even say obsession) with maximum parsimony: every possible generalization must be "captured". Although it has the advantage of compactness and of possibly reflecting general principles of linguistic structure, such a grammar is difficult to write, to debug, and to understand, requiring significant knowledge of linguistics. As we have seen, abstract word-based grammars also miss the information that is inherent in words in context.

In the MDT project, the goal is to permit a range of possibilities along the continuum from purely lexical (and phrasal) to syntactic/grammatical, with the emphasis on ease of entry creation and interpretation.

Lexemes

We can achieve significant generalization over simple groups consisting of wordforms only by permitting lexemes in groups. As an example, consider the English group passV the buck, where passV is the verb lexeme pass. In order to make such a group usable, the lexicon also requires knowledge of morphology. This could take the form of a dedicated morphological analyzer or, for simple languages like English, a dictionary of word forms and their roots and grammatical features. For the purposes of this paper and English-Spanish translation, we assume the latter. Figure 2 shows several of these word form entries for English, along with the group entry.
groups:
  passV:
  - passV the buck
forms:
  pass:
  - root: passV, features: {prs: 1, tns: prs}
  - root: passV, features: {prs: 2, tns: prs}
  - root: passN, features: {num: sng}
  - root: passV, features: {prs: 3, num: plr, tns: prs}
  passes:
    root: passV, features: {prs: 3, num: sng, tns: prs}
  passed:
    root: passV, features: {tns: pst}

Figure 2. Group entry for pass the buck and three word form entries
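The dictionary-based analysis described above amounts to two lookups: from a word form to its possible roots and features, and from each root to the groups it heads. The Python sketch below mirrors the entries of Figure 2; the data layout and the names `analyze` and `candidate_groups` are illustrative assumptions rather than the project's actual API.

```python
# Form and group dictionaries mirroring Figure 2 (layout is an assumption).
FORMS = {
    "pass": [
        {"root": "passV", "features": {"prs": 1, "tns": "prs"}},
        {"root": "passV", "features": {"prs": 2, "tns": "prs"}},
        {"root": "passN", "features": {"num": "sng"}},
        {"root": "passV", "features": {"prs": 3, "num": "plr", "tns": "prs"}},
    ],
    "passes": [{"root": "passV", "features": {"prs": 3, "num": "sng", "tns": "prs"}}],
    "passed": [{"root": "passV", "features": {"tns": "pst"}}],
}
GROUPS = {"passV": [["passV", "the", "buck"]]}

def analyze(word):
    """Return every analysis of a word form; unknown words analyze as themselves."""
    return FORMS.get(word, [{"root": word, "features": {}}])

def candidate_groups(sentence):
    """Collect the groups headed by any root found among the word analyses."""
    found = []
    for word in sentence.split():
        for analysis in analyze(word):
            for group in GROUPS.get(analysis["root"], []):
                if group not in found:
                    found.append(group)
    return found

print(candidate_groups("Carl passes the buck"))
```

Note that the form pass is ambiguous between the verb and the noun lexeme, so a lookup returns all four analyses and leaves the choice to later processing.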

Because the entry for pass the buck accommodates multiple sequences of English word forms, there needs to be a way to map these onto the appropriate sequences in the target language. In MDT, the simplest way to accomplish this is a set of pairs of agreement features for the lexeme that constrain the corresponding target language form to agree with the source form on those features. In Figure 3, we see agr attributes for the translation, none for the words the and buck, and for the head passV, agreement between the tense and tiempo, person and persona, and number and número features. For example, if this group is activated in the translation of the sentence Carl passes the buck, the head of the Spanish translation of the group will be constrained to be third person singular present tense (tiempo): Carl escurre el bulto.
passV:
- passV the buck
  ->spa
    - [escurrir_el_bulto,
       {align: [1,2,3], agr: [{tns: tmp, prs: prs, num: num}, 0, 0]}]

Figure 3. Group entry for pass the buck with Spanish translation
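The effect of the agr attribute on the head can be sketched as a renaming of feature names from source to target. In the Python fragment below, the mapping is taken from Figure 3, while the function name `transfer_features` is hypothetical and feature values are simply copied unchanged, which is a simplifying assumption (any conversion between value inventories of the two languages is left out).

```python
# Source feature name -> target feature name, as in the agr entry of Figure 3.
AGR = {"tns": "tmp", "prs": "prs", "num": "num"}

def transfer_features(source_features, agr=AGR):
    """Copy each agreeing source feature onto its target-language counterpart."""
    return {agr[name]: value for name, value in source_features.items() if name in agr}

# Features of "passes" in "Carl passes the buck".
print(transfer_features({"prs": 3, "num": "sng", "tns": "prs"}))
```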

Lexical/grammatical categories

Another simple way to generalize across groups is to introduce syntactic or semantic categories. Consider again the English expression give somebody a piece of one's mind. We can generalize across specific word sequences such as gave me a piece of his mind and gave them a piece of my mind by replacing the specific word forms in positions 2 and 6 in the group with categories that include the wordforms that can fill those positions. This requires the form dictionary to record the categories that wordforms belong to. Figure 4 shows how the entry for give somebody a piece of one's mind would record this information. Category names are preceded by $.

groups:
  giveV:
  - giveV $sbd a piece of $sbds mind
    ->spa
      - [cantar_$algn_las_cuarenta,
         {align: [1,2,3,4,0,0,0],
          agr: [{tns: tmp, prs: prs, num: num}, 0,0,0,0,0,0]}]
  my:
  - my
    ->spa
      - [mi]
      - [mis]
  mayor:
  - the mayor
forms:
  my:
  - cats: [$sbds]
  mayor:
  - cats: [$sbd]

Figure 4. Group entry for give somebody a piece of one's mind and a few associated form entries.

Because group positions that are filled by categories do not specify a surface form, for parsing and generation of sentences, they must be merged with other groups that match the category and do specify a form. For example, to parse the sentence I gave the mayor a piece of my mind requires that positions 2 and 6 in the group giveV_$sbd_a_piece_of_$sbds_mind be filled by the heads of the groups the_mayor and my. This merging process is illustrated in Figure 5.

Figure 5. Merging of three groups in gave the mayor a piece of my mind.
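A minimal sketch of the merge step, using the category information from Figure 4, is given below. It fills each category position with the head of another group whose form entry lists that category. The function name `merge` and the first-match strategy are assumptions made for illustration; the actual system resolves merging through constraint satisfaction.

```python
# Categories recorded for word forms, as in the forms section of Figure 4.
CATS = {"mayor": ["$sbd"], "my": ["$sbds"]}
GROUP = ["giveV", "$sbd", "a", "piece", "of", "$sbds", "mind"]

def merge(group, filler_heads, cats=CATS):
    """Fill each category slot with the first unused head belonging to it.

    Returns None if some category slot cannot be filled.
    """
    remaining = list(filler_heads)
    merged = []
    for element in group:
        if element.startswith("$"):
            head = next((h for h in remaining if element in cats.get(h, [])), None)
            if head is None:
                return None
            remaining.remove(head)
            merged.append(head)
        else:
            merged.append(element)
    return merged

# Heads of the groups "the mayor" and "my".
print(merge(GROUP, ["mayor", "my"]))
```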

This group illustrates another requirement of some groups containing categories. In give somebody a piece of one's mind, the possessive adjective in the place of one's must agree with the subject of the sentence. Since the group contains no subject, we constrain it to agree with the person and number of the verb. Thus the entry for this group also contains the agreement attribute: agr: [[1, 6, [prs, prs], [num, num]]]. This states that the sixth group element agrees with the first on person and number features.
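One way to read this attribute is as a list of constraints of the form [i, j, [feature of i, feature of j], ...] over 1-based group positions. The Python sketch below checks such constraints against the features of the words assigned to each position. The function name `agrees` is hypothetical, and treating a feature that is missing on either word as compatible is an assumption, not documented MDT behavior.

```python
# Within-group agreement: element 6 agrees with element 1 on person and number.
AGR = [[1, 6, ["prs", "prs"], ["num", "num"]]]

def agrees(element_features, agr=AGR):
    """Check every agreement constraint over 1-based group positions."""
    for i, j, *pairs in agr:
        first, second = element_features[i - 1], element_features[j - 1]
        for name_i, name_j in pairs:
            # Assumption: an unspecified feature never causes a clash.
            if name_i in first and name_j in second and first[name_i] != second[name_j]:
                return False
    return True

# "give ... my mind" with a first person singular verb form.
features = [{"prs": 1, "num": "sng"}, {}, {}, {}, {}, {"prs": 1, "num": "sng"}, {}]
print(agrees(features))
```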

Constraint satisfaction and translation

Translation in MDT takes place in three phases: analysis, transfer, and realization. Analysis of the source-language sentence begins with morphological analysis, either through a dedicated morphological analyzer for the source language or a lexical lookup of the wordforms in the source language forms dictionary. The words or lexemes resulting from this first pass are then used to look up candidate groups in the groups dictionary. Next the system assigns a set of groups to the input sentence. A successful group assignment satisfies several constraints:

  1. Each word in the input sentence is assigned to zero, one, or (in the case of node merging) two group elements.
  2. Each element in a selected group is assigned to one word in the sentence.
  3. For each selected group, within-group agreement restrictions are satisfied.
  4. Each category element in a selected group is merged with a non-category element in another selected group (see the two examples in Figure 4).

Analysis is a robust process; some words in the input sentence may end up unassigned to any group.

Analysis is implemented in the form of constraint satisfaction, making use of insights from the Extensible Dependency Grammar framework (XDG)5. Although considerable source-sentence ambiguity is eliminated because groups incorporate context, ambiguity is still possible, particularly for figurative expressions that also have a literal interpretation. In this case, the constraint satisfaction process undertakes a search through the space of possible group assignments, creating an analysis for each successful assignment.
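The search over group assignments can be illustrated with a brute-force enumeration. The Python sketch below is not a constraint solver and not the XDG-based implementation; it assumes groups made only of literal word forms that match contiguously, ignores lexemes, categories, and merging, and keeps only assignments to which no further group could be added. The groups for kicked the bucket are invented for the example.

```python
from itertools import combinations

def matches(group, words):
    """Start positions at which the group's word forms occur contiguously."""
    n = len(group)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == group]

def assignments(words, groups):
    """Enumerate maximal sets of group matches in which no word is used twice."""
    spans = [(tuple(g), frozenset(range(i, i + len(g))))
             for g in groups for i in matches(g, words)]
    results = []
    # Larger candidate sets are tried first, so every smaller set that is
    # contained in an accepted one can be discarded as non-maximal.
    for size in range(len(spans), 0, -1):
        for combo in combinations(spans, size):
            covered = [p for _, span in combo for p in span]
            if len(covered) != len(set(covered)):
                continue  # some word would belong to two groups
            if any(set(combo) < set(r) for r in results):
                continue  # contained in a larger assignment already found
            results.append(combo)
    return [[" ".join(g) for g, _ in combo] for combo in results]

words = "he kicked the bucket".split()
groups = [["kicked", "the", "bucket"], ["kicked"], ["the", "bucket"]]
print(assignments(words, groups))
```

For this sentence the enumeration yields two analyses, one using the figurative group and one combining the literal groups; the word he remains unassigned in both.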

During the transfer phase, a source-language group assignment is converted to an assignment of target-language groups. In this process some target-language items are assigned grammatical features on the basis of agreement constraints. For example, in the translation of the English sentence the mayor passes the buck, the Spanish verb that is the head of the group escurrir_el_bulto would be assigned the tense (tiempo), person and number features {tmp=prs, prs=3, num=1}: escurre. A source-language group may have more than one translation. The transfer phase creates a separate target-language group assignment for each combination of translations of the source-language groups.
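The last step, one target-language assignment per combination of translations, is a Cartesian product over the translation options of the selected groups. The Python sketch below uses translations from Figures 1 and 4; the flat dictionary `TRANSLATIONS` and the function name `target_assignments` are assumptions for illustration.

```python
from itertools import product

# Source group -> possible target groups (taken from Figures 1 and 4).
TRANSLATIONS = {
    "my": ["mi", "mis"],
    "the end of the world": ["el fin del mundo"],
}

def target_assignments(source_groups, translations=TRANSLATIONS):
    """Return one target-group assignment per combination of translations."""
    options = [translations[group] for group in source_groups]
    return [list(choice) for choice in product(*options)]

print(target_assignments(["my", "the end of the world"]))
```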

During the realization phase, for each target-language group assignment, surface forms are generated based on the lexemes and grammatical features that resulted from the transfer phase. This is accomplished either through a dedicated morphological generator for the target language or a dictionary that maps target-language lexemes and feature sets to surface forms. Finally, target-language words are sequenced in a way that satisfies word-order conditions in target-language groups. The sequencing process is implemented with constraint satisfaction.
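The dictionary-based variant of realization can be sketched as a lookup keyed on a lexeme and its feature set. In the Python fragment below, the single entry reproduces the escurre example from the transfer phase; the table `GENERATION`, the function name `realize`, and the fallback of returning the lexeme unchanged when no entry is found are all assumptions.

```python
# (lexeme, feature set) -> surface form; features use the notation of the
# transfer example: tmp=prs, prs=3, num=1.
GENERATION = {
    ("escurrir", frozenset({("tmp", "prs"), ("prs", 3), ("num", 1)})): "escurre",
}

def realize(lexeme, features, table=GENERATION):
    """Look up the surface form; fall back to the lexeme itself (an assumption)."""
    return table.get((lexeme, frozenset(features.items())), lexeme)

words = [realize("escurrir", {"tmp": "prs", "prs": 3, "num": 1}),
         realize("el", {}), realize("bulto", {})]
print(" ".join(words))
```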

Related work

Our goals are similar to those of the Apertium project.6 As with Apertium, we are developing open-source, rule-based systems for MT, and we work within the framework of relatively shallow, chunking grammars. We differ mainly in our willingness to sacrifice linguistic coverage to achieve our goals of flexibility, robustness, and transparency. We accommodate a range of lexical-grammatical possibilities, from the completely lexical on the one extreme to phrasal units consisting of a single lexeme and one or more syntactic/semantic categories on the other, and we are not so concerned that MDT grammars will accept many ungrammatical source-language sentences or even output ungrammatical (along with grammatical) translations.

In terms of long-term goals, MDT also resembles the Expedition project,7 which makes use of knowledge acquisition techniques and naive monolingual informants to develop MT systems that translate low-resource languages into English. Our project differs, first, in assuming bilingual informants and, second, in aiming to develop systems that are unrestricted with respect to target language. In fact we are more interested in MT systems with low-resource languages as target languages because of the lack of documents in such languages.

Although MDT is not intended as a linguistic theory, it is worth mentioning which theories it has the most in common with. Like Construction Grammar8 and Frame Semantics,9 it treats linguistic knowledge as essentially phrasal. Like synchronous context-free grammar (SCFG)10, it associates multi-word units in two languages, aligning the elements of the units and representing word order within each. MDT differs from SCFG in having nothing like rewrite rules or non-terminals. MDT belongs to the family of dependency grammar (DG) theories because the heads of its phrasal units are words or lexemes rather than non-terminals. It shares most with those computational DG theories that rely on constraint satisfaction.5,11,12,13 However, it remains an extremely primitive form of DG, permitting only flat structures with unlabeled arcs and no relations between groups other than through the merge operation described above. This means that complex grammatical phenomena such as long-distance dependencies and word-order variability can only be captured through specific groups.

Status of project, ongoing and future work

The code for MDT and a set of lexical-grammatical examples are available at https://github.com/hltdi/mainumby under the GPL license. To date, we have only tested the framework on a limited number of translations using various language pairs. In order to develop more complete lexicon-grammars for Amharic-Oromo, Amharic-English and Spanish-Guarani, we are working on methods for automatically extracting groups from dictionaries in various formats and from the limited bilingual data that are available. As a part of this work, it will be crucial to determine whether it is simpler to extract MDT groups from data than to extract grammars of other sorts, for example, SCFG. We are also implementing a GUI that will allow naive bilingual users to create MDT entries. Again we will want to evaluate the framework with respect to the simplicity of entry creation. For the longer term, our goal is tools for the intelligent elicitation of lexical entries; for example, when two entries resemble one another, users could be queried about the value of collapsing them into a more abstract entry.

As far as the grammatical framework is concerned, the lack of dependencies between the heads of groups leaves the system without the capacity to represent some agreement constraints, for example, agreement between a verb+object group and the verb's subject, or major constituent order differences between source and target language. (The only way to implement such constraints in the current version of MDT is through groups that incorporate, for example, subjects in verb-headed groups, as in $sbd kickV $sth.) To alleviate this problem, we will be implementing dependencies between group heads, much as in the "interchunk module" of Apertium.

Conclusions

Relatively sophisticated computational grammars, parsers, and/or generators exist for perhaps a dozen languages, and usable MT systems exist for at most dozens of pairs of languages. This leaves the great majority of languages and the communities who speak them relatively even more disadvantaged than they were before the digital revolution. What is called for are methods that can be quickly and easily deployed to begin to record the grammars and lexica of these languages and to use these tools for the benefit of the linguistic communities. The MDT project is designed with these needs in mind. Though far from achieving our ultimate goals, we have developed a simple, flexible, and robust framework for bilingual lexicon-grammars and MT/CAT that we hope will be a starting point for a large number of under-resourced languages.

References

1Becker, J. 1975. The phrasal lexicon. In R. Schank and B. Nash-Webber, editors, Theoretical Issues in Natural Language Processing, pp. 38-41. Association for Computational Linguistics.

2,8Steels, L. (Ed.). 2011. Design Patterns in Fluid Construction Grammar. John Benjamins, Amsterdam.

3,9Fillmore, C.J. and Baker, C.F. 2001. Frame semantics for text understanding. In Proceedings of WordNet and Other Lexical Resources Workshop, Pittsburgh, NAACL.

4Osborne, T., Putnam, M., and Gross, T. 2012. Catenae: Introducing a novel unit of syntactic analysis. Syntax, 15(4): 354-396.

5Debusmann, R. 2007. Extensible Dependency Grammar: a Modular Grammar Formalism Based on Multigraph Description. Ph.D. thesis, Universität des Saarlandes.

6Forcada, M.L., Ginestí-Rosell, M., Nordfalk, J., O'Regan, J., Ortiz-Rojas, S., Pérez-Ortiz, J.A., Sánchez-Martínez, F., Ramírez-Sánchez, G., and Tyers, F.M. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine Translation, 25(2): 127-144.

7McShane, M., Nirenburg, S., Cowie, J., and Zacharski, R. 2002. Embedding knowledge elicitation and MT systems within a single architecture. Machine Translation, 17: 271-305.

10Chiang, D. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2): 201-228.

11Bojar, O. 2005. Problems of inducing large coverage constraint-based dependency grammar. In H. Christiansen et al. (Eds.), Constraint Solving and Language Processing, First International Workshop (CSLP 2004), pp. 90-103. Berlin. Springer Verlag.

12Foth, K. and Menzel, W. 2006. Hybrid parsing: Using probabilistic models as predictors for a symbolic parser. In Proceedings of the Annual Conference of the Association for Computational Linguistics, Sydney, Australia.

13Wang, W. and Harper, M. 2004. A statistical constraint dependency grammar (CDG) parser. In Proceedings of ACL04 Incremental Parsing Workshop, Barcelona, Spain.