<?xml version="1.0" standalone="yes"?>
<Paper uid="J97-3001">
  <Title>A Rule-based Hyphenator for Modern Greek</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Hyphenator programs in modern typesetting systems are necessary to eliminate excess space between adjacent words in texts. Word hyphenation could be bypassed by stretching out this space, but this would affect the appearance of the document. A hyphenator program takes as input a word and returns the set of points within the word where hyphens are permissible. Word hyphenation depends strictly on the target natural language, and many of the problems encountered are language specific.</Paragraph>
    <Paragraph position="1"> In general, machine hyphenation can be achieved either by consulting lists of hyphenated words or by developing pattern-based hyphenation programs (Liang 1983; Knuth 1986). The first approach ensures complete and correct hyphenation, but it has the disadvantage of being incapable of hyphenating words not on the list. In particular, for highly inflectional languages, such as Greek, these word lists would have to be extremely extensive in order to include all possible inflectional and derivational word forms. Even if such lists could be generated, it would be impossible to include words such as compounds, which can be readily created, or all proper names. In addition, the initial step toward the development of lists of hyphenated words is commonly rule-based hyphenation. On the other hand, although the second approach does not raise such problems, it has the disadvantage of being unable to guarantee complete and accurate hyphenation.</Paragraph>
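The trade-off between the two approaches can be sketched in code. The following Python fragment is a minimal illustration, not the method of this paper: HYPHENATED_WORDS, rule_based_points, hyphenation_points, and show are hypothetical names, the single word-list entry is made up for the example, and the rule-based fallback is a placeholder that proposes nothing.

```python
# Hypothetical word list: each word maps to the set of character offsets
# before which a hyphen is permissible. The single entry is illustrative only.
HYPHENATED_WORDS = {
    "hyphenation": frozenset({2, 6}),  # hy-phen-ation
}


def rule_based_points(word: str) -> frozenset:
    """Placeholder for a pattern- or rule-based hyphenator (assumption).

    A real implementation would match subword patterns; this stub is
    deliberately conservative and proposes no hyphens at all.
    """
    return frozenset()


def hyphenation_points(word: str) -> frozenset:
    """Return the offsets in `word` before which a hyphen may be inserted."""
    listed = HYPHENATED_WORDS.get(word.lower())
    if listed is not None:
        # List lookup: complete and correct, but only for listed words.
        return listed
    # Unlisted word (new compound, proper name, rare inflection): fall back.
    return rule_based_points(word)


def show(word: str) -> str:
    """Render `word` with a hyphen at every permissible point."""
    points = hyphenation_points(word)
    return "".join(("-" if i in points else "") + ch for i, ch in enumerate(word))
```

With this sketch, show("hyphenation") yields "hy-phen-ation", while any word missing from the list comes back unhyphenated. That is the coverage gap described above: the list is reliable only for the words it contains.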
    <Paragraph position="2"> The aim of the present study has been to analytically examine Modern Greek hyphenation in order to develop a pattern-based hyphenator. The requirement specifications are defined as follows: (i) to strictly prohibit impermissible hyphen generation; (ii) to generate a hyphen list that is as complete as possible.</Paragraph>
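The two requirements can be read as set conditions on the proposed hyphen points. The sketch below is an illustration under assumptions, not the evaluation procedure of this study: check_requirements is a hypothetical helper, and it presumes that a reference set of permissible points is available for the word being checked.

```python
def check_requirements(proposed: set, permissible: set) -> tuple:
    """Compare proposed hyphen points against a reference set (assumed given).

    Returns (is_safe, completeness):
      is_safe      -- requirement (i): no proposed point is impermissible.
      completeness -- requirement (ii): fraction of permissible points found.
    """
    is_safe = proposed.issubset(permissible)
    if not permissible:
        # A word with no permissible hyphens is trivially fully covered.
        return is_safe, 1.0
    found = proposed.intersection(permissible)
    return is_safe, len(found) / len(permissible)
```

For example, check_requirements({2}, {2, 6}) returns (True, 0.5): the proposal is safe but only half complete. Requirement (i) is treated as a hard constraint, whereas requirement (ii) is a quantity to be maximized.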
    <Paragraph position="3"> Existing hyphenator programs meet the first requirement either by decreasing the number of proposed hyphens or by establishing stop lists containing the appropriately hyphenated exceptional words. Commonly, fulfilling the second requirement depends on the development of extensive subword patterns associated with hyphenation rules, as in Liang 1983, for example. Establishing lists of exceptions has the same disadvantages as the approach to hyphenating through consulting lists of hyphenated words, and thus hyphenator dependencies on lists of exceptions must be restricted as much as possible. [Author affiliation: Computer Technology Institute, 3, Kolokotroni str., 26 223 Patras, Greece. E-mail: noussia@cti.gr. © 1997 Association for Computational Linguistics. Computational Linguistics, Volume 23, Number 3.]</Paragraph>
    <Paragraph position="4"> Native Greek speakers are able to hyphenate most Greek words fully and unambiguously. In extreme cases, they will propose two hyphen sets for the same word, one being a proper subset of the other, but both being acceptable. However, complete automatic hyphenation is a rather complex task. Although consonant splitting is clearly determined by the grammar rules of Modern Greek and is thus easily expressed in terms of non-exceptional formal patterns associated with specific hyphenation rules, vowel splitting is not. The main problem of vowel splitting is that the grammar indicates the cases where splitting is not allowed, and the splitting of a large number of these cases is ambiguous. In addition, Greek vowels are sometimes accented, so ambiguity resolution concerns thousands of vowel sequences.</Paragraph>
    <Paragraph position="5"> Existing hyphenator programs for Modern Greek are available as either commercial or research-based products and usually work on a minimal basis, i.e., finding only hyphenation points of consonant sequences. Some research-based versions, including one application of the TeX (Knuth 1986) hyphenator for Greek, achieve improved hyphenation but cover a minimal subset of the vowel sequences.</Paragraph>
    <Paragraph position="6"> The present paper will encapsulate the standard grammar hyphenation rules and the general principles used in this study. The paper expresses these rules (which focus mainly on consonant sequences) formally and points out their limitations in terms of formal word expressions that can be completely and correctly hyphenated.</Paragraph>
    <Paragraph position="7"> The paper then turns to the problem of vowel splitting, and, by formally examining prohibitive grammar rules, deduces general hyphenation rules. It presents additional heuristic rules discovered during an exhaustive search of ambiguous vowel patterns, and demonstrates the degree of the resolved ambiguity in terms of the number of vowel sequences that have been disambiguated. Implementation issues are discussed, as well as the problem of words written in uppercase letters. Finally, the paper outlines the potential for generalization to other languages.</Paragraph>
  </Section>
</Paper>