<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0106">
  <Title>Induction of a Simple Morphology for Highly-Inflecting Languages</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> We have produced gold standard segmentations with marked morpheme boundaries for 1.4 million Finnish and 36 000 English word forms. We evaluate the segmentations produced by our splitting algorithm against the gold standard, and compute precision and recall on discovered morpheme boundaries. Precision is the proportion of correct boundaries among all morph boundaries suggested by the algorithm. Recall is the proportion of correct boundaries discovered by the algorithm in relation to all morpheme boundaries in the gold standard.</Paragraph>
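    The two measures defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation script: the function names, the '+' separator, and the choice to pool boundary counts over all words before dividing are assumptions made for the example.

```python
def boundary_positions(segmentation, sep="+"):
    """Return the set of character offsets at which a morph boundary falls.

    For example, 'invite+s' gives {6}: one boundary after the sixth letter.
    """
    positions = set()
    offset = 0
    morphs = segmentation.split(sep)
    for morph in morphs[:-1]:  # there is no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions


def precision_recall(predicted, gold):
    """Score predicted segmentations against gold ones.

    Both arguments map a word form to a single segmentation string.
    Counts are pooled over all words (an assumption of this sketch).
    """
    correct = suggested = expected = 0
    for word, gold_seg in gold.items():
        gold_b = boundary_positions(gold_seg)
        # A word missing from the predictions is treated as unsegmented.
        pred_b = boundary_positions(predicted.get(word, word))
        correct += len(gold_b.intersection(pred_b))
        suggested += len(pred_b)
        expected += len(gold_b)
    precision = correct / suggested if suggested else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Hypothetical toy data, chosen only to exercise the functions.
gold = {"inviting": "invit+ing", "evenings": "evening+s", "star": "star"}
predicted = {"inviting": "invit+ing", "evenings": "even+ing+s", "star": "star"}
precision, recall = precision_recall(predicted, gold)
```

    In this toy data the algorithm suggests three boundaries, two of which are in the gold standard, and it finds both gold boundaries, so precision is 2/3 and recall is 1.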
    <Paragraph position="1"> The gold standard was created semiautomatically, by first running all words through a morphological analyzer based on the two-level morphology of Koskenniemi (1983). For each word form, the analyzer outputs the base form of the word together with grammatical tags indicating, e.g., the part-of-speech, case, or derivational type of the word form. In addition, the boundaries between the constituents of compound words are often marked. We thoroughly investigated how the grammatical tags correspond to morphemes and created a rule set that segments the original word forms using the output of the analyzer.</Paragraph>
    <Paragraph position="2"> Because a word can have several plausible segmentations, we supplied alternatives where needed, e.g., English 'evening' (time of day) vs. 'even+ing' (verb). We also introduced so-called "fuzzy" boundaries between stems and endings, which allow a letter to belong to either the stem or the ending when both alternatives are reasonable, e.g., English 'invite+s' vs. 'invit+es' (cf. 'invit+ing'), or Finnish 'tähde+n' vs. 'tähd+en' ("of the star"; the base form is 'tähti').</Paragraph>
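    One way to represent alternative and fuzzy segmentations is to store a list of acceptable gold segmentations per word form. The text does not say how the alternatives enter the scoring, so the sketch below assumes that the alternative sharing the most boundaries with the prediction is used. All names are hypothetical.

```python
def boundary_positions(segmentation, sep="+"):
    """Return the set of character offsets at which a morph boundary falls."""
    positions = set()
    offset = 0
    morphs = segmentation.split(sep)
    for morph in morphs[:-1]:  # there is no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions


def best_gold_alternative(predicted_seg, alternatives):
    """Pick the gold alternative that agrees best with the prediction.

    Assumption of this sketch: prefer the alternative with the most
    matched boundaries, then the one with the fewest gold boundaries.
    """
    pred_b = boundary_positions(predicted_seg)

    def score(alternative):
        gold_b = boundary_positions(alternative)
        matched = len(pred_b.intersection(gold_b))
        return (matched, -len(gold_b))

    return max(alternatives, key=score)


# Gold entries taken from the examples in the text; a fuzzy boundary is
# encoded here as two alternatives that differ in one letter.
gold_alternatives = {
    "evening": ["evening", "even+ing"],
    "invites": ["invite+s", "invit+es"],
}

chosen_invites = best_gold_alternative("invit+es", gold_alternatives["invites"])
chosen_evening = best_gold_alternative("evening", gold_alternatives["evening"])
```

    Under this assumption a prediction of 'invit+es' is scored against 'invit+es' and is not penalized for placing the fuzzy boundary one letter earlier than 'invite+s'.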
  </Section>
</Paper>