<?xml version="1.0" standalone="yes"?> <Paper uid="A92-1047"> <Title>Lexical Processing in the CLARE System</Title> <Section position="3" start_page="0" end_page="259" type="metho"> <SectionTitle> 3 CLARE's Processing Stages </SectionTitle> <Paragraph position="0"> The CLARE system is intended to provide language processing capabilities (both analysis and generation) and some reasoning facilities for a range of possible applications. English sentences are mapped, via a number of stages, into logical representations of their literal meanings, from which reasoning can proceed. Stages are linked by well-defined representations. The key intermediate representation is that of quasi logical form (QLF), a version of first order logic augmented with constructs for phenomena such as anaphora and quantification that can only be resolved by reference to context. The unifica- null tion of declarative linguistic data is the basic processing operation.</Paragraph> <Paragraph position="1"> In the analysis direction, CLARE's front end processing stages are as follows. A sentence is divided into a sequence of clusters separated by white space. Each cluster is then divided into one or more tokens: words (possibly inflected), punctuation characters, and other items. Tokenization is nondeterministic, and so a lattice is used at this and subsequent stages. Next, each token is analysed as a sequence of one or more segments.</Paragraph> <Paragraph position="2"> For normal lexical items, these segments are morphemes.</Paragraph> <Paragraph position="3"> The lexicon proper is first accessed at this stage. Various strategies for error recovery (including but not limited to spelling/typing correction) are then attempted on tokens for which no segmentation could be found. After this, edges without segmentations are deleted; if no complete path remains, sentence processing is abandoned. Further edges, possibly spanning non-adjacent vertices, are added to the lattice by the phrasal equivalence mechanism mentioned above. Finally, morphological, syntactic and semantic stages apply to produce one or more quasi logical forms (QLFs). These are checked for adherence to sortal (selectional) restrictions, and, possibly with the help of user intervention, one is selected for further processing. null</Paragraph> </Section> <Section position="4" start_page="259" end_page="259" type="metho"> <SectionTitle> 4 Segmentation and Spelling Correction </SectionTitle> <Paragraph position="0"> English inflectional morphology is sufficiently simple to allow CLARE to use a fairly simple affix-stripping approach to token segmentation. One major advantage of this is that spelling correction can be interleaved directly with it. Root forms in the lexicon are represented in a discrimination net for efficient access. When the spelling corrector is called to suggest possible corrections for a word, the number of simple errors (of deletion, insertion, substitution and transposition) to assume is given.</Paragraph> <Paragraph position="1"> NormM segmentation is just the special case of this with the number of errors set to zero. The mechanism non-deterministically removes affixes from each end of the word, postulating errors if appropriate, and then looks up the resulting string in the discrimination net, again considering the possibility of error.</Paragraph> <Paragraph position="2"> Interleaving correction with segmentation promotes efficiency in the following way. 
<Paragraph position="2"> Interleaving correction with segmentation promotes efficiency in the following way. As in most other correctors, only up to two simple errors are considered along a given search path. Therefore, either the affix-stripping phase or the lookup phase is fairly quick and produces a fairly small number of results, and so the two do not combine to slow processing down. Another beneficial consequence of the interleaving is that no special treatment is required for the otherwise awkward case where errors overlap morpheme boundaries; thus desigend is corrected to designed as easily as deisgned or designde are.</Paragraph>
<Paragraph position="3"> If one or more possible corrections to a token are found, they are preserved as alternatives for disambiguation at the later syntactic or semantic stages. The lattice representation allows multiple-word corrections (involving both the insertion and the deletion of spaces) to be preserved along with single-word ones. The choice is only finally made when a sortally coherent QLF is selected.</Paragraph> </Section>
<Section position="5" start_page="259" end_page="259" type="metho"> <SectionTitle> 5 An Evaluation </SectionTitle>
<Paragraph position="0"> To assess the usefulness of syntactico-semantic constraints in CLARE's spelling correction, the following experiment was carried out. Five hundred sentences falling within CLARE's current lexical and grammatical coverage were taken at random from the LOB corpus. Although CLARE's core lexicon is fairly small (1600 root forms), it consists of the more frequent words in the language, which tend to be fairly short and therefore have many candidate corrections if misspelled. The sentences were passed, character by character, through a channel which transmitted a character without alteration with probability 0.99, and with probability 0.01 introduced one of the four kinds of simple error. This process produced a total of 102 sentences that differed from their originals. The average length of these sentences was 6.46 words, and there were 123 corrupted tokens in all.</Paragraph>
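For illustration, a channel of this kind can be sketched as follows; the lower-case alphabet used for inserted and substituted characters and the uniform choice among the four error types are assumptions made for the example rather than details reported here.

    import random
    import string

    def corrupt(sentence, p_error=0.01, rng=random):
        """Pass a sentence through the noisy channel character by character."""
        chars = list(sentence)
        out = []
        i = 0
        while i < len(chars):
            ch = chars[i]
            if rng.random() >= p_error:
                out.append(ch)                    # transmitted unaltered (p = 0.99)
                i += 1
                continue
            kind = rng.choice(["deletion", "insertion", "substitution", "transposition"])
            if kind == "deletion":                # drop the character
                i += 1
            elif kind == "insertion":             # extra character typed before this one
                out.append(rng.choice(string.ascii_lowercase))
                out.append(ch)
                i += 1
            elif kind == "substitution":          # replace the character
                out.append(rng.choice(string.ascii_lowercase))
                i += 1
            else:                                 # transposition with the next character
                if i + 1 < len(chars):
                    out.append(chars[i + 1])
                    out.append(ch)
                    i += 2
                else:
                    out.append(ch)
                    i += 1
        return "".join(out)

    # e.g. corrupted = [corrupt(s) for s in sentences]; keep those that differ.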
<Paragraph position="1"> The corrupted sentence set was then processed by CLARE with only the spelling correction recovery method in force and with no user intervention. Up to two simple errors were considered per token. No domain-specific or context-dependent knowledge was used.</Paragraph>
<Paragraph position="2"> Of the 123 corrupted tokens, ten were corrupted into other known words, and so no correction was attempted.</Paragraph>
<Paragraph position="3"> Parsing failed in nine of these cases; in the tenth, the corrupted word made as much sense as the original out of discourse context. In three further cases, the original token was not among the corrections suggested. The corrections for two other tokens were not used because a corruption into a known word elsewhere in the same sentence caused parsing to fail.</Paragraph>
<Paragraph position="4"> Only one correction (the right one) was suggested for 59 of the remaining 108 tokens. Multiple-token correction, involving the manipulation of space characters, took place in 24 of these cases.</Paragraph>
<Paragraph position="5"> This left 49 tokens for which more than one correction was suggested, requiring syntactic and semantic processing for further disambiguation. The average number of corrections suggested for these 49 was 4.57. However, only an average of 1.69 candidates (including, because of the way the corpus was selected, all the right ones) appeared in QLFs satisfying selectional restrictions; thus over 80% of the wrong candidates were rejected. Treating all candidates as equally likely in the absence of frequency information, syntactic and semantic processing reduced the average entropy from 1.92 to 0.54, removing 72% of the uncertainty. Comparisons of parsing times showed that a lattice could be parsed many times faster than separate alternative strings when the problem token is towards the end of the sentence and/or has several syntactically plausible candidate corrections.</Paragraph>
<Paragraph position="6"> The corpus on which the experiment was carried out consisted only of sentences of which CLARE could parse the uncorrupted versions. However, the figures presented here give grounds to believe that false positives - a wrong &quot;correction&quot; causing a spurious parse of an unparsable original - should be rare. If the replacement of one word by another only rarely maps one sentence inside coverage to another, then a corresponding replacement on a sentence outside coverage should yield something within coverage even more rarely.</Paragraph> </Section> </Paper>