<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1057"> <Title>Robust Segmentation of Japanese Text into a Lattice for Parsing</Title> <Section position="3" start_page="390" end_page="390" type="intro"> <SectionTitle> 2 Recall vs. Precision </SectionTitle> <Paragraph position="0"> In this architecture, data is fed forward from one component to the next; hence, it is crucial that the base components (like the segmenter) generate a minimal number of omission errors.</Paragraph> <Paragraph position="1"> Since segmentation errors may affect subsequent components, it is convenient to divide these errors into two types: recoverable and non-recoverable.</Paragraph> <Paragraph position="2"> A non-recoverable error is one that prevents the syntax (or any downstream) component from arriving at a correct analysis (e.g., a missing record). A recoverable error is one that does not interfere with the operation of following components. An example of the latter is the inclusion of an extra record. 
This extra record does not (theoretically) prevent the parser from doing its job (although in practice it may, since it consumes resources).</Paragraph> <Paragraph position="3"> Using standard definitions of recall (R) and precision (P):</Paragraph> <Paragraph position="4"> $R = Seg_{correct} / Tag_{total}$, $P = Seg_{correct} / Seg_{total}$ </Paragraph> <Paragraph position="5"> where $Seg_{correct}$ and $Seg_{total}$ are the number of &quot;correct&quot; and total number of segments returned by the segmenter, and $Tag_{total}$ is the total number of &quot;correct&quot; segments from a tagged corpus, we can see that recall measures non-recoverable errors and precision measures recoverable errors.</Paragraph> <Paragraph position="6"> Since our goal is to create a robust NL system, it behooves us to maximize recall (i.e., make very few non-recoverable errors) in open text while keeping precision high enough that the extra records (recoverable errors) do not interfere with the parsing component.</Paragraph> <Paragraph position="7"> Achieving near-100% recall might initially seem to be a relatively straightforward task given a sufficiently large lexicon: simply return every possible record that is found in the input string. In practice, the mixture of scripts and flexible orthography rules of Japanese (in addition to the inevitable non-lexicalized words) make the task of identifying potential lexical boundaries an interesting problem in its own right.</Paragraph> </Section> </Paper>