<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1004"> <Title>Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax</Title> <Section position="8" start_page="27" end_page="29" type="concl"> <SectionTitle> 7 Evaluation </SectionTitle> <Paragraph position="0"> The precision and recall of the extraction of term variants are given in Table 4, where precision is the ratio of correct variants among the variants extracted and recall is the ratio of variants retrieved among the collocates. Results were obtained through a manual inspection of 1,579 Type 1 variants, 823 Type 2 variants, 3,509 Type 1 collocates, and 2,104 Type 2 collocates extracted from the [AGR] corpus and the AGROVOC term list.</Paragraph> <Paragraph position="1"> These results indicate a very high level of accuracy: 89.4% of the variants extracted by the system are correct. Errors generally correspond to a semantic discrepancy between a word and its morphologically derived form. For example, élevée pour un sol (literally: high for a soil) is not a correct variant of élevage hors sol (off-soil breeding) because élevée and élevage are morphologically related to two different senses of the verb élever: élevée derives from the meaning to raise whereas élevage derives from to breed. Recall is weaker than precision: only 75.2% of the possible variants are retrieved.</Paragraph> <Section position="1" start_page="27" end_page="29" type="sub_section"> <SectionTitle> Improvement of Indexing through Variant Extraction </SectionTitle> <Paragraph position="0"> For a better understanding of the importance of term expansion, we now compare term indexing with and without variant expansion.</Paragraph> <Paragraph position="1"> Examples of extracted variants, with English gloss and variation type: échange ionique (ionic exchange), N to A; cultures primaires de cellules (primary cell cultures), Modif.; propriétés physiques et chimiques (chemical and physical properties), Coor.; gestion de l'eau (management of the water), Comp.; eau et de l'évaporation de surface (water and of surface evaporation [incorrect variant]), Coor.; palmier à huile (palm tree [yielding oil]), N to N; initier des bourgeons (initiate buds), N to V.</Paragraph> <Paragraph position="2"> The [AGR] corpus has been indexed with the AGROVOC thesaurus in two different ways: 1. Simple indexing: extraction of occurrences of multi-word terms without considering variation. 2. Rich indexing: simple indexing improved with the extraction of variants of multi-word terms. Both indexings have been manually checked. Simple indexing is almost error-free but does not cover term variants. In contrast, rich indexing is slightly less accurate but its recall is much higher. Both methods are compared by calculating the effectiveness measure (Van Rijsbergen, 1975):</Paragraph> <Paragraph position="4"> $E_\alpha = 1 - \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}$, where $P$ and $R$ are precision and recall and $\alpha$ is a parameter which is close to 1 if precision is preferred to recall. The value of $E_\alpha$ varies from 0 to 1; $E_\alpha$ is close to 0 when all the relevant conflations are made and when no incorrect one is made.</Paragraph> <Paragraph position="5"> The effectiveness of rich indexing is more than three times better than that of simple indexing. Retrieved variants increase the number of indexing items by 28.8% (17.3% Type 1 variants and 11.5% Type 2 variants). Thus, term variant extraction is a significant expansion factor for identifying morphologically and syntactically related multi-word terms in a document without introducing undesirable noise.</Paragraph> <Paragraph position="6"> As for performance, the parser is fast enough to process large amounts of textual data, owing to several optimization devices. On a Pentium 133 running Linux, the parser processes 18,100 words/min from an initial list of 4,300 terms.</Paragraph> <Paragraph position="7"> Conclusion. This paper has proposed a syntax-based approach via morphologically derived forms for the identification and extraction of multi-word term variants.
By using a list of controlled terms coupled with a syntactic analyzer, the method is more precise than traditional text simplification methods. Iterative experimental tuning has resulted in a wide-coverage linguistic description incorporating the most frequent linguistic phenomena.</Paragraph> <Paragraph position="8"> Evaluations indicate that, by accounting for term variation using corpus tagging, morphological derivation, and transformation-based rules, 28.8% more indexing items can be identified than with a traditional indexer which cannot account for variation. Applications to be explored in future research involve incorporating the system into the indexing module of an IR system, in order to accurately measure improvements in system coverage as well as areas of possible degradation. We also plan to explore the analysis of semantic variants through a predicative representation of term semantics. Our results so far indicate that using computational linguistic techniques for carefully controlled term expansion will permit at least a three-fold expansion in coverage over traditional indexing, which should improve retrieval results accordingly.</Paragraph> </Section> </Section> </Paper>