File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/01/w01-0513_intro.xml
Size: 4,487 bytes
Last Modified: 2025-10-06 14:01:11
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0513"> <Title>Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Previous Approaches </SectionTitle> <Paragraph position="0"> For decades, researchers have explored various techniques for identifying interesting collocations.</Paragraph> <Paragraph position="1"> There have essentially been three separate kinds of approaches for accomplishing this task. These approaches can be broadly classified as (1) segmentation-based, (2) word-based and knowledge-driven, or (3) word-based and probabilistic. We will illustrate strategies that have been attempted in each of the approaches. Since we assume knowledge of whitespace, and since many approaches in the first category and all in the second rely upon human input, we will be most interested in the third category.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Segmentation-driven Strategies </SectionTitle> <Paragraph position="0"> Some researchers view MWU-finding as a natural by-product of segmentation. One can regard text as a stream of symbols and segmentation as a means of placing delimiters in that stream so as to separate logical groupings of symbols from one another. A segmentation process may find that a symbol stream should not be delimited even though subcomponents of the stream have been seen elsewhere. In such cases, these larger units may be MWUs.</Paragraph> <Paragraph position="1"> The principal work on segmentation has focused either on identifying words in phonetic streams (Saffran et al., 1996; Brent, 1996; de Marcken, 1996) or on tokenizing Asian and Indian languages that do not normally include word delimiters in their orthography (Sproat et al., 1996; Ponte and Croft, 1996; Shimohata, 1997; Teahan et al., 2000; and many others). 
Such efforts have employed various strategies for segmentation, including hidden Markov models, minimum description length, dictionary-based approaches, probabilistic automata, transformation-based learning, and text compression. Some of these approaches require significant sources of human knowledge, though others, especially those that follow data compression or HMM schemes, do not.</Paragraph> <Paragraph position="2"> These approaches could be applied to languages where word delimiters exist (such as European languages, in which words are delimited by the space character).</Paragraph> <Paragraph position="3"> However, in such languages, it seems more prudent to simply take advantage of delimiters rather than introducing potential errors by trying to find word boundaries while ignoring knowledge of the level and identify appropriate word combinations.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Word-based, knowledge-driven Strategies </SectionTitle> <Paragraph position="0"> Some researchers start with words and propose MWU induction methods that make use of parts of speech, lexicons, syntax or other linguistic structure (Justeson and Katz, 1995; Jacquemin et al., 1997; Daille, 1996). For example, Justeson and Katz indicated that the patterns NOUN NOUN and ADJ NOUN are very typical of MWUs. Daille also suggests that in French, technical MWUs follow patterns such as &quot;NOUN de NOUN&quot; (1996, p. 50). Finding word combinations that satisfy such patterns requires, in both cases, a lexicon equipped with part-of-speech tags. Since we are interested in knowledge-free induction of MWUs, these approaches are less directly related to our work. 
Furthermore, we are interested not in identifying constructs such as general noun phrases, which the above rules might generate, but only in finding those collocations that one would typically need to define.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Word-based, Probabilistic Approaches </SectionTitle> <Paragraph position="0"> The third category assumes at most whitespace and punctuation knowledge and attempts to infer MWUs using word combination probabilities.</Paragraph> <Paragraph position="1"> Table 1 (see next page) shows nine commonly used probabilistic MWU-induction approaches. In the table, f and P signify frequency and probability</Paragraph> </Section> </Section> </Paper>