<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1055"> <Title>[Figure 1 residue: Application Features; Message; Message Reader; Morphological Analyzer; Lexical Pattern Matcher; SGML/Annotation Generator; Format &amp; SGML Handling; Identification of Entities; Output Entities]</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> APPROACHES IN MET (MULTI-LINGUAL ENTITY TASK) </SectionTitle> <Paragraph position="0"> language task for MET. BBN also participated in Chinese. We fielded two approaches. The first approach is pattern based and has the architecture shown in Figure 1. This approach was applied to both Chinese and Spanish. The same algorithms (rectangles in Figure 1) were used in both languages; the only component that differed was the New Mexico State University segmenter, used to find word boundaries in Chinese. The components common to both languages are the message reader, which handled the input format and SGML conventions via a declarative format description; the part-of-speech tagger (BBN POST); a lexical pattern matcher driven by knowledge bases of patterns and lexicons specific to each language; and the SGML annotation generator. Although not shown in Figure 1, an alias prediction algorithm was also shared by both languages, using patterns unique to each language.</Paragraph> <Paragraph position="1"> A second approach, based on statistical learning, was used to create a learned Spanish namefinder. One component is a training module that learns to recognize the MET categories from examples. The other, the understanding module, uses the model produced by training to predict the MET categories in new input sentences.</Paragraph> <Paragraph position="2"> Data annotated with the correct answers was provided by the government in its training materials. We also annotated some additional data. 
The current probability model is a hidden Markov model (HMM) that is more complex than those typically used in part-of-speech tagging and is therefore more general.</Paragraph> </Section> <Section position="2" start_page="0" end_page="465" type="metho"> <SectionTitle> 2. CHALLENGES AND STRENGTHS IN OUR APPROACH TO CHINESE </SectionTitle> <Paragraph position="0"> One of the challenges in processing Chinese is the difficulty of word segmentation. Segmentation seems more difficult in Chinese than in Japanese. In Japanese, changes in the character set used in running text can be used to detect many of the word boundaries.</Paragraph> <Paragraph position="1"> The use of the part-of-speech tagger was both a strength and a weakness in Chinese. The part-of-speech labels proved useful in finding boundaries, such as those between organization names and text that does not belong to any of the MET categories. However, part-of-speech labeling is more of a challenge in Chinese than in the other languages because of two factors: * Chinese has very little inflection and no capitalization, thereby offering less evidence to predict the category of an unknown word.</Paragraph> <Paragraph position="2"> * Because no large dictionary of Chinese words with parts of speech was available, a high percentage of the words in the text were unknown.</Paragraph> <Paragraph position="3"> Another strength and challenge in Chinese is the fact that several of the categories are interrelated. For instance, a location or a person name often marks the start of an organization name. In addition, different categories occur contiguously, so that correctly recognizing one category is needed to locate the others. For example, a location name, a title of a person, and a person name will often co-occur. This creates a challenge in getting started, since several of the patterns look for distributed categories. 
The strength is that once significant progress is made in one category, such as location names, it can contribute to improved performance in the other categories.</Paragraph> <Paragraph position="4"> The final general challenge is the lack of available linguistic resources for Chinese.</Paragraph> </Section> <Section position="3" start_page="465" end_page="465" type="metho"> <SectionTitle> 3. CHALLENGES AND STRENGTHS IN SPANISH </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="465" end_page="465" type="sub_section"> <SectionTitle> 3.1 Using Manually Constructed Patterns </SectionTitle> <Paragraph position="0"> One of the challenges was self-imposed: because we were interested in seeing how far the technology could go without purchased linguistics resources, we restricted ourselves to using only prelinguistics resources. Some of the techniques we used are therefore applicable to any language for which significant amounts of online text are available. Patrick Jost was very effective in mining available online data to find very large lists of person names, critical vocabulary items, and organization names. A second challenge was that we had very little effort to devote to the manual system in Spanish; in fact, after a certain point there was insufficient effort available to track the evolving set of guidelines for Spanish. 
One strength in the effort was that the presence of lower-case words in Spanish names (and the generally unreliable use of capitalization in the names) was handled straightforwardly by the patterns and did not pose the difficulty we had anticipated.</Paragraph> </Section> <Section position="2" start_page="465" end_page="465" type="sub_section"> <SectionTitle> 3.2 Using a Learned System </SectionTitle> <Paragraph position="0"> There were several pleasant surprises, each corresponding to a strength of the learned system as applied to Spanish.</Paragraph> <Paragraph position="1"> First, the learned system could be retrained in five to ten minutes, so changes to the model could be tested quickly. The fact that the government released the revised training data very late in the MET cycle did not pose a problem, since the system could be retrained so quickly on the updated training data.</Paragraph> <Paragraph position="2"> Second, the learned system and the model we used proved to be highly portable to a new language. The original training and understanding modules were not completed until the first half of March. Results were very positive in English. When we first trained and tested the same model in Spanish, the results were so encouraging that we decided in April to enter the learned system in MET.</Paragraph> <Paragraph position="3"> The third strength we found was the use of contextual probabilities to predict the likelihood of the next word and the next category from the previous word and the previous category.</Paragraph> <Paragraph position="4"> The major challenge is to make the resulting large statistical model more understandable to humans, so that intuitions can be used to improve it.</Paragraph> </Section> </Section> <Section position="4" start_page="465" end_page="465" type="metho"> <SectionTitle> 4. 
LESSONS LEARNED </SectionTitle> <Paragraph position="0"> We learned the following lessons: * High performance is possible using one approach across several languages.</Paragraph> <Paragraph position="1"> * Text can be mined using simple techniques (such as regular expression patterns) to find critical vocabulary items effectively.</Paragraph> <Paragraph position="2"> * The gap between manually constructed systems using patterns and learned systems is shrinking dramatically.</Paragraph> <Paragraph position="3"> * Probabilistic, learned approaches can be developed in a short amount of time.</Paragraph> <Paragraph position="4"> * Probabilistic finite state models, which had previously been successful in continuous speech recognition and in part-of-speech tagging, can be applied successfully to multilingual entity finding.</Paragraph> </Section> </Paper>