<?xml version="1.0" standalone="yes"?> <Paper uid="X96-1053"> <Title>MITRE: DESCRIPTION OF THE ALEMBIC SYSTEM AS USED IN MET</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> MITRE: DESCRIPTION OF THE ALEMBIC SYSTEM AS USED IN MET </SectionTitle> <Paragraph position="0"> {aberdeen, john, day, lynette, palmer, parann, mbv}@mitre.org Alembic is a comprehensive information extraction system that has been applied to a range of tasks.</Paragraph> <Paragraph position="1"> These include the now-standard components of the formal MUC evaluations: name tagging (NE in MUC-6), name normalization (TE), and template generation (ST). The system has also been used to help segment and index broadcast video, and for early experiments on variants of the co-reference identification task. (For details, see [1].) For MET, we were of course primarily concerned with the foundational name-tagging task; many downstream modules of the system were left unused.</Paragraph> <Paragraph position="2"> The punchline, as we see it, is that Alembic performed exceptionally well in all three of the MET languages despite having no native speakers of any of them on its development team. We were one of only two sites that attempted all three languages, and were the only group that used essentially the same body of code for all three tasks.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> RULE SEQUENCES </SectionTitle> <Paragraph position="0"> The crux of our approach is the use of rule sequences, a processing strategy that was recently popularized by Eric Brill for part-of-speech tagging [2]. In a rule sequence processor, the object is to sequentially relabel a body of text according to an ordered rule set. The rules are evaluated in order, and each rule is allowed to run to completion only once in the course of processing. The result is an iteratively improved labelling of the source text. 
In the name-tagging task, for example, the process begins with an approximate initial labelling, whose purpose is simply to find the rough boundaries of names and other MET-relevant forms, such as money.</Paragraph> <Paragraph position="1"> This rough labelling is then improved by applying a rule sequence. Individual rules refine the initial rough boundaries, determine the type of a phrase (person, location, etc.), or merge fragmented phrases into larger units. See Figure 1 below.</Paragraph> <Paragraph position="2"> The rules themselves are simple. The two below come from the actual sequence for Spanish MET.</Paragraph> <Paragraph position="3"> First, the initial labelling breaks the string into components on the basis of part-of-speech tags: &lt;none&gt;Asociación&lt;/none&gt; de &lt;none&gt;Mutuales Israelitas Argentinas&lt;/none&gt; The first rule searches for organizational head nouns, e.g., &quot;asociación&quot; and others, and marks any matching phrase as an organization (ORGEX in our local MET dialect). This yields the partial relabelling:</Paragraph> </Section> <Section position="3" start_page="0" end_page="461" type="metho"> <SectionTitle> MET-SPECIFIC DEVELOPMENT </SectionTitle> <Paragraph position="0"> In the course of MET, we ported the Alembic name tagger to all three of the target languages. We did so with essentially no guidance from native speakers of any of these languages. For Spanish, two of us collaborated to develop a rule sequence by hand; to this task, one of us brought two semesters of college Spanish, and the other brought fluency in French.</Paragraph> <Paragraph position="1"> With help from a good dictionary and atlas, we were able to understand the training texts well enough to grasp their critical semantics, or as much of the semantics as was needed for the purpose of name tagging. 
For Japanese, one of us taught himself to read Kanji at a fifth-grade level, and developed a name-tagging sequence through repeated scrutiny of the training texts. The lone Japanese-MET developer had only a passing understanding of the texts he was reading. The development process for him consisted largely of Kanji pattern-matching (as opposed to bona fide reading). Finally, for Chinese, we had not even the limited reading ability available for Japanese. Aside from date and money patterns, the entirety of the Chinese rule sequence was acquired through a machine learning process.</Paragraph> <Paragraph position="2"> Besides these rule sequences, several language-specific extensions were required to port Alembic to MET. As we needed to segment Chinese and Japanese texts into separate tokens, we adapted the NEW-JUMAN tagger/segmenter for Japanese, and the NMSU segmenter for Chinese. In addition, our Spanish system exploited a Spanish part-of-speech tagger that we had developed previously.</Paragraph> </Section> </Paper>