<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1219"> <Title>Smoothing Techniques for Language</Title> <Section position="4" start_page="111" end_page="111" type="metho"> <SectionTitle> * Word Formation Pattern (F_WFP): </SectionTitle> <Paragraph position="0"> The purpose of this feature is to capture capitalization, digitalization and other word formation information. In this paper, we use the same feature as Shen et al. (2003).</Paragraph> </Section> <Section position="5" start_page="111" end_page="111" type="metho"> <SectionTitle> * Morphological Pattern (F_MP): </SectionTitle> <Paragraph position="0"> Morphological information, such as prefixes and suffixes, is considered an important cue for terminology identification. As in Shen et al. (2003), we use a statistical method to obtain the most useful prefixes and suffixes from the training data.</Paragraph> </Section> <Section position="6" start_page="111" end_page="111" type="metho"> <SectionTitle> * Part-of-Speech (F_POS): </SectionTitle> <Paragraph position="0"> Since many of the words in biomedical entity names are in lowercase, capitalization information is less informative in the biomedical domain than in the newswire domain. Moreover, many biomedical entity names are descriptive and very long. Therefore, POS may provide useful evidence about the boundaries of biomedical entity names. In the baseline system, an out-domain POS feature trained on the Penn Treebank is applied.</Paragraph> </Section> <Section position="7" start_page="111" end_page="111" type="metho"> <SectionTitle> * Head Noun Trigger (F_HEAD): </SectionTitle> <Paragraph position="0"> The head noun, which is the major noun of a noun phrase, often describes the function or the property of the noun phrase. In this paper, we automatically extract unigram and bigram head nouns from the training data, and rank them by frequency. 
For each entity class, we select the top-ranked 50% of head nouns as head noun triggers.</Paragraph> </Section> <Section position="8" start_page="111" end_page="111" type="metho"> <SectionTitle> 2. Deep Knowledge Resources </SectionTitle> <Paragraph position="0"> Besides the widely used lexical-level features described above, we also explore the name alias phenomenon, the cascaded entity name phenomenon, the use of both a closed dictionary built from the training corpus and an open dictionary built from the database term list SwissProt and the alias list LocusLink, abbreviation resolution, and an in-domain POS feature trained on the GENIA corpus.</Paragraph> <Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.1 Name Alias Resolution </SectionTitle> <Paragraph position="0"> A novel name alias feature is proposed to resolve the name alias phenomenon. The intuition behind this feature is that relevant entities are referred to in many ways throughout a given text, so successful named entity recognition depends on determining when one noun phrase refers to the same entity as another noun phrase.</Paragraph> <Paragraph position="1"> During decoding, the entity names already recognized in the previous sentences of the document are stored in a list. When the system encounters an entity name candidate (e.g. a word with a special word formation pattern), a name alias algorithm (similar to that of Schwartz et al. 2003) is invoked to determine dynamically whether the candidate might be an alias of a previously recognized name in the list.</Paragraph> <Paragraph position="2"> The name alias feature F</Paragraph> </Section> </Section> <Section position="9" start_page="111" end_page="111" type="metho"> <SectionTitle> ALIAS </SectionTitle> <Paragraph position="0"> is represented as ENTITYnLm (L indicates the locality of the name alias phenomenon). 
Here, ENTITY indicates the class of the recognized entity name, n indicates the number of words in the recognized entity name, and m indicates the number of those words from which the name alias candidate is formed. For example, when the decoding process encounters the word &quot;TCF&quot;, it is proposed as an entity name candidate and the name alias algorithm is invoked to check whether it is an alias of a recognized named entity. If &quot;T cell Factor&quot; is a &quot;Protein&quot; name recognized earlier in the document, the word &quot;TCF&quot; is determined to be an alias of &quot;T cell Factor&quot; with the name alias feature Protein3L3, since it is formed by taking the initial letters of the three words in the protein name &quot;T cell Factor&quot;.</Paragraph> <Section position="1" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.2 Cascaded Entity Name Resolution </SectionTitle> <Paragraph position="0"> Shen et al. (2003) found that 16.57% of entity names in GENIA V3.0 have cascaded constructions, e.g.</Paragraph> <Paragraph position="1"> &lt;RNA&gt;&lt;DNA&gt;CIITA&lt;/DNA&gt; mRNA&lt;/RNA&gt;.</Paragraph> <Paragraph position="2"> Therefore, it is important to resolve this phenomenon.</Paragraph> <Paragraph position="3"> Here, a pattern-based module is proposed to resolve the cascaded entity names, while the above HMM is applied to recognize embedded entity names and non-cascaded entity names. 
In the GENIA corpus, we find that there are six useful patterns of cascaded entity name constructions:</Paragraph> <Paragraph position="5"> head noun In our experiments, all the rules for the above six patterns are extracted from the cascaded entity names in GENIA V3.0 to deal with the cascaded entity name phenomenon, where &lt;ENTITY&gt; above is restricted to the five categories in the shared task: Protein, DNA, RNA, CellLine and CellType.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.3 Abbreviation Resolution </SectionTitle> <Paragraph position="0"> While the name alias feature is useful for detecting the inter-sentential name alias phenomenon, it is unable to identify the intra-sentential name alias phenomenon: the intra-sentential abbreviation.</Paragraph> <Paragraph position="1"> Such abbreviations occur widely in the biomedical domain.</Paragraph> <Paragraph position="2"> In our system, we present an effective and efficient algorithm that recognizes intra-sentential abbreviations more accurately by mapping them to their full expanded forms. In the GENIA corpus, we observe that the expanded form and its abbreviation often occur together via parentheses. Generally, there are two patterns: &quot;expanded form (abbreviation)&quot; and &quot;abbreviation (expanded form)&quot;.</Paragraph> <Paragraph position="3"> Our algorithm is based on the fact that it is much harder to classify an abbreviation than its expanded form. Generally, the expanded form provides more evidence of the class than its abbreviation does. The algorithm works as follows: Given a sentence with parentheses, we use an algorithm similar to that of Schwartz et al. (2003) to determine whether the parentheses mark an abbreviation. If so, we remove the abbreviation and the parentheses from the sentence. After the sentence is processed, we restore the abbreviation with parentheses to its original position in the sentence. 
Then, if the expanded form is recognized as an entity name, the abbreviation is assigned the same class as the expanded form. Meanwhile, we also adjust the boundaries of the expanded form according to the abbreviation, if necessary. Finally, the expanded form and its abbreviation are stored in the list of biomedical entity names recognized in the document, to help resolve subsequent occurrences of the same abbreviation in the document.</Paragraph> </Section> <Section position="3" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 2.4 Dictionary </SectionTitle> <Paragraph position="0"> In our system, two different features are explored to capture the existence of an entity name in a closed dictionary and in an open dictionary. Here, the closed dictionary is constructed by extracting all entity names from the training data, while the open dictionary (~700,000 entries) is compiled from the database term list SwissProt and the alias list LocusLink. The closed dictionary feature is represented as ClosedENTITYn (here ENTITY indicates the class of the entity name and n indicates the number of words in the entity name), while the open dictionary feature is represented as Openn (here n indicates the number of words in the entity name; we do not differentiate the class of the entity name, since the open dictionary only contains protein/gene names and their aliases).</Paragraph> </Section> </Section> <Section position="10" start_page="111" end_page="111" type="metho"> <SectionTitle> 2.5 In-domain POS </SectionTitle> <Paragraph position="0"> We also examine the impact of using an in-domain POS feature in place of the out-domain POS feature, which is trained on the Penn Treebank. Here, the in-domain POS feature is trained on the GENIA corpus V3.02p.</Paragraph> </Section> </Paper>